CN117532633A - Language interactive robot capable of serving user


Info

Publication number
CN117532633A
Authority
CN
China
Prior art keywords
robot
user
text
module
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311800839.9A
Other languages
Chinese (zh)
Inventor
王旭
顾一啸
邓又豪
麦均
谢超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guizhou University
Original Assignee
Guizhou University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guizhou University filed Critical Guizhou University
Priority to CN202311800839.9A priority Critical patent/CN117532633A/en
Publication of CN117532633A publication Critical patent/CN117532633A/en
Pending legal-status Critical Current

Classifications

    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/005Manipulators for mechanical processing tasks
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1656Programme controls characterised by programming, planning systems for manipulators
    • B25J9/1661Programme controls characterised by programming, planning systems for manipulators characterised by task planning, object-oriented languages
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1679Programme controls characterised by the tasks executed
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J9/00Programme-controlled manipulators
    • B25J9/16Programme controls
    • B25J9/1694Programme controls characterised by use of sensors other than normal servo-feedback from position, speed or acceleration sensors, perception control, multi-sensor controlled systems, sensor fusion
    • B25J9/1697Vision controlled systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Robotics (AREA)
  • Mechanical Engineering (AREA)
  • Manipulator (AREA)

Abstract

The application discloses a language interactive robot that can serve a user, comprising: a frame; a lidar sensor located above the frame; an RGB-D camera located on one side of the frame, the lidar sensor and the RGB-D camera being able to sense the environment so that the user can customize the working range of the robot; a WIFI module located on the other side of the frame, opposite the RGB-D camera; a GPT model module for interacting with the user, comprising a microphone for receiving the user's speech; a speaker for playing the robot's speech; an ASR speech-to-text module for converting the user's speech into text as input data; a TTS text-to-speech module for converting text generated by the robot into the robot's speech for playback by the speaker; and a processor and a memory, wherein the memory has computer program instructions stored thereon. The robot achieves intelligent and efficient voice interaction with the user and more accurate mobile navigation.

Description

Language interactive robot capable of serving user
Technical Field
The present application relates generally to the technical field of intelligent service robots, and more particularly to a language interactive robot capable of serving users.
Background
With the aging of the population, society's demand for elderly-care services is also increasing. In the field of robotics, since the early 1990s a wide range of projects has focused on developing "in-situ care" robotic applications, extending such applications from general healthcare to home care. However, even after three decades of activity and many prototypes, there is still no commercialized robot that can support independent care in the home. This is mainly due to the inherent complexity of the tasks the robot must perform, the unpredictability of the home environment, and the unstructured nature of interactions with the user. A low degree of intelligence has long been an industry bottleneck for intelligent service robots; although such robots have been developed in China for years, the weakness of natural language processing technology remains a serious limitation. In view of this, how to give such robots a certain dialogue understanding capability and how to develop a more efficient robot motion control system are problems faced by those skilled in the art. Our goal is therefore to design an intelligent service robot for the elderly that (1) can understand a person's language and behavior, (2) can complete specific tasks according to a person's needs, and (3) can optimize itself and conforms to an environmentally friendly concept, while supporting natural voice interaction.
Disclosure of Invention
In view of the foregoing technical problems, the present disclosure proposes a language interactive robot capable of serving a user, wherein the robot includes: a frame; a lidar sensor located above the frame; an RGB-D (Red Green Blue-Depth) camera located on one side of the frame, the lidar sensor and the RGB-D camera being capable of environmental sensing so that the user can customize the working range of the robot; a WIFI module located on the other side of the frame, opposite the RGB-D camera, through which the robot can communicate wirelessly with a remote server; a GPT (Generative Pre-trained Transformer) model module for interacting with the user, the GPT model module comprising a microphone for receiving the user's speech; a speaker for playing the robot's speech; an ASR (Automatic Speech Recognition) speech-to-text module for converting the received speech of the user into text as input data; a TTS (Text-To-Speech) module for converting text generated by the robot into the robot's speech for the speaker to play; and a processor and a memory, wherein the microphone, the speaker, the ASR speech-to-text module, the TTS text-to-speech module, and the processor and memory are located together inside the frame, and wherein the memory has stored thereon computer program instructions that, when executed by the processor, implement a method for serving a user, the method comprising the steps of: S1, receiving a voice command of the user through the microphone; S2, processing the user's voice command with the ASR speech-to-text module to convert it into text; S3, sending the converted text to ChatGPT on a cloud server, returning the answer generated by ChatGPT to the robot, converting the answer into speech through the TTS text-to-speech module, and playing it to the user through the speaker; S4, acquiring three-dimensional image information of an active area using the lidar sensor and the RGB-D camera; S5, acquiring environment information using a Gmapping algorithm based on the three-dimensional image information; S6, searching a path space and determining a moving path using an RRT (Rapidly-exploring Random Tree) algorithm based on the environment information; S7, moving the robot to a target position along the moving path; and S8, performing feature recognition on the object at the target position using a YOLOv7 algorithm.
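As an illustrative sketch of steps S1-S3 (and not the actual on-board implementation), the following Python snippet wires a microphone, an ASR backend, a cloud language model, and a TTS engine together. The `speech_recognition` and `pyttsx3` packages and the `CLOUD_CHAT_URL` endpoint are assumptions standing in for the modules named above.

```python
import requests
import speech_recognition as sr
import pyttsx3  # offline TTS used here only as a stand-in for the robot's TTS module

CLOUD_CHAT_URL = "https://example.com/chat"  # hypothetical cloud language-model endpoint

recognizer = sr.Recognizer()
tts_engine = pyttsx3.init()

def listen_once() -> str:
    """S1/S2: capture a voice command from the microphone and convert it to text (ASR)."""
    with sr.Microphone() as source:
        audio = recognizer.listen(source)
    return recognizer.recognize_google(audio, language="zh-CN")  # any ASR backend could be used

def ask_cloud_model(text: str) -> str:
    """S3: send the recognized text to the cloud language model and return its answer."""
    reply = requests.post(CLOUD_CHAT_URL, json={"prompt": text}, timeout=30)
    reply.raise_for_status()
    return reply.json()["answer"]

def speak(text: str) -> None:
    """S3: convert the generated answer to speech and play it through the speaker (TTS)."""
    tts_engine.say(text)
    tts_engine.runAndWait()

if __name__ == "__main__":
    command = listen_once()
    speak(ask_cloud_model(command))
```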
In a preferred embodiment, the robot communicates wirelessly with the remote server through the WIFI module using the SSH (Secure Shell) network protocol.
In a preferred embodiment, the robot further comprises a robot arm mounted on an upper portion of the frame for gripping the article.
In a preferred embodiment, a three-dimensional map of the active area is constructed in S4 using the acquired three-dimensional image information by Gmapping algorithm.
In a preferred embodiment, the processor is a JETSON-NANO-DEV-KIT-A artificial intelligence host controller.
In a preferred embodiment, a drive motor is also included, which is controlled by a motor driver.
In a preferred embodiment, the motor driver controls the drive motor by outputting a PWM (Pulse Width Modulation) signal through a GPIO pin, and controls the speed and rotation direction of the drive motor through the duty cycle of the PWM signal.
In a preferred embodiment, the lidar sensor is used in S6 to determine whether an obstruction is present within the active area; if an obstruction is present, it is determined whether the size of the obstruction is greater than a threshold, and an obstruction greater than the threshold is determined to be an obstacle.
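A minimal sketch of this obstruction test follows, assuming the laser scan has already been grouped into clusters of 2-D points; the clustering step and the size threshold are illustrative assumptions, not values from the application.

```python
from typing import List, Tuple

Point = Tuple[float, float]          # (x, y) in metres, robot-centred frame

SIZE_THRESHOLD_M = 0.15              # illustrative threshold; the application does not specify a value

def cluster_extent(cluster: List[Point]) -> float:
    """Approximate the size of an obstruction as the diagonal of its bounding box."""
    xs = [p[0] for p in cluster]
    ys = [p[1] for p in cluster]
    return ((max(xs) - min(xs)) ** 2 + (max(ys) - min(ys)) ** 2) ** 0.5

def find_obstacles(clusters: List[List[Point]]) -> List[List[Point]]:
    """Keep only obstructions whose size exceeds the threshold; these are treated as obstacles."""
    return [c for c in clusters if cluster_extent(c) > SIZE_THRESHOLD_M]
```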
In a preferred embodiment, the method further comprises: s9, after the robot reaches the target position, requesting the user to confirm whether the target position is reached or whether the object at the target position is acquired through the microphone.
In a preferred embodiment, the robot further comprises a display, wherein the display displays dialogue text between the user and the robot to the user.
In a preferred embodiment, the robot is capable of learning the user's usage habits over time, including but not limited to user instructions corresponding to usage time and location.
In a preferred embodiment, the method further comprises: S10, when, according to the learned usage habits of the user, the robot reaches the corresponding position but has not received a user instruction, prompting the user according to the learned habits and previous instructions, and interacting with the user.
Compared with the prior art, the beneficial effects of the present disclosure are: the robot overcomes the long-standing limitations of natural language processing technology by accessing a pre-trained language model, achieving more intelligent and efficient voice interaction with the user; it becomes smarter as it learns the user's usage habits over repeated use; and it can automatically move and operate more accurately in the real environment according to the user's instructions to complete the corresponding tasks.
Drawings
The novel features believed characteristic of the application are set forth with particularity in the appended claims. A better understanding of the features and advantages of the present application will be obtained by reference to the following detailed description that sets forth illustrative embodiments, in which the principles of the present application are utilized, and the accompanying drawings. The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the application. Also, like elements are denoted by like reference numerals throughout the drawings, wherein:
FIG. 1 illustrates a schematic diagram of a language interactive robot capable of serving a user according to an exemplary embodiment of the present disclosure;
FIG. 2 illustrates a flowchart of a method performed by the robot to serve a user in accordance with an exemplary embodiment of the present disclosure;
FIG. 3 illustrates a schematic diagram of a three-dimensional map constructed by the robot according to an exemplary embodiment of the present disclosure;
FIG. 4 illustrates a schematic diagram of a remote server remotely controlling a robot according to an exemplary embodiment of the present disclosure; and
FIG. 5 illustrates a schematic diagram of training performed on the robot according to an exemplary embodiment of the present disclosure.
Reference numerals: 1. lidar sensor; 2. RGB-D camera; 3. WIFI module; 4. microphone; 5. speaker; 6. drive motor; 7. processor; 8. frame.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art. Nothing in the following detailed description is intended to indicate that any particular component, feature, or step is essential to the application. Those of skill in the art will understand that various features or steps may be substituted for or combined with one another without departing from the scope of the disclosure.
Fig. 1 shows a schematic diagram of a language interactive robot capable of serving a user according to an exemplary embodiment of the present disclosure. The present disclosure proposes a language interactive robot (i.e., the service robot ServiceGPT) capable of serving a user (e.g., navigation, fetching items, chatting, etc.). The robot may comprise: a lidar sensor 1, an RGB-D camera 2, a WIFI module 3, a microphone 4, a speaker 5, an ASR speech-to-text module, a TTS text-to-speech module, a processor 7 and memory, and a frame 8. The lidar sensor 1 may be located above the frame 8. The RGB-D camera 2 may be located on one side of the frame 8. The lidar sensor 1 and the RGB-D camera 2 are capable of environmental sensing, enabling the user to customize the working range of the robot and enabling the robot to navigate autonomously in an indoor environment while avoiding obstacles. In a preferred embodiment, the lidar sensor 1 may be an RPLIDAR A1 lidar and the RGB-D camera 2 may be a Jetson Nano IMX219 camera. In some other embodiments, the lidar sensor 1 and the RGB-D camera 2 may be any other type of radar sensor and camera deemed appropriate by those skilled in the art. The robot can communicate wirelessly with a remote server through the WIFI module 3. The WIFI module 3 may include a plurality of antennas and may be located on the other side of the frame 8, opposite the RGB-D camera 2. Preferably, the WIFI module 3 may include two antennas, both located on one side of the robot. In a preferred embodiment, the robot may communicate wirelessly with the remote server through the WIFI module 3 using the SSH network protocol. For example, an operator may remotely control the robot from a remote control room through the wireless remote server, such as navigating the robot, viewing in real time the environment around the robot captured by the RGB-D camera 2 on a display of the remote server, or performing cloud training of the generative pre-trained natural language interaction model in the robot. In addition, the SSH network protocol allows the robot to be connected remotely to the server so that unexpected events can be handled quickly, and it ensures high communication stability. The robot may further comprise a microphone 4 for receiving the user's speech; for example, the received speech may be input to the ASR speech-to-text module for further processing. In a preferred embodiment, the microphone 4 may be a dual MEMS (Micro-Electro-Mechanical Systems) silicon microphone. In some other embodiments, the microphone 4 may be any other type of microphone deemed appropriate by those skilled in the art. The robot may further comprise a speaker 5 for playing the robot's speech. The robot may also include an ASR speech-to-text module for converting the received speech of the user into text as input data, and a TTS text-to-speech module for converting text generated by the robot into the robot's speech for playback by the speaker 5. For example, the TTS text-to-speech module may convert the text processed by the processor (i.e., the text output by the fine-tuned and improved supervised tuning model) into the robot's speech for playback by the speaker 5.
Typically, the ASR speech-to-text module, the TTS text-to-speech module, and the processor may be integrated. For example, the ASR speech-to-text module, the TTS text-to-speech module, the processor 7, and the memory may be integrated on the same substrate inside the frame 8, and the substrate may be located in the middle of the robot. On top of conventional robot voice interaction, embodiments of the invention connect the dialogue management modules (such as the ASR speech-to-text module and the TTS text-to-speech module) to a pre-trained language model, which raises the robot's level of intelligence and gives the user a better language interaction experience. The robot may further comprise a GPT model module for interacting with the user. Because training a GPT model requires a large amount of computing resources and time, in some embodiments the GPT model of the invention may be trained on Microsoft's DeepSpeed Chat platform after building the framework and feeding in a large-scale dialogue dataset. Combining the GPT model with a service robot can provide a more efficient, convenient, and comfortable service experience. For example, the GPT model module of the invention may also perform cloud feedback optimization: based on a large database, the module is further trained with the data it collects, and feedback corrections are made continuously during operation to improve the user's personalized experience. The robot can also provide real-time user feedback and model updating: data collected in real time can be uploaded to the cloud for real-time analysis and decision making in order to obtain more accurate feedback and guidance, while key links such as data collection, model training, distributed learning, and model updating are supported. In a preferred embodiment, the robot may further comprise a drive motor 6. The drive motor may be located in the lower part of the robot, and the above-mentioned substrate may be placed above the drive motor 6. Wheels may be attached at both ends of the drive motor 6; when the drive motor 6 rotates, it drives the wheels so that the robot can move. The drive motor 6 may be controlled by a motor driver. In a preferred embodiment, the motor driver may control the drive motor 6 by outputting a PWM signal through a GPIO pin, with the duty cycle of the PWM signal controlling the speed and rotation direction of the drive motor 6. In a preferred embodiment, the robot may further comprise a display, which may show the user the text of the dialogue between the user and the robot. In some embodiments, the display may be a CRT display, an LCD display, an LED display, a 3D display, a plasma display, etc., or any other type of display deemed appropriate by one of ordinary skill in the art. The robot may further comprise a processor 7 and a memory, wherein the memory has stored thereon computer program instructions which, when executed by the processor, implement a method performed by the robot to serve the user. In a preferred embodiment, the processor 7 may be a JETSON-NANO-DEV-KIT-A artificial intelligence master controller.
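The following is a minimal sketch of the duty-cycle-based motor control described above, assuming the Jetson.GPIO package (whose API mirrors RPi.GPIO) and illustrative pin numbers; the actual motor driver wiring in the robot may differ.

```python
import Jetson.GPIO as GPIO  # assumption: Jetson.GPIO is available; API mirrors RPi.GPIO

PWM_PIN = 32        # illustrative PWM-capable board pin on Jetson Nano
DIR_PIN = 36        # illustrative direction pin
PWM_FREQ_HZ = 1000

GPIO.setmode(GPIO.BOARD)
GPIO.setup(PWM_PIN, GPIO.OUT)
GPIO.setup(DIR_PIN, GPIO.OUT)

pwm = GPIO.PWM(PWM_PIN, PWM_FREQ_HZ)
pwm.start(0)  # start with 0 % duty cycle (motor stopped)

def set_motor(speed_percent: float, forward: bool = True) -> None:
    """Set wheel speed via the PWM duty cycle and direction via a separate GPIO pin."""
    GPIO.output(DIR_PIN, GPIO.HIGH if forward else GPIO.LOW)
    pwm.ChangeDutyCycle(max(0.0, min(100.0, speed_percent)))

# Example: drive forward at 40 % duty cycle
set_motor(40.0, forward=True)
```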
In a preferred embodiment, the robot of the present application may use the ROS robot operating system with a "brain plus cerebellum" design: the brain uses the JETSON-NANO-DEV-KIT-A artificial intelligence master controller for artificial-intelligence-related processing, while the cerebellum uses a Raspberry Pi RP2040 microcontroller for processing unrelated to artificial intelligence. In some other embodiments, the processor 7 may be any other type of processor deemed appropriate by those skilled in the art. In some embodiments, the robot may also include an IMU (Inertial Measurement Unit) attitude sensor; the wheels are driven by DC motors whose speed is regulated with PID (Proportional-Integral-Derivative) control so as to obtain a wheel speed (tachometer) output. In some embodiments, the robot further includes a robotic arm mounted on the upper portion of the frame for gripping items.
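To illustrate the PID wheel-speed regulation mentioned above, here is a minimal, library-free sketch; the gains, sampling period, and example values are illustrative assumptions rather than parameters of the robot.

```python
class PID:
    """Simple PID controller for wheel speed regulation."""

    def __init__(self, kp: float, ki: float, kd: float, dt: float):
        self.kp, self.ki, self.kd, self.dt = kp, ki, kd, dt
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, target_speed: float, measured_speed: float) -> float:
        """Return a control effort (e.g. a PWM duty cycle) from the speed error."""
        error = target_speed - measured_speed
        self.integral += error * self.dt
        derivative = (error - self.prev_error) / self.dt
        self.prev_error = error
        return self.kp * error + self.ki * self.integral + self.kd * derivative

# Example: regulate a wheel towards 0.5 m/s using (illustrative) encoder feedback
pid = PID(kp=2.0, ki=0.5, kd=0.05, dt=0.02)
duty = pid.update(target_speed=0.5, measured_speed=0.42)
```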
Fig. 2 illustrates a flowchart of a method performed by the robot to serve a user according to an exemplary embodiment of the present disclosure. The present disclosure also proposes a method, performed by the robot, for serving a user, which may include the following steps S1-S8. At S1, a voice command of the user may be received through the microphone 4. At S2, the user's voice command may be processed by the ASR speech-to-text module to convert it into text; the ASR system converts the user's speech input into text form so that the computer can understand and process it. At S3, the converted text may be sent to a language model on the cloud server, which may be ChatGPT (or a similar language model); the answer generated by the language model is returned to the robot, converted into speech by the TTS text-to-speech module, and played to the user through the speaker. At S4, three-dimensional image information of the active area may be acquired using the lidar sensor and the RGB-D camera. Fig. 3 shows a schematic diagram of a three-dimensional map constructed by the robot according to an exemplary embodiment of the present disclosure. At S5, environmental information may be acquired using a Gmapping algorithm based on the three-dimensional image information. In a preferred embodiment, a three-dimensional map of the active area may be constructed from the acquired three-dimensional image information by a real-time three-dimensional mapping SLAM (Simultaneous Localization and Mapping) algorithm. At S6, a path space may be searched and a moving path determined using an RRT algorithm based on the environmental information, as illustrated in the sketch following this paragraph. In a preferred embodiment, the lidar sensor 1 may be used to determine whether an obstruction is present within the active area; if so, it is determined whether the size of the obstruction is greater than a threshold, and an obstruction greater than the threshold is determined to be an obstacle. At S7, the robot may be moved to the target position along the moving path. At S8, a YOLOv7 algorithm may be used to recognize features of the item at the target position. In a preferred embodiment, the method may further comprise: S9, after the robot reaches the target position, requesting the user through the microphone 4 to confirm whether the target position has been reached or whether the item at the target position has been obtained. In a preferred embodiment, the robot is capable of learning the user's usage habits over time of use, including but not limited to user instructions corresponding to particular times and locations. In a preferred embodiment, the method further comprises: S10, when, according to the learned usage habits, the robot reaches the corresponding position but has not received a user instruction, prompting the user according to the learned habits and previous instructions and interacting with the user.
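The sketch below illustrates the RRT search of S6 on a 2-D plane. The `is_free` collision check stands in for the occupancy information produced in S5, and the sampling bounds, step size, and goal tolerance are illustrative assumptions.

```python
import math
import random
from typing import Callable, List, Optional, Tuple

Point = Tuple[float, float]

def rrt(start: Point, goal: Point, is_free: Callable[[Point], bool],
        step: float = 0.2, goal_tol: float = 0.3, max_iter: int = 5000) -> Optional[List[Point]]:
    """Rapidly-exploring Random Tree: returns a path from start to goal, or None."""
    nodes: List[Point] = [start]
    parent = {0: None}
    for _ in range(max_iter):
        # bias 10 % of samples towards the goal, otherwise sample the (illustrative) workspace
        sample = goal if random.random() < 0.1 else (random.uniform(-5, 5), random.uniform(-5, 5))
        i_near = min(range(len(nodes)), key=lambda i: math.dist(nodes[i], sample))
        near = nodes[i_near]
        d = math.dist(near, sample)
        if d == 0:
            continue
        new = (near[0] + step * (sample[0] - near[0]) / d,
               near[1] + step * (sample[1] - near[1]) / d)
        if not is_free(new):                      # collision check against the map built in S5
            continue
        nodes.append(new)
        parent[len(nodes) - 1] = i_near
        if math.dist(new, goal) < goal_tol:       # goal reached: walk back up the tree
            path, i = [], len(nodes) - 1
            while i is not None:
                path.append(nodes[i])
                i = parent[i]
            return path[::-1]
    return None

# Example with a trivial obstacle-free map (illustrative only)
path = rrt((0.0, 0.0), (3.0, 2.0), is_free=lambda p: True)
```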
Fig. 4 illustrates a schematic diagram of a remote server remotely controlling the robot according to an exemplary embodiment of the present disclosure. In a preferred embodiment, the lidar sensor 1 and the RGB-D camera 2 build a three-dimensional environment map while moving, using a real-time three-dimensional mapping SLAM algorithm, and estimate the position and pose of the robot in the map for autonomous navigation. In a preferred embodiment, the lidar sensor 1 and the RGB-D camera 2 sense 3D scene and 2D image information of the environment to accurately identify, locate, and track objects. The robot may activate the camera to capture images, compress them, and transmit them to the remote server via the WIFI module 3; the server receives the compressed images and processes them accordingly. For example, the robot may communicate wirelessly with the remote server using the SSH network protocol, so that an operator may use the wireless remote server to remotely control the robot, view the environment surrounding the robot in real time through a display, and so on.
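A minimal sketch of the capture-compress-transmit flow described above follows, assuming OpenCV for JPEG compression and a plain TCP socket to a hypothetical server address; the real system transports the data over the WIFI module and SSH, which are not reproduced here.

```python
import socket
import struct
import cv2  # assumption: OpenCV is available on the robot

SERVER_ADDR = ("192.168.1.100", 9000)   # hypothetical remote server address/port

def send_compressed_frame(sock: socket.socket, frame) -> None:
    """Compress one camera frame as JPEG and send it length-prefixed over the socket."""
    ok, jpeg = cv2.imencode(".jpg", frame, [int(cv2.IMWRITE_JPEG_QUALITY), 80])
    if not ok:
        return
    data = jpeg.tobytes()
    sock.sendall(struct.pack(">I", len(data)) + data)

def main() -> None:
    cap = cv2.VideoCapture(0)            # activate the camera's colour stream
    with socket.create_connection(SERVER_ADDR) as sock:
        while True:
            grabbed, frame = cap.read()
            if not grabbed:
                break
            send_compressed_frame(sock, frame)

if __name__ == "__main__":
    main()
```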
Fig. 5 shows a schematic diagram of training performed on the robot according to an exemplary embodiment of the present disclosure. For example, the data to be input may first be labeled, and the labeled data 11 is then fed into a pre-trained model 12 for training to obtain a supervised tuning model 13 (i.e., an SFT (Supervised Fine-Tuning) model). Because of the complexity and uncertainty of the real-world environment, learning directly from the reward signal can be very difficult; reinforcement learning with human feedback is therefore an effective method. Reinforcement learning from human feedback uses human feedback as an additional signal to guide the learning of the agent, helping to produce a high-quality generative natural language interaction model. Training a natural language interaction model requires large amounts of computing resources and time; for this purpose the invention uses an existing platform, i.e., the pre-trained model is a model pre-trained on the existing platform and pre-installed into the robot of the invention, which processes the input data received from the ASR speech-to-text module and outputs the processed or generated result to the TTS text-to-speech module for playback by the speaker 5. After the framework is built and a large-scale dialogue dataset is provided, the platform can perform end-to-end RLHF (Reinforcement Learning from Human Feedback) training of the natural language interaction model, i.e., reinforcement learning using human feedback. The supervised tuning model obtained after training may include an exponential moving average 14, an actor model 15, and a frozen reference model 16. Combining the generative natural language interaction model with the service robot can provide a more efficient, convenient, and comfortable service experience. Second, the output dataset of the supervised tuning model may be scored (i.e., paired good/bad answers 17) and then fed into the pre-trained model 12 for training to obtain an RW (Reward) model 19, which is distinct from the pre-trained model 12 described above. In some embodiments, a critic model 20 may be obtained in addition to the independent reward model. Third, the supervised tuning model may be further fine-tuned and improved based on the reward feedback from the reward model 19; in general, reinforcement learning is applied to tune the SFT model using the trained reward model. For example, based on the reward feedback of the reward model, the critic model, the actor model, or the frozen reference model, the supervised tuning model may be further fine-tuned and improved with a PPO (Proximal Policy Optimization) algorithm (executed by DeepSpeed 22) by feeding data 21 generated using the actor model, or the actor's pre-training data 24 or pre-training objectives 23, into the existing platform (i.e., DeepSpeed 22). The fine-tuned and improved supervised tuning model may then output the generated results, and the text of those results is passed to the TTS text-to-speech module, which converts it into the robot's speech for playback by the speaker 5.
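To make the PPO fine-tuning step concrete, the following NumPy sketch shows PPO's clipped surrogate objective, with advantages assumed to come from the reward and critic models described above; it is a generic illustration of the algorithm, not the DeepSpeed Chat implementation.

```python
import numpy as np

def ppo_clipped_loss(logp_new: np.ndarray,
                     logp_old: np.ndarray,
                     advantages: np.ndarray,
                     clip_eps: float = 0.2) -> float:
    """PPO clipped surrogate loss (to be minimized) for a batch of sampled tokens/actions.

    logp_new / logp_old: log-probabilities under the current actor and under the policy
    snapshot that generated the samples; advantages: estimates derived from the reward
    and critic models.
    """
    ratio = np.exp(logp_new - logp_old)                       # importance sampling ratio
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    return float(-np.mean(np.minimum(unclipped, clipped)))    # negative: we minimize the loss

# Illustrative call with toy numbers
loss = ppo_clipped_loss(np.array([-1.0, -0.7]), np.array([-1.1, -0.9]), np.array([0.5, -0.2]))
```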
At present most service robots can only provide voice broadcasting, text interaction, and the like, and their degree of intelligence is not high. To enhance the convenience and applicability of the design, natural language interaction with the user is an indispensable technology, and the voice interaction technology needs to be highly intelligent, recognize the user's intention accurately, and be strongly adaptable. Compared with the prior art, the language model can be given end-to-end RLHF training, i.e., reinforcement learning using human feedback. The robot thereby overcomes the long-standing limitations of natural language processing technology: by accessing the pre-trained language model it achieves more intelligent and efficient voice interaction with the user, navigates more accurately, senses 3D scene and 2D image information of the real environment more precisely, can perform large-scale complex computation and AI inference, and can move and operate more accurately in the real environment according to the user's instructions to complete the corresponding tasks.
It should be understood that the systems and/or methods of the various embodiments provided in the present disclosure may be combined, modified and/or altered to form new solutions. Such solutions should also be included within the scope of the invention as claimed without inventive effort.
Numerous specific examples are provided in the embodiments provided herein, with the understanding that these examples are set forth merely to illustrate embodiments of the invention and are not intended to limit the invention. Embodiments of the invention may be practiced without these specific examples. Methods, structures and/or techniques well known to those skilled in the art have not been shown in detail in some embodiments so as not to obscure the understanding of this description.
While preferred embodiments of the present invention have been shown and described herein, it will be obvious to those skilled in the art that such embodiments are provided by way of example only. Numerous variations, changes, and substitutions will now occur to those skilled in the art without departing from the invention. It should be understood that various alternatives to the embodiments of the invention described herein are optionally employed in practicing the invention. It is intended that the following claims define the scope of the invention and that methods and structures within the scope of these claims and their equivalents be covered thereby.

Claims (12)

1. A language interactive robot capable of serving a user, the robot comprising:
a frame;
the lidar sensor is positioned above the frame;
the RGB-D camera is positioned on one side of the frame, and the lidar sensor and the RGB-D camera can sense the environment, so that a user can customize the working range of the robot;
the WIFI module is positioned on the other side of the frame, opposite the RGB-D camera, and the robot can perform wireless communication with a remote server through the WIFI module;
the GPT model module is used for interacting with the user;
the GPT model module comprises a microphone for receiving the voice of the user;
the loudspeaker is used for playing the voice of the robot;
an ASR speech to text module for converting the received speech of the user into text as input data;
the TTS text-to-speech module is used for converting the text generated by the robot into the voice of the robot for playing by the loudspeaker;
a processor and a memory, wherein the microphone, the speaker, the ASR speech-to-text module, the TTS text-to-speech module, and the processor and memory are located together inside the frame, and wherein the memory has stored thereon computer program instructions that, when executed by the processor, implement a method for serving a user, the method comprising:
s1, receiving a voice command of the user through the microphone;
s2, the voice command of the user is processed by the ASR voice-to-text module so as to be converted into text;
s3, the converted text is sent to a ChatGPT of a cloud server, an answer generated by the ChatGPT is returned to the robot, and then the answer is converted into voice through the TTS text-to-voice module and is played to the user through the loudspeaker;
s4, acquiring three-dimensional image information of an active area by using the laser radar sensor and the RGB-D camera;
s5, acquiring environment information by using a Gmapping algorithm based on the three-dimensional image information;
s6, searching a path space and determining a moving path by using an RRT algorithm based on the environment information;
s7, enabling the robot to move to a target position along the moving path; and
s8, performing feature recognition on the object at the target position by using a YOLOv7 algorithm.
2. The robot of claim 1, wherein the robot communicates wirelessly with a remote server via the WIFI module using the SSH network protocol.
3. The robot of claim 1, further comprising a robotic arm mounted on an upper portion of the frame for gripping an item.
4. The robot of claim 1, wherein the three-dimensional map of the active area is constructed using the acquired three-dimensional image information through a Gmapping algorithm in S4.
5. The robot of claim 1, wherein the processor is a JETSON-NANO-DEV-KIT-a artificial intelligence master controller.
6. The robot of claim 1, further comprising a drive motor, the drive motor being controlled by a motor driver.
7. The robot of claim 6, wherein the motor driver controls the drive motor by outputting a PWM signal through a GPIO pin, and controls the speed and rotation direction of the drive motor by the duty cycle of the PWM signal.
8. The robot of claim 1, wherein in S6 the lidar sensor is used to determine whether an obstruction is present in the active area; if an obstruction is present, it is determined whether the size of the obstruction is greater than a threshold, and an obstruction greater than the threshold is determined to be an obstacle.
9. The robot of claim 1, wherein the method further comprises: s9, after the robot reaches the target position, requesting the user to confirm whether the target position is reached or whether the object at the target position is acquired through the microphone.
10. The robot of claim 1, further comprising a display that displays to a user text of a conversation of the user with the robot.
11. The robot of claim 1, wherein the robot is capable of learning usage habits of the user over time of use, including but not limited to user instructions corresponding to time of use and location.
12. The robot of claim 11, wherein the method further comprises: S10, when, according to the learned usage habits of the user, the robot reaches the corresponding position but has not received a user instruction, prompting the user according to the learned habits and instructions of the user, and interacting with the user.
CN202311800839.9A 2023-12-22 2023-12-22 Language interactive robot capable of serving user Pending CN117532633A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311800839.9A CN117532633A (en) 2023-12-22 2023-12-22 Language interactive robot capable of serving user

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311800839.9A CN117532633A (en) 2023-12-22 2023-12-22 Language interactive robot capable of serving user

Publications (1)

Publication Number Publication Date
CN117532633A true CN117532633A (en) 2024-02-09

Family

ID=89786317

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311800839.9A Pending CN117532633A (en) 2023-12-22 2023-12-22 Language interactive robot capable of serving user

Country Status (1)

Country Link
CN (1) CN117532633A (en)

Similar Documents

Publication Publication Date Title
Asoh et al. Jijo-2: An office robot that communicates and learns
JP3714268B2 (en) Robot device
US20140371912A1 (en) Hierarchical robotic controller apparatus and methods
JP2020533708A (en) Robot dynamic learning methods, systems, robots and cloud servers
US20190362234A1 (en) Artificial intelligence apparatus for cleaning in consideration of user's action and method for the same
JP7375748B2 (en) Information processing device, information processing method, and program
EP3757714B1 (en) Machine learning method and mobile robot
Perera et al. Setting up pepper for autonomous navigation and personalized interaction with users
KR20210004487A (en) An artificial intelligence device capable of checking automatically ventaliation situation and operating method thereof
CN110844402B (en) Garbage bin system is summoned to intelligence
CN111319044A (en) Article grabbing method and device, readable storage medium and grabbing robot
JP2017170584A (en) Robot action simulation device
KR20190102151A (en) Artificial intelligence server and method for providing information to user
KR20190095193A (en) An artificial intelligence apparatus for managing operation of artificial intelligence system and method for the same
US20200001463A1 (en) System and method for cooking robot
KR20190085895A (en) Artificial intelligence device that can be controlled according to user gaze
KR20190104488A (en) Artificial intelligence robot for managing movement of object using artificial intelligence and operating method thereof
JPWO2019138619A1 (en) Information processing equipment, information processing methods, and programs
CN113759901A (en) Mobile robot autonomous obstacle avoidance method based on deep reinforcement learning
KR20190098102A (en) Artificial intelligence device for controlling external device
US20200269421A1 (en) Information processing device, information processing method, and program
KR20200128486A (en) Artificial intelligence device for determining user's location and method thereof
KR20210044662A (en) Robot and method of controlling the robot
WO2020164003A1 (en) Visualization method and system for intelligent wheelchair
KR102314385B1 (en) Robot and contolling method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination