CN116572260A - Emotion communication accompanying and nursing robot system based on artificial intelligence generated content - Google Patents

Emotion communication accompanying and nursing robot system based on artificial intelligence generated content

Info

Publication number
CN116572260A
Authority
CN
China
Prior art keywords
module
robot
information
subsystem
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310274857.1A
Other languages
Chinese (zh)
Inventor
禹鑫燚
李元龙
江晨炘
陆利钦
俞俊鑫
欧林林
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN202310274857.1A
Publication of CN116572260A
Legal status: Pending (current)

Links

Classifications

    • B - PERFORMING OPERATIONS; TRANSPORTING
    • B25 - HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25J - MANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00 - Manipulators not otherwise provided for
    • B25J11/0005 - Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P - CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 - Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/02 - Total factory control, e.g. smart factories, flexible manufacturing systems [FMS] or integrated manufacturing systems [IMS]

Abstract

An emotion communication companion and care robot system based on artificial intelligence generated content. The system comprises an information acquisition device, a data analysis and processing device, and a human-machine interaction and collaboration device. The data analysis and processing system mainly comprises an information processing subsystem, a danger perception subsystem, a speech processing subsystem, a digital human subsystem, and a robot control subsystem. An elderly user may converse with the robot or request assistance by voice. The robot collects environmental information through the information acquisition device, which includes a microphone and a monocular camera. The collected information is processed and decisions are made by the data analysis and processing system, which controls the human-machine interaction and collaboration device, comprising a display, a loudspeaker, a robotic arm, and bottom wheels, thereby realizing emotion communication and care functions for the elderly. The invention realizes emotion communication and companionship between the care robot and the elderly, and overcomes the shortcoming that existing related products struggle to communicate effectively with the elderly.

Description

Emotion communication accompanying and nursing robot system based on artificial intelligence generated content
Technical Field
The invention relates to an emotion companion and care robot system that integrates Artificial Intelligence Generated Content (AIGC) technology, digital human technology, and robotics.
Background
The aging of the world population is accelerating, and the elderly population of China keeps growing; it is expected that by 2050 people aged 65 and older will account for more than one third of China's population, which will place greater pressure on families and society. As an inevitable product of social development, the elderly care industry can be divided into three fields: community-based care, institutional care, and home-based care. Community elderly care services in China are still at an early stage and lack complete supporting facilities. In institutional care, although there are many nursing homes, problems such as uneven service quality and large gaps in management level remain. Home-based care places pressure on younger family members, who must look after the elderly while working, while elderly people living alone face many inconveniences.
Improving the self-care ability of the elderly through external assistance is an innovative approach. Existing external aids include intelligent sensors that measure physical indicators such as blood pressure and heart rate and upload the data to the cloud, so that doctors and family members can judge the physical state of the elderly from the data. Smart-home systems also help the elderly control home appliances, curtains, and other devices through language interaction, improving the convenience of daily life. Future elderly care services will therefore rely increasingly on technology; combining robots with elderly care can not only provide services customized to individual needs but also improve the quality of life of the elderly and raise the overall level of care.
Disclosure of Invention
Aiming at the problem that existing related technologies find it difficult to communicate effectively with the elderly, the invention combines artificial intelligence content generation technology (such as large-scale language models) with digital human technology and, by means of robotics, provides an emotion communication companion robot system that realizes emotion communication and care functions for elderly people living at home.
The emotion communication companion and care robot system based on artificial intelligence generated content comprises an information acquisition device, a data analysis and processing device, and a human-machine interaction and collaboration device.
The main function of the information acquisition device is to acquire environmental data through sensors. It mainly includes a monocular/binocular camera and a microphone. The camera is mainly used to acquire scene information around the robot, including images of the user and the scene. The microphone collects the user's voice. The information acquisition device sends the acquired scene and voice information to the data analysis and processing device for processing.
The main function of the human-machine interaction and collaboration device is to interact with people; it is the execution stage for the results of the data analysis and processing device. It mainly comprises a display, a loudspeaker, a robotic arm, and the robot's bottom motion mechanism. The device executes the processing results of the data analysis and processing device: the display shows the digital human image of the designated relative; the loudspeaker plays synthesized speech in the designated relative's voice; the robotic arm carries out interactive collaboration with the user; and the robot has autonomous and following movement functions.
The data analysis and processing device is the core of the whole system. Its main functions are to analyze and process the data collected by the sensors and to control the human-machine interaction device so that it performs the designed functions. The device consists of a data analysis and processing system running on the robot's onboard host. The whole system consists of an information processing subsystem, a danger perception subsystem, a speech processing subsystem, a digital human subsystem, and a robot control subsystem.
The information processing subsystem receives image data and transmits it to the digital human subsystem, the robot control subsystem, and the danger perception subsystem. In addition, it receives text data generated by the speech processing subsystem, the scene model generated by the robot control subsystem, and the human posture data generated by the danger perception subsystem, performs information fusion and decision making, and transmits control instructions to the robot control subsystem. The information processing subsystem comprises an image data processing module and an information fusion processing module:
the main function of the image data processing module is to perform preliminary processing of received image data with a deep convolutional network. In the robot setup stage, the module receives the relative's images required to build the digital human and transmits them to the digital human subsystem to build the digital human model. While the robot is running, the module classifies user images and environment images with the deep network and transmits them to the danger perception subsystem and the robot control subsystem, respectively, for processing;
the main function of the information fusion processing module is to analyze data from different modalities and make decisions. The module processes the text instructions generated by the speech processing subsystem with a deep network, combines them with the user's pose estimation data and the scene data, generates key spatial position information for the robot's movement, and transmits it to the robot control subsystem. If the user is in a dangerous situation, the module generates the motion key point information for the robot's response from the current user pose estimate and scene data and transmits it to the robot control subsystem.
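By way of illustration, the sketch below shows one way such a fusion step could be organized. The function names, the keyword rule, and the waypoint format are assumptions of this sketch, not the implementation described by the invention.

```python
from dataclasses import dataclass

@dataclass
class FusionResult:
    target: tuple          # (x, y) waypoint in the scene frame, metres
    reason: str            # why the robot should move there

def fuse(command_text: str, user_pose_xy: tuple, scene_obstacles: list) -> FusionResult:
    """Toy multimodal fusion: combine a text command with the user's
    estimated position and a list of obstacle points to pick a waypoint."""
    # Hypothetical rule: "come"/"help" style commands send the robot to the user.
    if "come" in command_text.lower() or "help" in command_text.lower():
        target, reason = user_pose_xy, "user requested assistance"
    else:
        target, reason = (0.0, 0.0), "no motion command detected"  # fall back to dock
    # Reject targets that coincide with known obstacles (very coarse check).
    for ox, oy in scene_obstacles:
        if abs(target[0] - ox) < 0.3 and abs(target[1] - oy) < 0.3:
            target = (target[0] + 0.5, target[1])
            reason += "; target adjusted away from obstacle"
            break
    return FusionResult(target=target, reason=reason)

print(fuse("please come help me", (2.0, 1.5), [(2.1, 1.4)]))
```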
The danger perception subsystem receives the user images processed by the information processing subsystem and performs pose estimation; it judges, together with the scene data, whether the user's posture and position are dangerous. If the user is in a dangerous state, the subsystem sends a robot control instruction to the robot control subsystem and transmits the information to the information processing subsystem. The danger perception subsystem comprises a pose estimation module and a danger processing module:
the main function of the pose estimation module is to estimate the user's posture from the input human body images. The module uses a deep-learning-based three-dimensional pose estimator. Because the user's posture must be detected in real time during use, the estimator uses a lightweight backbone network to improve its real-time performance. In addition, to handle occlusion during pose estimation, the module adopts a pose-estimation-priority and joint-redundancy strategy. The human posture data produced by the module is further processed by the danger processing module;
the main function of the danger processing module is to judge, from the input human pose estimate and the scene model, whether the user's posture and position are dangerous, so as to prevent injury from accidental falls or from entering a dangerous environment. If the user is in an emergency state, an alarm can be raised, handled through the robot control subsystem, and a danger signal is transmitted to the information processing subsystem.
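As a minimal sketch of these two modules, the code below runs a lightweight 3D pose estimator on one camera frame and applies a coarse fall heuristic. MediaPipe Pose is used only as a stand-in for the lightweight estimator described above, and the 0.15 m head-to-hip threshold is an illustrative assumption, not a value from the invention.

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose

def is_fallen(world_landmarks) -> bool:
    """Coarse fall heuristic: if the head is roughly at hip height
    (torso near horizontal), flag a possible fall."""
    lm = world_landmarks.landmark
    head_y = lm[mp_pose.PoseLandmark.NOSE].y
    hip_y = (lm[mp_pose.PoseLandmark.LEFT_HIP].y +
             lm[mp_pose.PoseLandmark.RIGHT_HIP].y) / 2.0
    return abs(head_y - hip_y) < 0.15   # illustrative threshold in metres

cap = cv2.VideoCapture(0)
# model_complexity=0 selects the lightest backbone, trading accuracy for speed.
with mp_pose.Pose(model_complexity=0) as pose:
    ok, frame = cap.read()
    if ok:
        result = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if result.pose_world_landmarks and is_fallen(result.pose_world_landmarks):
            print("possible fall detected, raise alarm")
cap.release()
```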
The speech processing subsystem receives and processes the user's voice data, generates reply text and digital human action/expression text, and transmits them to the digital human subsystem. In addition, if the user's speech contains a robot control instruction, the converted robot-control text is transmitted to the information processing subsystem for further processing. The speech processing subsystem comprises a speech recognition module and an LLM module:
the main function of the speech recognition module is to analyze the input voice data. The voice data is converted into text with deep-learning-based speech recognition, and the text is passed to the LLM module for further processing;
the main functions of the LLM module are to generate a reply text from the input text with a large-scale language model and, as needed, to generate text describing the action and expression that accompany the reply. This module is the core of the whole emotion communication system; content generation and conversion are carried out by the large-scale language model.
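The following sketch illustrates one possible shape of the LLM module, assuming an OpenAI-compatible chat endpoint and its Python SDK. The prompt, the JSON schema, and the model name are assumptions of this sketch; any large-scale language model exposing a similar interface could be substituted, and the input text is assumed to come from any deep-learning speech recognition front end.

```python
import json
from openai import OpenAI  # assumes an OpenAI-compatible endpoint is configured

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = (
    "You are a companion robot speaking as a caring family member. "
    "Reply in JSON with three fields: reply (what to say), "
    "action (a short motion description for the avatar), "
    "expression (a short facial-expression description)."
)

def generate_reply(recognized_text: str) -> dict:
    """Turn recognized user speech into a reply plus avatar action/expression text."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",                     # illustrative model name
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": recognized_text},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

print(generate_reply("I feel a bit lonely today."))
```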
The digital human subsystem receives image data, builds the digital human model, generates digital human motion from the text produced by the speech processing subsystem, renders it, and outputs video for display. In addition, the subsystem synthesizes the received reply text into speech with the preset voice characteristics of the user's relative and plays it. The digital human subsystem comprises a model building module, an action generation module, an expression driving module, a model rendering module, and a speech generation module:
the main function of the model building module is to generate the digital human figure. The module takes images of the subject from multiple angles, obtains the spatial positions and colors of feature points on the body surface from those images, and fuses them into a three-dimensional model of the human body. OpenPose is used to detect human keypoints on the reconstructed model, and a 3D joint-driven model of the real human body is built with the SMPL method. The generated model is transmitted to the expression driving module and the action generation module for further processing;
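As a minimal sketch of the joint-driven body model, the snippet below instantiates an SMPL model with the smplx package. It assumes the SMPL model files have been downloaded separately into a local "models" directory, and the shape and pose values are placeholders rather than parameters recovered from real multi-view images.

```python
import torch
import smplx  # pip install smplx; SMPL model files must be obtained separately

# Build a parametric human body model (SMPL) that can later be driven joint by joint.
model = smplx.create(model_path="models", model_type="smpl", gender="neutral")

betas = torch.zeros(1, 10)         # body-shape coefficients (placeholder: mean shape)
body_pose = torch.zeros(1, 69)     # axis-angle rotations for the 23 body joints
global_orient = torch.zeros(1, 3)  # root orientation

output = model(betas=betas, body_pose=body_pose, global_orient=global_orient,
               return_verts=True)
print("mesh vertices:", output.vertices.shape)   # (1, 6890, 3)
print("3D joints:", output.joints.shape)
```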
the main function of the action generation module is to generate the joint motion of the digital human model from an action description text. The module takes the action text produced by the speech processing subsystem, feeds it to a Human-Motion-Diffusion model, and combines the result with the existing real-body model to generate the motion;
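The sketch below is only a schematic of the reverse-diffusion sampling that text-to-motion diffusion models perform. The denoiser here is an untrained placeholder network (so the output is noise) and the text embedding is random; it illustrates the control flow of the technique, not the actual Human-Motion-Diffusion implementation.

```python
import torch
import torch.nn as nn

T, FRAMES, JOINT_DIM = 50, 60, 69   # diffusion steps, motion length, pose dimension

class Denoiser(nn.Module):
    """Placeholder denoiser: predicts the noise in a motion sequence given the
    timestep and a text embedding (a trained model would replace this)."""
    def __init__(self, text_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(JOINT_DIM + 1 + text_dim, 128), nn.ReLU(),
            nn.Linear(128, JOINT_DIM))
    def forward(self, x, t, text_emb):
        t_feat = torch.full((*x.shape[:2], 1), float(t) / T)
        txt = text_emb.expand(x.shape[0], x.shape[1], -1)
        return self.net(torch.cat([x, t_feat, txt], dim=-1))

betas = torch.linspace(1e-4, 2e-2, T)            # noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

def sample_motion(text_emb, model):
    """DDPM-style reverse process: start from noise and denoise step by step."""
    x = torch.randn(1, FRAMES, JOINT_DIM)
    for t in reversed(range(T)):
        eps = model(x, t, text_emb)
        x = (x - betas[t] / torch.sqrt(1 - alpha_bar[t]) * eps) / torch.sqrt(alphas[t])
        if t > 0:
            x = x + torch.sqrt(betas[t]) * torch.randn_like(x)
    return x  # per-frame joint rotations that could drive the SMPL model above

motion = sample_motion(torch.randn(1, 1, 32), Denoiser())
print(motion.shape)   # torch.Size([1, 60, 69])
```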
the main function of the expression driving module is to generate the digital human expression corresponding to an expression text. Specifically, expression-muscle parameters are first identified on the existing 3D model, converting the person's facial data into muscle parameters. On that basis, the expression text produced by the speech processing subsystem is analyzed to generate the corresponding expression parameters, and the facial keypoints of the model are driven by the new parameters, realizing the conversion from text to expression;
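A toy version of the text-to-expression-parameter step is sketched below. The blendshape names, weights, and keyword lookup are illustrative assumptions; a deployed system would learn this mapping rather than use a fixed table.

```python
# Illustrative mapping from expression text to facial "muscle"/blendshape weights (0..1).
EXPRESSION_TABLE = {
    "smile":    {"mouth_corner_up": 0.8, "eye_squint": 0.3, "brow_raise": 0.1},
    "surprise": {"jaw_open": 0.6, "brow_raise": 0.9, "eye_wide": 0.7},
    "concern":  {"brow_furrow": 0.7, "mouth_corner_down": 0.4},
}

def expression_params(expression_text: str) -> dict:
    """Return blendshape weights for the first keyword found in the text."""
    text = expression_text.lower()
    for keyword, weights in EXPRESSION_TABLE.items():
        if keyword in text:
            return weights
    return {}  # neutral face when nothing matches

print(expression_params("a warm smile while listening"))
```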
the main function of the model rendering module is to integrate the body motion produced by the action generation module and the expression driving module and to perform optimized rendering with UE5, so that the digital human's actions and expressions are presented smoothly. The module outputs the rendered result to the robot's display;
the main function of the speech generation module is to synthesize speech in a designated person's voice from text. Before use, a segment of Chinese speech from the designated person is collected as a training set and its timbre features are extracted; a voice-cloning synthesis model is then trained with those features. Finally, the trained voice-cloning model generates the designated person's voice from the conversation text produced by the speech processing subsystem. The module outputs the synthesized speech, which is played through the robot's loudspeaker to reply to the user.
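One possible realization of the voice-cloning step is sketched below with the open-source Coqui TTS library as a stand-in for the trained voice-cloning model described above. The model name, the reference file name, and the "zh-cn" language code are assumptions of this sketch.

```python
from TTS.api import TTS  # Coqui TTS: pip install TTS

# Load a multilingual voice-cloning model (XTTS v2 supports Mandarin, "zh-cn").
tts = TTS("tts_models/multilingual/multi-dataset/xtts_v2")

# "relative_sample.wav" is assumed to be a short recording of the designated
# relative, used as the reference from which timbre features are extracted.
tts.tts_to_file(
    text="今天天气不错，我们出去散散步吧。",   # reply text from the LLM module
    speaker_wav="relative_sample.wav",
    language="zh-cn",
    file_path="reply.wav",                      # played through the robot's speaker
)
```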
The robot control subsystem receives control instructions from the information processing subsystem and the danger perception subsystem and generates the control information that drives the robot, including its movement and the collaboration of the robotic arm. While the robot moves, the subsystem models the scene and transmits the scene model to the information processing subsystem for further analysis. The robot control subsystem comprises a navigation and scene modeling module, a human-robot collaboration module, and a motion control module:
the main functions of the navigation and scene modeling module are to control the robot's direction of movement and to model the scene. The module relies on VSLAM: it extracts and analyzes feature points from the input monocular images, uses them to localize the robot as it moves, and builds the scene model from the acquired images. The module receives movement-target information from the information processing subsystem and the danger perception subsystem and transmits the robot's trajectory information to the motion control module;
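The following sketch shows only the feature-extraction and matching front end of such a monocular VSLAM pipeline, using ORB features in OpenCV; a complete system would also recover camera pose from the matches and maintain a map. The camera index and the match limit are assumptions of this sketch.

```python
import cv2

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def track_features(prev_gray, curr_gray):
    """Extract and match ORB features between consecutive monocular frames.
    The correspondences would feed pose estimation (e.g. essential-matrix
    recovery) and, over time, the scene model used for navigation."""
    kp1, des1 = orb.detectAndCompute(prev_gray, None)
    kp2, des2 = orb.detectAndCompute(curr_gray, None)
    if des1 is None or des2 is None:
        return []
    matches = sorted(matcher.match(des1, des2), key=lambda m: m.distance)
    return [(kp1[m.queryIdx].pt, kp2[m.trainIdx].pt) for m in matches[:200]]

cap = cv2.VideoCapture(0)
ok1, f1 = cap.read()
ok2, f2 = cap.read()
if ok1 and ok2:
    pairs = track_features(cv2.cvtColor(f1, cv2.COLOR_BGR2GRAY),
                           cv2.cvtColor(f2, cv2.COLOR_BGR2GRAY))
    print(f"tracked {len(pairs)} feature correspondences")
cap.release()
```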
the main function of the human-robot collaboration module is to help the robot interact and cooperate with the user in the physical environment. Using the robot operation key point information generated by the information processing subsystem, the module plans the motion trajectory of the robotic arm and transmits it to the motion control module for processing;
the main function of the motion control module is to generate the robot's control signals. From the trajectory data of the human-robot collaboration module and the navigation and scene modeling module, the motion control module generates and outputs the corresponding robot control signals.
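As an illustrative stand-in for the motion control module, the sketch below implements a proportional waypoint controller for a differential-drive base. The gains, velocity limits, and command format are assumptions of this sketch, not values specified by the invention.

```python
import math

def velocity_command(robot_x, robot_y, robot_theta, goal_x, goal_y,
                     k_lin=0.5, k_ang=1.5, max_v=0.4, max_w=1.0):
    """Proportional controller for a differential-drive base: turn toward the
    goal and drive forward; returns (linear m/s, angular rad/s)."""
    dx, dy = goal_x - robot_x, goal_y - robot_y
    distance = math.hypot(dx, dy)
    heading_error = math.atan2(dy, dx) - robot_theta
    # Wrap the heading error to [-pi, pi] so the robot turns the short way.
    heading_error = math.atan2(math.sin(heading_error), math.cos(heading_error))
    v = min(k_lin * distance, max_v) if abs(heading_error) < 0.5 else 0.0
    w = max(-max_w, min(k_ang * heading_error, max_w))
    return v, w

print(velocity_command(0.0, 0.0, 0.0, 1.0, 1.0))
```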
The invention applies artificial intelligence content generation to the voice interaction and control of an elderly-care robot. To address the lack of emotional communication in existing care robots, language communication between the elderly user and the care robot is realized with a large-scale language model and speech synthesis. In addition, combined with digital human technology, the robot presents the concrete image of a relative, and as the conversation proceeds the digital human's actions and expressions change with the dialogue content, meeting the emotional communication needs of the elderly.
The invention applies artificial intelligence content generation to the presentation of the digital human, making its actions and expressions more realistic.
The emotion companion and care robot system of the invention integrates artificial intelligence content generation, virtual digital human technology, and robot control. It can follow the elderly user in real time for active or passive communication, providing daily companionship while meeting emotional communication needs. In addition, the robot provides care functions: it can monitor the user's physical condition, recognize whether the user is in a dangerous posture, and help prevent falls. When the user needs help, the robot can provide human-robot collaboration within a certain range.
The invention has the following advantages:
1. The emotion companion and care robot system designed and realized by the invention not only performs care functions for the elderly but also, by means of the dialogue capability of a large-scale language model, lets the elderly user converse with a digital human that has the appearance of a relative, realizing an emotional communication function close to real companionship.
2. By using the information generation capability of artificial intelligence content generation technology, the invention makes the robot's assistance and care more intelligent and able to satisfy more of the elderly user's needs.
3. The robot can be controlled by voice or remotely, and can adapt to more complex application environments.
4. The system is modular in design and has good flexibility and extensibility.
Drawings
FIG. 1 is a schematic diagram of the system components of the present invention
FIG. 2 is a block diagram of a data analysis processing system according to the present invention
FIG. 3 is a schematic diagram of the system function of the present invention
Detailed Description
The invention is further described in detail below with reference to the accompanying drawings.
As shown in fig. 1, an emotion companion and care robot system based on artificial intelligence content generation comprises an information acquisition device, a data analysis and processing device, and a human-machine interaction and collaboration device.
The information acquisition device is the main device used by the robot to collect environmental data. It includes a monocular/multi-view camera and a microphone, which respectively acquire scene data near the robot and the user's voice. The device transmits the collected environmental data to the data analysis and processing device for processing.
The data analysis and processing device is key to realizing the robot's functions. It consists of a data analysis and processing system running on the robot's onboard host. The device analyzes the acquired data, makes decisions, and controls the human-machine interaction and collaboration device to realize the designed functions. The processing results, including voice information, display information, and robot control signals, are transmitted to the human-machine interaction and collaboration device.
The human-machine interaction and collaboration device is the main device for performing the robot's functions and interacting with the user. It mainly comprises a display, a loudspeaker, a robotic arm, and bottom wheels. The display shows the digital human image of the user's relative and the robot's running information. The loudspeaker plays speech with the voice characteristics of the designated person. The robotic arm interacts with the user and assists when the user needs help or encounters danger. The bottom wheels realize the robot's movement and extend its range of activity.
As shown in fig. 2, the analysis and processing system in the data analysis and processing device mainly consists of an information processing subsystem, a danger perception subsystem, a speech processing subsystem, a digital human subsystem, and a robot control subsystem.
The information processing subsystem receives image data and transmits it to the digital human subsystem, the robot control subsystem, and the danger perception subsystem. In addition, it receives text data generated by the speech processing subsystem, the scene model generated by the robot control subsystem, and the human posture data generated by the danger perception subsystem, performs information fusion and decision making, and transmits control instructions to the robot control subsystem. The information processing subsystem comprises an image data processing module and an information fusion processing module:
the main function of the image data processing module is to perform preliminary processing of received image data with a deep convolutional network. In the robot setup stage, the module receives the relative's images required to build the digital human and transmits them to the digital human subsystem to build the digital human model. While the robot is running, the module classifies user images and environment images with the deep network and transmits them to the danger perception subsystem and the robot control subsystem, respectively, for processing;
the main function of the information fusion processing module is to analyze data from different modalities and make decisions. The module processes the text instructions generated by the speech processing subsystem with a deep network, combines them with the user's pose estimation data and the scene data, generates key spatial position information for the robot's movement, and transmits it to the robot control subsystem. If the user is in a dangerous situation, the module generates the motion key point information for the robot's response from the current user pose estimate and scene data and transmits it to the robot control subsystem.
The danger perception subsystem receives the user images processed by the information processing subsystem and performs pose estimation; it judges, together with the scene data, whether the user's posture and position are dangerous. If the user is in a dangerous state, the subsystem sends a robot control instruction to the robot control subsystem and transmits the information to the information processing subsystem. The danger perception subsystem comprises a pose estimation module and a danger processing module:
the main function of the pose estimation module is to estimate the user's posture from the input human body images. The module uses a deep-learning-based three-dimensional pose estimator. Because the user's posture must be detected in real time during use, the estimator uses a lightweight backbone network to improve its real-time performance. In addition, to handle occlusion during pose estimation, the module adopts a pose-estimation-priority and joint-redundancy strategy. The human posture data produced by the module is further processed by the danger processing module;
the main function of the danger processing module is to judge, from the input human pose estimate and the scene model, whether the user's posture and position are dangerous, so as to prevent injury from accidental falls or from entering a dangerous environment. If the user is in an emergency state, an alarm can be raised, handled through the robot control subsystem, and a danger signal is transmitted to the information processing subsystem.
The speech processing subsystem receives and processes the user's voice data, generates reply text and digital human action/expression text, and transmits them to the digital human subsystem. In addition, if the user's speech contains a robot control instruction, the converted robot-control text is transmitted to the information processing subsystem for further processing. The speech processing subsystem comprises a speech recognition module and an LLM module:
the main function of the speech recognition module is to analyze the input voice data. The voice data is converted into text with deep-learning-based speech recognition, and the text is passed to the LLM module for further processing;
the main functions of the LLM module are to generate a reply text from the input text with a large-scale language model and, as needed, to generate text describing the action and expression that accompany the reply. This module is the core of the whole emotion communication system; content generation and conversion are carried out by the large-scale language model.
The digital human subsystem receives image data, builds the digital human model, generates digital human motion from the text produced by the speech processing subsystem, renders it, and outputs video for display. In addition, the subsystem synthesizes the received reply text into speech with the preset voice characteristics of the user's relative and plays it. The digital human subsystem comprises a model building module, an action generation module, an expression driving module, a model rendering module, and a speech generation module:
the main function of the model building module is to generate the digital human figure. The module takes images of the subject from multiple angles, obtains the spatial positions and colors of feature points on the body surface from those images, and fuses them into a three-dimensional model of the human body. OpenPose is used to detect human keypoints on the reconstructed model, and a 3D joint-driven model of the real human body is built with the SMPL method. The generated model is transmitted to the expression driving module and the action generation module for further processing;
the main function of the action generation module is to generate the joint motion of the digital human model from an action description text. The module takes the action text produced by the speech processing subsystem, feeds it to a Human-Motion-Diffusion model, and combines the result with the existing real-body model to generate the motion;
the main function of the expression driving module is to generate the digital human expression corresponding to an expression text. Specifically, expression-muscle parameters are first identified on the existing 3D model, converting the person's facial data into muscle parameters. On that basis, the expression text produced by the speech processing subsystem is analyzed to generate the corresponding expression parameters, and the facial keypoints of the model are driven by the new parameters, realizing the conversion from text to expression;
the main function of the model rendering module is to integrate the body motion produced by the action generation module and the expression driving module and to perform optimized rendering with UE5, so that the digital human's actions and expressions are presented smoothly. The module outputs the rendered result to the robot's display;
the main function of the speech generation module is to synthesize speech in a designated person's voice from text. Before use, a segment of Chinese speech from the designated person is collected as a training set and its timbre features are extracted; a voice-cloning synthesis model is then trained with those features. Finally, the trained voice-cloning model generates the designated person's voice from the conversation text produced by the speech processing subsystem. The module outputs the synthesized speech, which is played through the robot's loudspeaker to reply to the user.
The robot control subsystem receives control instructions from the information processing subsystem and the danger perception subsystem and generates the control information that drives the robot, including its movement and the collaboration of the robotic arm. While the robot moves, the subsystem models the scene and transmits the scene model to the information processing subsystem for further analysis. The robot control subsystem comprises a navigation and scene modeling module, a human-robot collaboration module, and a motion control module:
the main functions of the navigation and scene modeling module are to control the robot's direction of movement and to model the scene. The module relies on VSLAM: it extracts and analyzes feature points from the input monocular images, uses them to localize the robot as it moves, and builds the scene model from the acquired images. The module receives movement-target information from the information processing subsystem and the danger perception subsystem and transmits the robot's trajectory information to the motion control module;
the main function of the human-robot collaboration module is to help the robot interact and cooperate with the user in the physical environment. Using the robot operation key point information generated by the information processing subsystem, the module plans the motion trajectory of the robotic arm and transmits it to the motion control module for processing;
the main function of the motion control module is to generate the robot's control signals. From the trajectory data of the human-robot collaboration module and the navigation and scene modeling module, the motion control module generates and outputs the corresponding robot control signals.
Fig. 3 shows a functional schematic of the invention. When no other family member is present, the elderly user can converse with the robot or ask for help by voice. The robot converts the recognized speech into text and replies to it using artificial intelligence content generation techniques (for example, a large-scale language model). The generated text includes both the conversational reply and a text description of the digital human's actions and expressions. If the user is asking for help, the artificial intelligence content generation technology assists the interactive collaboration decision, and the robot's arm and wheels are used to help the user. The reply text is synthesized into a voice resembling that of the user's designated relative with speech synthesis. The digital human image of the relative is shown on the robot's display and makes the corresponding actions and expressions along with the speech. In addition, the robot estimates the user's pose and state in real time and detects danger.
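The sketch below ties the steps of fig. 3 together as one schematic interaction loop. Every helper function and the Robot class are stubs standing in for the modules described above; their names, inputs, and outputs are assumptions made for illustration only.

```python
# --- placeholder stand-ins for the modules described above -------------------
def speech_recognition(audio):           return "我有点孤单"           # ASR stub
def llm_generate(text):                  return {"reply": "别担心，我陪着你。",
                                                 "action": "lean forward",
                                                 "expression": "smile"}
def estimate_pose(frame):                return {"fallen": False, "xy": (2.0, 1.5)}
def render_digital_human(action, expr):  return f"<video:{action}/{expr}>"
def clone_voice(text):                   return f"<audio:{text}>"

class Robot:
    def move_to(self, xy):  print("moving to", xy)
    def show(self, video):  print("display:", video)
    def play(self, audio):  print("speaker:", audio)

def interaction_step(audio_frame, camera_frame, robot):
    """One pass through the pipeline of fig. 3: listen, reply, animate, watch for danger."""
    text = speech_recognition(audio_frame)            # speech recognition module
    reply = llm_generate(text)                        # LLM module
    pose = estimate_pose(camera_frame)                # pose estimation module
    if pose["fallen"]:
        robot.move_to(pose["xy"])                     # assist or raise an alarm
    robot.show(render_digital_human(reply["action"], reply["expression"]))
    robot.play(clone_voice(reply["reply"]))           # speech generation module

interaction_step(audio_frame=None, camera_frame=None, robot=Robot())
```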
The embodiments described in this specification are merely examples of implementation forms of the inventive concept. The scope of protection of the invention should not be construed as being limited to the specific forms described, but also covers equivalent technical means that those skilled in the art can conceive based on the inventive concept.

Claims (3)

1. The emotion communication companion and care robot system based on artificial intelligence generated content comprises an information acquisition device, a data analysis and processing device, and a human-machine interaction and collaboration device;
the information acquisition device acquires environmental data through sensors, including a monocular/binocular camera and a microphone; the camera obtains scene information around the robot, including images of the user and the scene; the microphone collects the user's voice; the information acquisition device sends the acquired voice and scene data to the data analysis and processing device for processing;
the human-machine interaction and collaboration device interacts with people and is the execution stage for the results of the data analysis and processing device; it comprises a display, a loudspeaker, a robotic arm, and the robot's bottom motion mechanism; the device executes the processing results of the data analysis and processing device: the display shows the digital human image of the designated relative; the loudspeaker plays synthesized speech in the designated relative's voice; the robotic arm carries out interactive collaboration with the user or executes spoken interaction instructions; and the robot has autonomous and following movement functions;
the data analysis and processing device analyzes the data collected by the sensors and controls the human-machine interaction device so that it performs the designed functions; the device consists of a data analysis and processing system running on the robot's onboard host; the system consists of an information processing subsystem, a speech processing subsystem, a digital human subsystem, a danger perception subsystem, and a robot control subsystem;
the information processing subsystem receives image data and transmits it to the digital human subsystem, the robot control subsystem, and the danger perception subsystem; in addition, it receives text data generated by the speech processing subsystem, the scene model generated by the robot control subsystem, and the human posture data generated by the danger perception subsystem, performs information fusion and decision making, and transmits control instructions to the robot control subsystem; the information processing subsystem comprises an image data processing module and an information fusion processing module:
the image data processing module performs preliminary processing of received image data with a deep convolutional network; in the robot setup stage, the module receives the relative's images required to build the digital human and transmits them to the digital human subsystem to build the digital human model; while the robot is running, the module classifies user images and environment images with the deep network and transmits them to the danger perception subsystem and the robot control subsystem, respectively, for processing;
the information fusion processing module analyzes data from different modalities and makes decisions; it processes the text instructions generated by the speech processing subsystem with a deep network, combines them with the user's pose estimation data and the scene data, generates key spatial position information for the robot's movement, and transmits it to the robot control subsystem; if the user is in a dangerous situation, the module generates the motion key point information for the robot's response from the current user pose estimate and scene data and transmits it to the robot control subsystem;
the danger perception subsystem receives the user images processed by the information processing subsystem and performs pose estimation, and judges from the scene data whether the user's posture and position are dangerous; if the user is in a dangerous state, the subsystem sends a robot control instruction to the robot control subsystem and transmits the information to the information processing subsystem; the danger perception subsystem comprises a pose estimation module and a danger processing module:
the pose estimation module estimates the user's posture from the input human body images; it uses a deep-learning-based three-dimensional pose estimator; because the user's posture must be detected in real time during use, the estimator uses a lightweight backbone network to improve its real-time performance; in addition, to handle occlusion during pose estimation, the module adopts a pose-estimation-priority and joint-redundancy strategy; the human posture data produced by the module is further processed by the danger processing module;
the danger processing module judges from the input human pose estimate and the scene model whether the user's posture and position are dangerous, so as to prevent injury from accidental falls or from entering a dangerous environment; if the user is in an emergency state, an alarm can be raised, handled through the robot control subsystem, and a danger signal is transmitted to the information processing subsystem;
the speech processing subsystem receives and processes the user's voice data, generates reply text and digital human action/expression text, and transmits them to the digital human subsystem; in addition, if the user's speech contains a robot control instruction, the converted robot-control text is transmitted to the information processing subsystem for further processing; the speech processing subsystem comprises a speech recognition module and an LLM module:
the speech recognition module analyzes the input voice data; the voice data is converted into text with deep-learning-based speech recognition, and the text is passed to the LLM module for further processing;
the LLM module generates a reply text from the input text with a large-scale language model and, as needed, generates text describing the action and expression that accompany the reply; this module is the core of the whole emotion communication system, and content generation and conversion are carried out by the large-scale language model;
the digital human subsystem receives image data, builds the digital human model, generates digital human motion from the text produced by the speech processing subsystem, renders it, and outputs video for display; in addition, the subsystem synthesizes the received reply text into speech with the preset voice characteristics of the user's relative and plays it; the digital human subsystem comprises a model building module, an action generation module, an expression driving module, a model rendering module, and a speech generation module:
the model building module generates the digital human figure; it takes images of the subject from multiple angles, obtains the spatial positions and colors of feature points on the body surface from those images, and fuses them into a three-dimensional model of the human body; OpenPose is used to detect human keypoints on the reconstructed model, and a 3D joint-driven model of the real human body is built with the SMPL method; the generated model is transmitted to the expression driving module and the action generation module for further processing;
the action generation module generates the joint motion of the digital human model from the action description text; it takes the action text produced by the speech processing subsystem, feeds it to a Human-Motion-Diffusion model, and combines the result with the existing real-body model to generate the motion;
the expression driving module generates the digital human expression corresponding to the expression text; specifically, expression-muscle parameters are first identified on the existing 3D model, converting the person's facial data into muscle parameters; on that basis, the expression text produced by the speech processing subsystem is analyzed to generate the corresponding expression parameters, and the facial keypoints of the model are driven by the new parameters, realizing the conversion from text to expression;
the model rendering module integrates the body motion produced by the action generation module and the expression driving module and performs optimized rendering with an engine such as UE5, so that the digital human's actions and expressions are presented smoothly; the module outputs the rendered result to the robot's display;
the speech generation module synthesizes speech in a designated person's voice from text; before use, a segment of Chinese speech from the designated person is collected as a training set and its timbre features are extracted; a voice-cloning synthesis model is trained with those features; finally, the trained voice-cloning model generates the designated person's voice from the conversation text produced by the speech processing subsystem; the module outputs the synthesized speech, which is played through the robot's loudspeaker to reply to the user;
the robot control subsystem receives control instructions from the information processing subsystem and the danger perception subsystem and generates the control information that drives the robot, including its movement and the collaboration of the robotic arm; while the robot moves, the subsystem models the scene and transmits the scene model to the information processing subsystem for further analysis; the robot control subsystem comprises a navigation and scene modeling module, a human-robot collaboration module, and a motion control module:
the navigation and scene modeling module controls the robot's direction of movement and models the scene; it relies on VSLAM: it extracts and analyzes feature points from the input monocular images, uses them to localize the robot as it moves, and builds the scene model from the acquired images; the module receives movement-target information from the information processing subsystem and the danger perception subsystem and transmits the robot's trajectory information to the motion control module;
the human-robot collaboration module helps the robot interact and cooperate with the user in the physical environment; using the robot operation key point information generated by the information processing subsystem, it plans the motion trajectory of the robotic arm and transmits it to the motion control module for processing;
the motion control module generates the robot's control signals; from the trajectory data of the human-robot collaboration module and the navigation and scene modeling module, it generates and outputs the corresponding robot control signals.
2. The emotion communication companion and care robot system according to claim 1, wherein: artificial intelligence content generation technology is applied to the voice interaction and control of the elderly-care robot; to address the lack of emotional communication in existing care robots, language communication between the elderly user and the care robot is realized with a large-scale language model and speech synthesis; in addition, combined with digital human technology, the robot presents the concrete image of a relative; and as the conversation proceeds, the digital human's actions and expressions change with the dialogue content, meeting the emotional communication needs of the elderly.
3. The emotion communication companion and care robot system according to claim 1, wherein: artificial intelligence content generation technology is applied to the presentation of the digital human, making its actions and expressions more realistic.
CN202310274857.1A 2023-03-15 2023-03-15 Emotion communication accompanying and nursing robot system based on artificial intelligence generated content Pending CN116572260A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310274857.1A CN116572260A (en) 2023-03-15 2023-03-15 Emotion communication accompanying and nursing robot system based on artificial intelligence generated content

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310274857.1A CN116572260A (en) 2023-03-15 2023-03-15 Emotion communication accompanying and nursing robot system based on artificial intelligence generated content

Publications (1)

Publication Number Publication Date
CN116572260A (en) 2023-08-11

Family

ID=87532850

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310274857.1A Pending CN116572260A (en) 2023-03-15 2023-03-15 Emotion communication accompanying and nursing robot system based on artificial intelligence generated content

Country Status (1)

Country Link
CN (1) CN116572260A (en)


Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116883608A (en) * 2023-09-05 2023-10-13 武汉纺织大学 Multi-mode digital person social attribute control method and related device
CN116883608B (en) * 2023-09-05 2023-12-12 武汉纺织大学 Multi-mode digital person social attribute control method and related device


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination