CN111435268A - Human-computer interaction method based on image recognition and reconstruction and system and device using same - Google Patents

Human-computer interaction method based on image recognition and reconstruction and system and device using same

Info

Publication number
CN111435268A
Authority
CN
China
Prior art keywords: motion, voice, standard, model, video sequence
Prior art date
Legal status: Pending
Application number
CN201910027460.6A
Other languages
Chinese (zh)
Inventor
梅俊峰
Current Assignee
Hefei Honghuida Technology Co ltd
Original Assignee
Hefei Honghuida Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hefei Honghuida Technology Co ltd
Priority to CN201910027460.6A
Publication of CN111435268A
Status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/0005Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • B25J11/001Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means with emotions simulating means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00Scenes; Scene-specific elements
    • G06V20/40Scenes; Scene-specific elements in video content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/168Feature extraction; Face representation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L2015/223Execution procedure of a spoken command

Abstract

The invention provides a human-computer interaction method based on image recognition and reconstruction, and a system and a device using the same. The method collects and analyzes a video sequence from a real scene or video and reconstructs a motion model and a voice model for a robot from the collected and analyzed data. Control data are then matched to each motion model and voice model, and a structural data frame is constructed between interaction information and the control data. When the robot later detects interaction information in a real scene, it completes the interaction with the person in that scene under the matching control of the interaction information and the control data, thereby realizing human-computer interaction.

Description

Human-computer interaction method based on image recognition and reconstruction and system and device using same
This application claims priority from U.S. Provisional Patent Application No. 62/743,682, filed in October 2018, which is incorporated herein by reference.
Technical Field
The invention relates to the technical field of image capture, recognition and reconstruction, and in particular to a human-computer interaction method based on image recognition and reconstruction, and a system and a device using the same.
Background
With the continuous development of information recognition technology, a variety of interactive machine devices have emerged, gradually enriching people's cultural life.
However, when an existing interactive machine device recognizes a real scene, it merely imitates the facial expressions and gestures in that scene. In other words, existing interactive machine devices only simulate interaction with the real scene; they cannot recognize the information in the scene and match it to corresponding interactive content. The degree of interaction offered by interactive machine devices in the prior art is therefore low.
In view of the above, it is necessary to provide a new human-computer interaction method based on image recognition and reconstruction to solve the above problems.
Disclosure of Invention
The invention aims to provide a human-computer interaction method based on image recognition and reconstruction, and a system and a device using the same. The method recognizes and reconstructs the image and voice information in a video sequence to establish an interaction database, and then calls and matches models in the interaction database by recognizing image, motion or sound information in a real scene, thereby realizing interaction between the robot and the real scene.
In order to achieve the above object, the present invention provides a human-computer interaction method based on image recognition and reconstruction, comprising:
S1, collecting a standard video sequence of a standard user, characterizing and demodulating the standard video sequence, and acquiring and defining a motion model and a voice model corresponding to the standard video sequence;
S2, extracting motion characteristic information and voice characteristic information from the motion model and the voice model, respectively, according to the time sequence of the standard video sequence;
S3, detecting a personalized video sequence of a target user, characterizing the personalized video sequence according to its time sequence, and acquiring motion personality information and voice personality information of the target user, respectively;
and S4, matching the motion characteristic information with the motion personality information and the voice characteristic information with the voice personality information, so as to call the corresponding motion model and voice model to directly or indirectly control the robot to display and/or move, thereby realizing interaction between the target user and the robot.
As a further improvement of the present invention, the step S1 specifically includes:
S11, collecting a standard video sequence of a standard user, analyzing the standard video sequence, and acquiring a standard image sequence and a standard voice sequence corresponding to the standard video sequence, wherein the standard image sequence comprises a plurality of standard image frames arranged in time order;
S12, characterizing and demodulating each standard image frame, defining a number of key points in the standard image frames, and marking the key points in each standard image frame;
S13, determining the displacement track of each key point according to the change of its coordinates in a two-dimensional plane across the different standard image frames;
S14, determining the rotation track of each key point according to the change of its angle in three-dimensional space across the different standard image frames;
S15, matching the displacement tracks and rotation tracks of the key points according to the time sequence of the standard video sequence to construct the motion model corresponding to the standard video sequence;
and S16, characterizing and demodulating the standard voice sequence according to the time sequence of the standard video sequence, defining an audio marker of the standard voice sequence at each time step, and matching the audio marker with the standard image frame of the corresponding time step to construct the voice model corresponding to the standard video sequence.
As a further improvement of the present invention, the motion model includes an expression model and an action model; the expression model is used for reconstructing a facial image of the robot and controlling the robot to generate corresponding expression changes, and the action model is used for controlling the robot to generate corresponding action/posture changes.
As a further improvement of the present invention, the step S2 specifically includes: extracting the motion characteristic information of the motion model and the voice characteristic information of the voice model according to the time sequence of the standard video sequence, wherein the motion characteristic information is used for controlling the robot to generate corresponding interactive actions and comprises limb motion characteristic points, limb motion units, facial expression characteristic points and expression motion units; the voice characteristic information is used for controlling the robot to generate corresponding interactive sound and comprises pitch, timbre and the acoustic signal characteristics of phonemes that vary over time.
As a further improvement of the invention, the human-computer interaction method further comprises establishing a database, wherein establishing the database at least comprises constructing a structural data frame between the motion model and the robot and a structural data frame between the voice model and the robot, and the motion model and the voice model are both stored in the database.
In order to achieve the above object, the present invention provides a human-computer interaction system based on image recognition and reconstruction, comprising:
the video acquisition processing unit comprises a video acquisition module and a video processing module, wherein the video acquisition module is used for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user; the video processing module is used for characterizing and demodulating the standard video sequence and the personalized video sequence;
the motion control unit is electrically connected with the video acquisition and processing unit and comprises a motion track extraction module and a motion state fitting module, wherein the motion track extraction module is used for acquiring the displacement track and the rotation track of each key point in a standard video sequence; the motion state fitting module is used for constructing a motion model corresponding to the standard video sequence;
the voice synthesis unit is respectively electrically connected with the video acquisition processing unit and the motion control unit and comprises a voice extraction module and an audio reconstruction module, wherein the voice extraction module is used for extracting audio information in a standard video sequence according to the time sequence of the standard video sequence; the audio reconstruction module is used for reconstructing a voice model according to the time sequence of a standard video sequence;
the feature point matching unit is electrically connected with the video acquisition and processing unit, the motion control unit and the voice synthesis unit respectively, and comprises a motion feature matching module and a voice feature matching module, wherein the motion feature matching module is used for matching motion personality information of a target user with a motion model so as to generate a corresponding motion control instruction; the voice characteristic matching module is used for matching the voice personality information of the target user with the voice model so as to generate a corresponding voice control instruction;
and the behavior execution unit is electrically connected with the characteristic point matching unit and is used for receiving the motion control instruction and/or the voice control instruction sent by the characteristic point matching unit so as to interact with a target user.
As a further improvement of the invention, the motion control unit further comprises an expression reconstruction module, the expression reconstruction module comprises an expression fitting module and an expression driving module, and the expression fitting module is used for fitting and reconstructing a facial expression model of the robot according to each key point in a standard video sequence; the expression driving module is used for driving the facial expression model to generate corresponding expressions according to the displacement tracks of the key points in the standard video sequence.
As a further improvement of the present invention, the human-computer interaction system further includes a storage unit, and the storage unit is electrically connected to the motion control unit and the speech synthesis unit respectively, so as to store the motion model, the speech model, and the facial expression model.
In order to achieve the above object, the present invention further provides an interactive device based on image recognition and reconstruction. The interactive device is a robot comprising a main body, a head movably connected to the main body, and a trunk connected to the main body. The robot further comprises a video acquisition module for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user, and for characterizing and demodulating the standard video sequence and the personalized video sequence;
the model reconstruction module is used for constructing a motion model and a voice model according to a standard video sequence;
and the data processing module is used for matching the motion model and the voice model with the personalized video sequence and generating a corresponding control instruction so as to control the robot to perform corresponding display/action.
To achieve the above object, the present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the aforementioned human-computer interaction method based on image recognition and reconstruction.
The invention has the following beneficial effects: the human-computer interaction method based on image recognition and reconstruction can collect and analyze a video sequence from a real scene or video, reconstruct a motion model and a voice model for the robot from the collected and analyzed data, match control data to each motion model and voice model, and construct a structural data frame between interaction information and control data. After the robot detects interaction information in a real scene, it completes the interaction with the person in that scene under the matching control of the interaction information and the control data, thereby realizing human-computer interaction.
Drawings
FIG. 1 is a flow chart of a human-computer interaction method based on image recognition and reconstruction according to the present invention.
Fig. 2 is a flowchart of step S1 in fig. 1.
FIG. 3 is a block diagram of the human-computer interaction system based on image recognition and reconstruction according to the present invention.
Fig. 4 is a block schematic diagram of the motion control unit of fig. 3.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in detail with reference to the accompanying drawings and specific embodiments.
Referring to fig. 1, a human-computer interaction method 100 based on image recognition and reconstruction according to the present invention includes:
S1, collecting a standard video sequence of a standard user, characterizing and demodulating the standard video sequence, and acquiring and defining a motion model and a voice model corresponding to the standard video sequence;
S2, extracting motion characteristic information and voice characteristic information from the motion model and the voice model, respectively, according to the time sequence of the standard video sequence;
S3, detecting a personalized video sequence of a target user, characterizing the personalized video sequence according to its time sequence, and acquiring motion personality information and voice personality information of the target user, respectively;
and S4, matching the motion characteristic information with the motion personality information and the voice characteristic information with the voice personality information, so as to call the corresponding motion model and voice model to directly or indirectly control the robot to display and/or move, thereby realizing interaction between the target user and the robot.
The following description section will describe the human-computer interaction method 100 based on image recognition and reconstruction in detail.
Referring to fig. 2, step S1 specifically includes:
S11, collecting a standard video sequence of a standard user, analyzing the standard video sequence, and acquiring a standard image sequence and a standard voice sequence corresponding to the standard video sequence, wherein the standard image sequence comprises a plurality of standard image frames arranged in time order;
S12, characterizing and demodulating each standard image frame, defining a number of key points in the standard image frames, and marking the key points in each standard image frame;
S13, determining the displacement track of each key point according to the change of its coordinates in a two-dimensional plane across the different standard image frames;
S14, determining the rotation track of each key point according to the change of its angle in three-dimensional space across the different standard image frames;
S15, matching the displacement tracks and rotation tracks of the key points according to the time sequence of the standard video sequence to construct the motion model corresponding to the standard video sequence;
and S16, characterizing and demodulating the standard voice sequence according to the time sequence of the standard video sequence, defining an audio marker of the standard voice sequence at each time step, and matching the audio marker with the standard image frame of the corresponding time step to construct the voice model corresponding to the standard video sequence.
Step S1 is mainly performed by an image acquisition unit, which includes an image acquisition device and an image analysis device. In step S11, the standard image sequence is captured by the image acquisition device, which obtains or shoots a local or whole-body standard image sequence of the standard user; the standard image sequence may be acquired directly by the image acquisition device or may be a portion of video information captured from a network or from a video. The standard image sequence is composed of a plurality of standard image frames arranged in time order. The image analysis device characterizes and demodulates the standard image frames according to the time sequence so as to capture the motion information nodes and voice information nodes in each standard image frame, fits the motion information nodes into the standard image sequence in time order, and fits the voice information nodes into the standard voice sequence.
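By way of non-limiting illustration, the decomposition of a standard video sequence into a time-ordered standard image sequence may be sketched as follows; the use of the OpenCV capture API is an assumption, and audio extraction is only indicated by a comment, since it would typically be delegated to an external tool such as ffmpeg.

```python
# Minimal sketch (assumption): split a standard video sequence into
# time-stamped standard image frames using OpenCV. Audio extraction is not
# handled here; it would typically be done with an external tool
# (e.g. "ffmpeg -i video.mp4 -vn audio.wav").
import cv2

def decompose_video(path):
    """Return a list of (timestamp_seconds, frame) pairs in time order."""
    cap = cv2.VideoCapture(path)
    fps = cap.get(cv2.CAP_PROP_FPS) or 30.0   # fall back if FPS is unknown
    frames = []
    index = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        frames.append((index / fps, frame))   # keep each frame with its time stamp
        index += 1
    cap.release()
    return frames
```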
In step S12, the image analysis device characterizes and demodulates each standard image frame so as to define a number of key points in the standard image frames and mark those key points in each standard image frame. Specifically, in the present invention, the key points include image points of joint/muscle movement nodes, the eyes, the mouth and nose, the eyebrows, the facial contour, and the like in the standard image frames.
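A minimal sketch of the marked key points per frame is given below; the key-point names and the detector are placeholders, since the patent does not name a specific landmark-detection algorithm.

```python
# Minimal sketch (assumption): represent the key points marked in each
# standard image frame. The detector is a placeholder; the patent does not
# specify a particular landmark detector.
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class FrameKeypoints:
    timestamp: float                                   # time of the frame in the sequence
    points: Dict[str, Tuple[float, float]] = field(default_factory=dict)

KEYPOINT_NAMES: List[str] = [
    "left_eye", "right_eye", "nose", "mouth", "left_eyebrow",
    "right_eyebrow", "chin", "left_shoulder", "right_shoulder",
]

def mark_keypoints(timestamp: float, frame) -> FrameKeypoints:
    """Placeholder: a real implementation would run a landmark detector on the frame."""
    detected = {name: (0.0, 0.0) for name in KEYPOINT_NAMES}  # dummy coordinates
    return FrameKeypoints(timestamp=timestamp, points=detected)
```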
In steps S13 to S15, the image analysis device determines the displacement track and the rotation track of each key point according to the change of its coordinates in the two-dimensional plane and of its angle in three-dimensional space across the different standard image frames, and constructs the motion model corresponding to the standard video sequence from the key points, the displacement tracks and the rotation tracks.
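As an illustrative sketch only, the displacement track of one key point can be computed from its per-frame 2D coordinates as follows; the `FrameKeypoints` structure from the sketch above is an assumption.

```python
# Minimal sketch (assumption): compute the displacement track of a key point
# from its 2D coordinates in consecutive standard image frames.
import numpy as np

def displacement_track(frames, name):
    """Return an array of (dt, dx, dy) steps for the key point `name`."""
    coords = np.array([f.points[name] for f in frames], dtype=float)   # (N, 2)
    times = np.array([f.timestamp for f in frames], dtype=float)       # (N,)
    steps = np.diff(coords, axis=0)                                    # per-frame displacement
    dts = np.diff(times)                                               # per-frame time gaps
    return np.column_stack([dts, steps])
```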
In the invention, the motion model comprises an expression model and an action model. The expression model is used for reconstructing a facial image of the robot and controlling the robot to generate corresponding expression changes. Specifically, the expression model is reconstructed by an image rendering device. In one embodiment of the invention, the image rendering device is configured to detect the key points of the face of the standard user: it defines a face rectangle of the standard user, gives the initial marked positions of the key points within the face rectangle, and then obtains the expression model of the standard user through recursive shape fitting, so as to control the robot to generate corresponding facial expressions. In the invention, the expression model may be a 3D realistic human face model or a cartoon image model.
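A minimal sketch of such recursive shape fitting is given below, assuming a cascaded-regression style of refinement; the stage regressors are random placeholders only to make the sketch runnable, and a real system would learn them from annotated faces.

```python
# Minimal sketch (assumption): recursive/cascaded shape fitting. Starting from
# a mean face shape placed in the face rectangle, each stage adds a regressed
# increment. The stage regressors below are random placeholders.
import numpy as np

def fit_expression_shape(mean_shape, face_rect, stage_regressors, features):
    """mean_shape: (K, 2) normalized landmarks in [0, 1]; face_rect: (x, y, w, h)."""
    x, y, w, h = face_rect
    shape = mean_shape * np.array([w, h]) + np.array([x, y])   # initial marked positions
    for regress in stage_regressors:                           # recursive refinement
        shape = shape + regress(shape, features)
    return shape

# Placeholder stage regressors: small random updates (illustration only).
rng = np.random.default_rng(0)
stages = [lambda s, f: rng.normal(scale=0.5, size=s.shape) for _ in range(3)]
fitted = fit_expression_shape(np.full((68, 2), 0.5), (100, 80, 200, 220), stages, features=None)
```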
The action model is used for controlling the robot to generate corresponding action/posture changes; the following description takes the head of a standard user as an example. Specifically, the head action model is mainly obtained by fitting in an action construction device. In the invention, the action construction device receives the angle change of each key point of the standard image frames in three-dimensional space, where the angle change of a key point in three-dimensional space at least comprises a pitch-angle change and a yaw-angle change.
Specifically, when the head of a human face rotates in the standard image sequence, the relative positions of the facial key points change across the different standard image frames as the head rotates. The action construction device acquires the position information of the head key points at the first time step and establishes three-dimensional coordinates for the head of the standard user; it then obtains, by fitting, the yaw-angle value and the pitch-angle value of the head in three-dimensional space from the position changes of the head key points at subsequent time steps. Furthermore, the action construction device may also define the movement speed of the head of the standard user along each coordinate axis of the three-dimensional coordinates, so as to finally obtain the action model of the head of the standard user by fitting. Finally, the action construction device fits the expression model and the head action model of the standard user according to the time sequence of the standard video sequence to define the head movement model of the standard user, where the definition of the head movement model at least comprises the meaning of the movement model and the corresponding control structure data frame.
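By way of a non-limiting sketch, the yaw and pitch angles between two frames can be fitted from 3D head key points with the Kabsch algorithm; the Euler-angle axis convention (x lateral, y vertical, z depth) is an assumption, as the patent does not specify one.

```python
# Minimal sketch (assumption): fit the head rotation between two frames from
# 3D key-point positions with the Kabsch algorithm, then read off yaw and
# pitch. Axis convention (x lateral, y vertical/up, z depth) is an assumption.
import numpy as np

def head_yaw_pitch(points_t0, points_t1):
    """points_t0, points_t1: (K, 3) arrays of the same head key points."""
    p = points_t0 - points_t0.mean(axis=0)
    q = points_t1 - points_t1.mean(axis=0)
    u, _, vt = np.linalg.svd(p.T @ q)                 # Kabsch: H = P^T Q
    d = np.sign(np.linalg.det(vt.T @ u.T))
    r = vt.T @ np.diag([1.0, 1.0, d]) @ u.T           # rotation with R @ p ~ q
    pitch = np.arctan2(r[2, 1], r[2, 2])              # rotation about the x (lateral) axis
    yaw = -np.arcsin(np.clip(r[2, 0], -1.0, 1.0))     # rotation about the y (vertical) axis
    return yaw, pitch

def head_velocity(center_t0, center_t1, dt):
    """Per-axis movement speed of the head between two time steps."""
    return (np.asarray(center_t1) - np.asarray(center_t0)) / dt
```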
The step S2 specifically includes: extracting the motion characteristic information of the motion model and the voice characteristic information of the voice model according to the time sequence of the standard video sequence. The motion characteristic information is used for controlling the robot to generate corresponding interactive actions. In the invention, the motion characteristic information comprises limb motion characteristic points, limb motion units, facial expression characteristic points and expression motion units, and at least part of it consists of key points of the motion model, so that motion interaction information can be matched with the motion model; in this way, the corresponding motion model is called once the interaction information is detected, and the robot is controlled to generate the corresponding motion according to the motion characteristic information.
The voice characteristic information is used for controlling the robot to generate corresponding interactive sound. Specifically, the voice characteristic information comprises pitch, timbre and the acoustic signal characteristics of phonemes that vary over time, and at least part of it is contained in the audio markers of the voice sequence, so that voice interaction information can be matched with the voice model. The voice interaction information consists of voice interaction information points arranged in time order; when these points match the audio markers arranged in time order, the voice interaction information can call the corresponding voice model, which is played back by the robot in time order and matched with the corresponding motion model.
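A minimal sketch of deriving simple per-frame audio markers aligned with the image-frame timestamps is given below; the choice of short-time energy and zero-crossing rate is an assumption, since the patent only requires markers that encode time-varying acoustic characteristics.

```python
# Minimal sketch (assumption): derive a simple audio marker (short-time energy
# and zero-crossing rate) for each video-frame interval, so the markers stay
# aligned with the standard image frames in time order.
import numpy as np

def audio_markers(samples, sample_rate, frame_times):
    """samples: 1-D audio signal; frame_times: sorted timestamps of image frames."""
    markers = []
    boundaries = list(frame_times) + [len(samples) / sample_rate]
    for start, end in zip(boundaries[:-1], boundaries[1:]):
        chunk = samples[int(start * sample_rate):int(end * sample_rate)]
        if len(chunk) == 0:
            markers.append((start, 0.0, 0.0))
            continue
        energy = float(np.mean(chunk.astype(float) ** 2))
        zcr = float(np.mean(np.abs(np.diff(np.sign(chunk)))) / 2.0)
        markers.append((start, energy, zcr))
    return markers
```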
Step S3 specifically includes: detecting a personalized video sequence of the target user, characterizing the personalized video sequence according to its time sequence, and acquiring the motion personality information and the voice personality information of the target user, respectively. Specifically, the personalized video sequence is acquired by the image acquisition device; in the invention, it may be image information acquired by the image acquisition device in real time or pre-recorded video information. The image analysis device characterizes the personalized video sequence so as to decompose it into motion personality information and voice personality information. The motion personality information comprises personality action characteristic points that mark and display the action characteristics of the target user and, in the invention, includes expression personality information and action personality information; the voice personality information comprises personality voice characteristic points that mark and display the voice characteristics of the target user. Both the motion personality information and the voice personality information can then be used as interaction information to be matched with the motion model and the voice model.
Step S4 specifically includes: matching the motion characteristic information with the motion personality information, and the voice characteristic information with the voice personality information, so as to call the corresponding motion model and voice model to directly or indirectly control the robot to display and/or move, thereby realizing interaction between the target user and the robot. Specifically, in the invention, matching the motion characteristic information with the motion personality information means matching the limb motion characteristic points, limb motion units, facial expression characteristic points and expression motion units of the motion characteristic information with the personality action characteristic points of the motion personality information; matching the voice characteristic information with the voice personality information means matching the acoustic signal characteristics with the personality voice characteristic points.
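As a non-limiting sketch, this matching step can be expressed as a similarity search over stored model feature vectors; the use of cosine similarity and of fixed-length feature vectors is an assumption not taken from the original disclosure.

```python
# Minimal sketch (assumption): match personality feature points against the
# feature information of stored models using cosine similarity and return the
# best-matching model id. The similarity measure is an illustrative choice.
import numpy as np

def match_model(personality_features, model_features):
    """personality_features: (D,) vector; model_features: {model_id: (D,) vector}."""
    p = np.asarray(personality_features, dtype=float)
    best_id, best_score = None, -1.0
    for model_id, feat in model_features.items():
        f = np.asarray(feat, dtype=float)
        denom = np.linalg.norm(p) * np.linalg.norm(f)
        score = float(p @ f / denom) if denom > 0 else 0.0
        if score > best_score:
            best_id, best_score = model_id, score
    return best_id, best_score
```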
It should be noted that the human-computer interaction method 100 based on image recognition and reconstruction further includes establishing a database. Establishing the database at least comprises constructing a structural data frame between the motion model and the robot and a structural data frame between the voice model and the robot; the structural data frame facilitates the matching between the personalized video sequence and the standard video sequence. The structural data frame also contains control data matched to each motion model and each voice model; according to the personalized video sequence, the control data can be used to call the corresponding motion model and/or voice model so as to drive the robot to generate the corresponding interaction. The database is also used to store the motion models and the voice models.
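A minimal sketch of such a structural data frame, as a mapping from a stored model to its control data, is given below; the field names and example entries are assumptions for illustration only.

```python
# Minimal sketch (assumption): a structural data frame that links each stored
# motion/voice model to the control data used to drive the robot. Field names
# and example entries are illustrative.
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class ControlData:
    joint_commands: List[str]      # actuator commands realizing the action
    audio_clip: str                # identifier of the audio to play back

structural_data_frame: Dict[str, ControlData] = {
    "wave_hand": ControlData(joint_commands=["arm_up", "arm_swing"], audio_clip="greeting"),
    "nod_head": ControlData(joint_commands=["head_pitch_down", "head_pitch_up"], audio_clip=""),
}

def control_for(model_id: str) -> ControlData:
    """Look up the control data matched to a motion/voice model."""
    return structural_data_frame[model_id]
```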
Referring to fig. 3, the present invention further provides a human-computer interaction system 200 for image-based recognition and reconstruction, comprising: the device comprises a video acquisition processing unit 1, a motion control unit 2, a voice synthesis unit 3, a feature point matching unit 4 and a behavior execution unit 5.
Specifically, the video acquisition processing unit 1 comprises a video acquisition module 11 and a video processing module 12. The video acquisition module 11 is used for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user; the video processing module 12 is used for characterizing and demodulating the standard video sequence and the personalized video sequence so as to separate their image information and voice information and arrange them in time order, thereby forming the standard image sequence and standard voice sequence used to define the robot motion, as well as the personalized image sequence and personalized voice sequence used for recognition and matching.
Referring to fig. 4 in combination with fig. 3, the motion control unit 2 is electrically connected to the video acquisition and processing unit 1, and includes a motion trajectory extraction module 21 and a motion state fitting module 22, where the motion trajectory extraction module 21 is configured to obtain a displacement trajectory and a rotation trajectory of each key point in the standard video sequence; the motion state fitting module 22 is configured to construct a motion model corresponding to the standard video sequence.
Further, the motion control unit 2 further includes an expression reconstruction module 23, the expression reconstruction module 23 includes an expression fitting module 231 and an expression driving module 232, and the expression fitting module 231 is configured to fit and reconstruct a facial expression model of the robot according to each key point in the standard video sequence; the expression driving module 232 is configured to drive the facial expression model to generate a corresponding expression according to the displacement trajectory of each key point in the standard video sequence.
The voice synthesis unit 3 is electrically connected with the video acquisition and processing unit 1, and comprises a voice extraction module 31 and an audio reconstruction module 32, wherein the voice extraction module 31 is used for extracting audio information in a standard voice sequence according to a time sequence of the standard video sequence; the audio reconstruction module 32 is used to reconstruct the speech model according to the time sequence of the standard video sequence.
The feature point matching unit 4 is electrically connected with the video acquisition and processing unit 1, the motion control unit 2 and the voice synthesis unit 3, respectively, and comprises a motion feature matching module 41 and a voice feature matching module 42. The motion feature matching module 41 is used for matching the motion personality information of the target user with the motion model to generate a corresponding motion control instruction; the voice feature matching module 42 is used for matching the voice personality information of the target user with the voice model to generate a corresponding voice control instruction. It should be noted that, in the present invention, the motion control instruction and the voice control instruction are both part of the structural data frame, which constitutes the control model required by the human-computer interaction system 200 based on image recognition and reconstruction for interaction.
The behavior execution unit 5 is electrically connected to the feature point matching unit 4 and is configured to receive the motion control instruction and/or the voice control instruction sent by the feature point matching unit 4 so as to interact with the target user. Further, the behavior execution unit 5 also controls the motion of the robot. Specifically, the behavior execution unit 5 may classify the motion control instructions into at least expression control instructions and action control instructions: the expression control instructions control the robot, through the structural data frame, to generate corresponding expression changes, and the action control instructions control the head/trunk of the robot, through the structural data frame, to generate corresponding actions.
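A minimal dispatch sketch for this classification is given below; the instruction format (prefixed command strings) is an assumption introduced only for illustration.

```python
# Minimal sketch (assumption): classify motion control instructions into
# expression and action instructions and dispatch them to the corresponding
# robot subsystems. The "expression:" / "action:" prefixes are illustrative.
def execute_instructions(instructions, expression_driver, action_driver):
    """instructions: iterable of strings; drivers: callables taking one command."""
    for command in instructions:
        if command.startswith("expression:"):
            expression_driver(command[len("expression:"):])
        elif command.startswith("action:"):
            action_driver(command[len("action:"):])
        else:
            raise ValueError(f"unclassified control instruction: {command!r}")

# Example usage with trivial drivers that only log the command.
execute_instructions(
    ["expression:smile", "action:head_pitch_down"],
    expression_driver=lambda c: print("expression ->", c),
    action_driver=lambda c: print("action ->", c),
)
```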
Further, the human-computer interaction system 200 based on image recognition and reconstruction further includes a storage unit 6, the storage unit 6 is electrically connected to the motion control unit 2, the voice synthesis unit 3, and the feature point matching unit 4 respectively to store the motion model, the voice model, and the facial expression model, and the feature point matching unit 4 can match and call the data stored in the storage unit 6.
The invention also provides an interactive device based on image recognition and reconstruction. In the invention, the interactive device is a robot comprising a main body, a head movably connected to the main body and a trunk connected to the main body. The robot further comprises a video acquisition module, a model reconstruction module and a data processing module. The video acquisition module is used for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user, and for characterizing and demodulating the standard video sequence and the personalized video sequence. Preferably, the video acquisition module of the present invention comprises an image acquisition device and an image analysis device, wherein the image acquisition device is housed in the head and the image analysis device is housed in the head or the trunk.
The model reconstruction module is electrically connected with the video acquisition module and housed in the trunk; it is used for constructing the motion model and the voice model according to the standard video sequence.
Specifically, the model reconstruction module further comprises a database and a graphics renderer. The database is used for storing the video sequences collected by the image acquisition device, including the standard video sequences and the personalized video sequences, which are analyzed to obtain the key points in the standard video sequences and the personality characteristic points in the personalized video sequences. The graphics renderer is used for rendering and generating the facial expression model according to the key points in the standard video sequence. It should be noted that the head is provided with a display screen for displaying and playing the facial expression model, which may be a 3D human facial expression model or an animated expression model.
The data processing module is used for matching the motion model and the voice model with the personalized video sequence and generating a corresponding control instruction so as to control the robot to perform the corresponding display/action.
In use, the robot first acquires a standard video sequence of a standard user through the video acquisition module and establishes the motion model and the voice model through big data/machine learning; the motion model and the voice model are time-series data obtained by fitting the continuous information points of the standard user over the corresponding time sequence. The model reconstruction module can then define the motion model and the voice model so as to fix their meanings, and motion models and voice models with the same meaning can be displayed/played simultaneously or independently.
The data processing module can then extract and define the marker information in the motion model and the voice model. In the invention, the marker information is analyzed and reconstructed into at least three different kinds of control data: expression marker information for driving expression changes, action marker information for driving the robot to act, and voice marker information for driving the robot to produce sound. The expression marker information and the action marker information are together constructed, in time order, into motion marker information for controlling the movement of the face and trunk of the robot. The voice marker information can be played independently, that is, it can be used to control the robot to produce sound on its own, or it can be matched with the motion marker information to control the robot to generate a corresponding interactive action.
Furthermore, the video acquisition module can acquire a personalized video sequence of a target user in the real environment and analyze it to obtain the corresponding personality motion information and/or personality voice information. Specifically, the personality motion information and/or personality voice information can be matched with the unclassified marker information in the data processing module, and the data processing module can classify the marker information according to the body position of the action it describes, so as to control different parts of the robot to change. The main body and the trunk are movably connected through a plurality of drive motors, which are electrically connected with the data processing module; the marker information classified by the data processing module is used to drive the motors at the different positions.
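A minimal sketch of routing classified marker information to the drive motors is given below; the body-part-to-motor mapping and the motor interface are assumptions for illustration only.

```python
# Minimal sketch (assumption): route classified marker information to the drive
# motors of the corresponding body part. The body-part-to-motor mapping is
# illustrative only.
from typing import Dict, List

MOTOR_CHANNELS: Dict[str, List[int]] = {
    "head": [0, 1],        # e.g. pitch and yaw motors of the head
    "trunk": [2, 3, 4],    # motors articulating the trunk
}

def drive_markers(classified_markers: Dict[str, List[float]], set_motor) -> None:
    """classified_markers: body part -> target values; set_motor(channel, value) drives one motor."""
    for body_part, targets in classified_markers.items():
        for channel, value in zip(MOTOR_CHANNELS[body_part], targets):
            set_motor(channel, value)

# Example usage with a stub motor interface.
drive_markers({"head": [0.2, -0.1]}, set_motor=lambda ch, v: print(f"motor {ch} -> {v}"))
```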
Finally, the video acquisition module acquires a personalized video sequence of the target user and analyzes it to extract the interaction information contained in the personalized video sequence.
The present invention further provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the human-computer interaction method 100 based on image recognition and reconstruction of the present invention; that is, in the present invention, the human-computer interaction method 100 can be stored in the computer-readable storage medium in the form of a computer program. Based on this understanding, all or part of the technical solution of the present invention, in essence or as its contribution over the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the methods of the embodiments of the present invention. The computer-readable storage medium includes a USB flash drive, a removable hard disk, an optical disk, or other media that can store program code.
In summary, the human-computer interaction method 100 based on image recognition and reconstruction of the present invention can collect and analyze a video sequence from a real scene or video, reconstruct a motion model and a voice model for the robot from the collected and analyzed data, match control data to each motion model and voice model, and construct a structural data frame between interaction information and control data. When the robot detects interaction information in a real scene, it completes the interaction with the person in that scene under the matching control of the interaction information and the control data, thereby realizing human-computer interaction.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (10)

1. A human-computer interaction method based on image recognition and reconstruction comprises the following steps:
S1, collecting a standard video sequence of a standard user, characterizing and demodulating the standard video sequence, and acquiring and defining a motion model and a voice model corresponding to the standard video sequence;
S2, extracting motion characteristic information and voice characteristic information from the motion model and the voice model, respectively, according to the time sequence of the standard video sequence;
S3, detecting a personalized video sequence of a target user, characterizing the personalized video sequence according to its time sequence, and acquiring motion personality information and voice personality information of the target user, respectively;
and S4, matching the motion characteristic information with the motion personality information and the voice characteristic information with the voice personality information, so as to call the corresponding motion model and voice model to directly or indirectly control the robot to display and/or move, thereby realizing interaction between the target user and the robot.
2. The human-computer interaction method based on image recognition and reconstruction as claimed in claim 1, wherein the step S1 specifically comprises:
S11, collecting a standard video sequence of a standard user, analyzing the standard video sequence, and acquiring a standard image sequence and a standard voice sequence corresponding to the standard video sequence, wherein the standard image sequence comprises a plurality of standard image frames arranged in time order;
S12, characterizing and demodulating each standard image frame, defining a number of key points in the standard image frames, and marking the key points in each standard image frame;
S13, determining the displacement track of each key point according to the change of its coordinates in a two-dimensional plane across the different standard image frames;
S14, determining the rotation track of each key point according to the change of its angle in three-dimensional space across the different standard image frames;
S15, matching the displacement tracks and rotation tracks of the key points according to the time sequence of the standard video sequence to construct the motion model corresponding to the standard video sequence;
and S16, characterizing and demodulating the standard voice sequence according to the time sequence of the standard video sequence, defining an audio marker of the standard voice sequence at each time step, and matching the audio marker with the standard image frame of the corresponding time step to construct the voice model corresponding to the standard video sequence.
3. The human-computer interaction method based on image recognition and reconstruction as claimed in claim 2, wherein: the motion model comprises an expression model and an action model; the expression model is used for reconstructing a facial image of the robot and controlling the robot to generate corresponding expression changes, and the action model is used for controlling the robot to generate corresponding action/posture changes.
4. The human-computer interaction method based on image recognition and reconstruction as claimed in claim 1, wherein the step S2 specifically comprises: extracting the motion characteristic information of the motion model and the voice characteristic information of the voice model according to the time sequence of the standard video sequence, wherein the motion characteristic information is used for controlling the robot to generate corresponding interactive actions and comprises limb motion characteristic points, limb motion units, facial expression characteristic points and expression motion units; and the voice characteristic information is used for controlling the robot to generate corresponding interactive sound and comprises pitch, timbre and the acoustic signal characteristics of phonemes that vary over time.
5. The human-computer interaction method based on image recognition and reconstruction as claimed in claim 1, wherein: the human-computer interaction method further comprises establishing a database, wherein establishing the database at least comprises constructing a structural data frame between the motion model and the robot and a structural data frame between the voice model and the robot, and the motion model and the voice model are both stored in the database.
6. A human-computer interaction system based on image recognition and reconstruction, comprising:
the video acquisition processing unit comprises a video acquisition module and a video processing module, wherein the video acquisition module is used for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user; the video processing module is used for characterizing and demodulating the standard video sequence and the personalized video sequence;
the motion control unit is electrically connected with the video acquisition and processing unit and comprises a motion track extraction module and a motion state fitting module, wherein the motion track extraction module is used for acquiring the displacement track and the rotation track of each key point in a standard video sequence; the motion state fitting module is used for constructing a motion model corresponding to the standard video sequence;
the voice synthesis unit is respectively electrically connected with the video acquisition processing unit and the motion control unit and comprises a voice extraction module and an audio reconstruction module, wherein the voice extraction module is used for extracting audio information in a standard video sequence according to the time sequence of the standard video sequence; the audio reconstruction module is used for reconstructing a voice model according to the time sequence of a standard video sequence;
the feature point matching unit is electrically connected with the video acquisition and processing unit, the motion control unit and the voice synthesis unit respectively, and comprises a motion feature matching module and a voice feature matching module, wherein the motion feature matching module is used for matching motion personality information of a target user with a motion model so as to generate a corresponding motion control instruction; the voice characteristic matching module is used for matching the voice personality information of the target user with the voice model so as to generate a corresponding voice control instruction;
and the behavior execution unit is electrically connected with the characteristic point matching unit and is used for receiving the motion control instruction and/or the voice control instruction sent by the characteristic point matching unit so as to interact with a target user.
7. The human-computer interaction system of image-based recognition and reconstruction of claim 6, wherein: the motion control unit further comprises an expression reconstruction module, the expression reconstruction module comprises an expression fitting module and an expression driving module, and the expression fitting module is used for fitting and reconstructing a facial expression model of the robot according to each key point in the standard video sequence; the expression driving module is used for driving the facial expression model to generate corresponding expressions according to the displacement tracks of the key points in the standard video sequence.
8. The human-computer interaction system of image-based recognition and reconstruction of claim 7, wherein: the human-computer interaction system further comprises a storage unit, and the storage unit is electrically connected with the motion control unit and the voice synthesis unit respectively so as to store the motion model, the voice model and the facial expression model.
9. An image-based recognition and reconstruction interactive apparatus, which is a robot including a main body, a head portion movably connected to the main body, and a trunk portion connected to the main body, characterized in that: the robot also comprises a video acquisition module used for acquiring a standard video sequence of a standard user and a personalized video sequence of a target user and characterizing and demodulating the standard video sequence and the personalized video sequence;
the model reconstruction module is used for constructing a motion model and a voice model according to a standard video sequence;
and the data processing module is used for matching the motion model and the voice model with the personalized video sequence and generating a corresponding control instruction so as to control the robot to perform corresponding display/action.
10. A computer-readable storage medium storing a computer program, characterized in that: the computer program, when executed by a processor, may implement the human-computer interaction method for image-based recognition and reconstruction as claimed in any one of claims 1 to 6.
CN201910027460.6A 2019-01-11 2019-01-11 Human-computer interaction method based on image recognition and reconstruction and system and device using same Pending CN111435268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910027460.6A CN111435268A (en) 2019-01-11 2019-01-11 Human-computer interaction method based on image recognition and reconstruction and system and device using same

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910027460.6A CN111435268A (en) 2019-01-11 2019-01-11 Human-computer interaction method based on image recognition and reconstruction and system and device using same

Publications (1)

Publication Number Publication Date
CN111435268A true CN111435268A (en) 2020-07-21

Family

ID=71579852

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910027460.6A Pending CN111435268A (en) 2019-01-11 2019-01-11 Human-computer interaction method based on image recognition and reconstruction and system and device using same

Country Status (1)

Country Link
CN (1) CN111435268A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023097967A1 (en) * 2021-11-30 2023-06-08 上海商汤智能科技有限公司 Action detection method and apparatus, device, storage medium, and computer program product

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101488346A (en) * 2009-02-24 2009-07-22 深圳先进技术研究院 Speech visualization system and speech visualization method
CN101751692A (en) * 2009-12-24 2010-06-23 四川大学 Method for voice-driven lip animation
KR20110060039A (en) * 2009-11-30 2011-06-08 동국대학교 산학협력단 Communication robot and controlling method therof
CN104077804A (en) * 2014-06-09 2014-10-01 广州嘉崎智能科技有限公司 Method for constructing three-dimensional human face model based on multi-frame video image
CN106327482A (en) * 2016-08-10 2017-01-11 东方网力科技股份有限公司 Facial expression reconstruction method and device based on big data
CN107053191A (en) * 2016-12-31 2017-08-18 华为技术有限公司 A kind of robot, server and man-machine interaction method
CN108297098A (en) * 2018-01-23 2018-07-20 上海大学 The robot control system and method for artificial intelligence driving
CN108363706A (en) * 2017-01-25 2018-08-03 北京搜狗科技发展有限公司 The method and apparatus of human-computer dialogue interaction, the device interacted for human-computer dialogue
CN108833941A (en) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 Man-machine dialogue system method, apparatus, user terminal, processing server and system
CN109063624A (en) * 2018-07-26 2018-12-21 深圳市漫牛医疗有限公司 Information processing method, system, electronic equipment and computer readable storage medium
CN109151540A (en) * 2017-06-28 2019-01-04 武汉斗鱼网络科技有限公司 The interaction processing method and device of video image


Similar Documents

Publication Publication Date Title
Ginosar et al. Learning individual styles of conversational gesture
Olszewski et al. High-fidelity facial and speech animation for VR HMDs
CN105426827B (en) Living body verification method, device and system
WO2022116977A1 (en) Action driving method and apparatus for target object, and device, storage medium, and computer program product
US20190188903A1 (en) Method and apparatus for providing virtual companion to a user
Aran et al. Signtutor: An interactive system for sign language tutoring
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
CN110211222B (en) AR immersion type tour guide method and device, storage medium and terminal equipment
CN111291674B (en) Method, system, device and medium for extracting expression actions of virtual figures
Yargıç et al. A lip reading application on MS Kinect camera
Goto et al. Automatic face cloning and animation using real-time facial feature tracking and speech acquisition
US20040068408A1 (en) Generating animation from visual and audio input
CN112528768A (en) Action processing method and device in video, electronic equipment and storage medium
Niewiadomski et al. Rhythmic body movements of laughter
CN108595012A (en) Visual interactive method and system based on visual human
CN112190921A (en) Game interaction method and device
CN116825365B (en) Mental health analysis method based on multi-angle micro-expression
CN111435268A (en) Human-computer interaction method based on image recognition and reconstruction and system and device using same
Liu et al. 4D facial analysis: A survey of datasets, algorithms and applications
Putra et al. Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping
CN111768729A (en) VR scene automatic explanation method, system and storage medium
CN115454256A (en) Digital oath word tombstone device
CN114677476A (en) Face processing method and device, computer equipment and storage medium
Kasiran et al. Facial expression as an implicit customers' feedback and the challenges
CN113703564A (en) Man-machine interaction equipment and system based on facial features

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination