CN114185430A - Human-computer interaction system and method and intelligent robot - Google Patents

Human-computer interaction system and method and intelligent robot

Info

Publication number
CN114185430A
CN114185430A · CN202111358580.8A · CN202111358580A
Authority
CN
China
Prior art keywords
model
face
user
face image
preset
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111358580.8A
Other languages
Chinese (zh)
Inventor
刘娜
袁野
张赛
王中磐
吴国栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongyuan Power Intelligent Robot Co ltd
Original Assignee
Zhongyuan Power Intelligent Robot Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhongyuan Power Intelligent Robot Co ltd filed Critical Zhongyuan Power Intelligent Robot Co ltd
Priority to CN202111358580.8A
Publication of CN114185430A
Legal status: Pending

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • BPERFORMING OPERATIONS; TRANSPORTING
    • B25HAND TOOLS; PORTABLE POWER-DRIVEN TOOLS; MANIPULATORS
    • B25JMANIPULATORS; CHAMBERS PROVIDED WITH MANIPULATION DEVICES
    • B25J11/00Manipulators not otherwise provided for
    • B25J11/0005Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means
    • B25J11/001Manipulators having means for high-level communication with users, e.g. speech generator, face recognition means with emotions simulating means
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range

Abstract

The application discloses a human-computer interaction system and method and an intelligent robot. The human-computer interaction system comprises an image acquisition module, a model acceleration module, a model inference module and an interactive function module. The image acquisition module is used for initializing parameters of a camera of the intelligent robot based on a preset initialization strategy and controlling the camera to acquire a user face image when a user interacts with the intelligent robot; the model acceleration module is used for performing model conversion on a preset face recognition model to obtain a TensorRT model and creating a model operation engine based on the TensorRT model; the model inference module is used for performing recognition inference on the user face image based on the model operation engine to obtain an inference result; and the interactive function module is used for feeding back interactive action information to the user according to the inference result. The embodiment can ensure the accuracy of the face recognition result, improve the face recognition speed and reduce the hardware cost.

Description

Human-computer interaction system and method and intelligent robot
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a human-computer interaction system, a human-computer interaction method and an intelligent robot.
Background
Artificial intelligence is the science and technology of simulating, extending and expanding human intelligence. Deep convolutional neural networks in artificial intelligence can extract user features well, so that an intelligent machine can recognize user behaviors more accurately from those features, which facilitates human-computer interaction with the user.
At present, most intelligent machines, such as smart speakers and robotic vacuum cleaners, only provide voice recognition and networking functions, so their human-computer interaction capability is limited. Deep convolutional neural networks, moreover, have high complexity and heavy computation, which places high demands on edge devices; considering the deployment cost of edge devices, a deep convolutional neural network actually deployed on intelligent-robot hardware generally struggles to achieve real-time, efficient and accurate user identification and fast interaction based on the user's facial features.
Disclosure of Invention
The application provides a human-computer interaction system, a human-computer interaction method and an intelligent robot, aiming to solve the technical problem that existing deep convolutional neural networks perform poorly at face recognition on intelligent robots.
In order to solve the above technical problem, in a first aspect, an embodiment of the present application provides a human-computer interaction system applied to an intelligent robot, where the human-computer interaction system includes an image acquisition module, a model acceleration module, a model inference module, and an interactive function module;
the image acquisition module is used for initializing parameters of a camera of the intelligent robot based on a preset initialization strategy, and controlling the camera to acquire a user face image when a user interacts with the intelligent robot;
the model acceleration module is used for performing model conversion on a preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model;
the model inference module is used for performing recognition inference on the user face image based on the model operation engine to obtain an inference result;
and the interactive function module is used for feeding back interactive action information to the user according to the inference result.
In this embodiment, the image acquisition module initializes the parameters of the camera of the intelligent robot based on the preset initialization strategy, so that images are acquired in the same manner as during model testing, which ensures the quality of the data set and therefore the accuracy of the face recognition result; the model acceleration module performs model conversion on the preset face recognition model to reduce model complexity, improve face recognition speed and lower the hardware cost of deploying the model; and the model inference module and the interactive function module realize the interaction between the intelligent robot and the user.
In an embodiment, the model acceleration module specifically includes:
the conversion unit is used for carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
the serialization unit is used for serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
and the creating unit is used for reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
In an embodiment, the model inference module specifically includes:
the detection unit is used for extracting the face features of the face image of the user to obtain first face features and detecting a face area and face key points in the face image of the user according to the first face features;
the alignment unit is used for carrying out angle correction on a face area in the user face image by utilizing a Procrustes analysis method and carrying out alignment transformation on face key points in the user face image to obtain a target face image;
and the determining unit is used for extracting the face features of the target face image to obtain second face features, and determining an inference result corresponding to the user face image according to the second face features.
In an embodiment, the detecting unit specifically includes:
the first extraction subunit is used for extracting the overall characteristics of the face image of the user to obtain the overall characteristics;
the second extraction subunit is used for extracting the multi-scale features of the face image of the user according to the overall features to obtain a multi-scale feature map;
and the output subunit is used for outputting the face region and the face key point of the user face image according to the multi-scale feature map based on an SSH algorithm.
In an embodiment, the alignment unit specifically includes:
the first determining subunit is used for determining a face inclination angle between a face area in the user face image and a preset standard area based on a least square method;
the rotation subunit is used for performing rotation transformation on the face area in the user face image according to the face inclination angle;
and the alignment subunit is used for performing alignment transformation on the face key points in the user face image after the rotation transformation to obtain the target face image.
In an embodiment, the determining unit specifically includes:
the enhancement subunit is used for horizontally flipping the target face image and splicing the target face images before and after flipping to obtain a target spliced image;
the third extraction subunit is used for extracting the face features of the target spliced image to obtain the second face features;
the query subunit is configured to traverse a preset face database, and query a target feature ID with the highest similarity to the second face feature and the similarity greater than a preset threshold;
and the second determining subunit is used for determining an inference result corresponding to the user face image according to the target feature ID.
In an embodiment, the interactive function module specifically includes:
the motion control unit is used for controlling the intelligent robot to move to a target position close to the user according to the user position information in the inference result;
the voice unit is used for combining preset voice information with the user identity information in the inference result to obtain target voice information and broadcasting the target voice information to a user;
and the expression control unit is used for simulating the expression of the user according to the user expression information in the inference result.
In a second aspect, an embodiment of the present application provides a human-computer interaction method, which is applied to an intelligent robot, and the method includes:
initializing parameters of a camera based on a preset initialization strategy, and controlling the camera to acquire a user face image when a user interacts with the intelligent robot;
performing model conversion on a preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model;
performing recognition inference on the user face image based on the model operation engine to obtain an inference result;
and feeding back interactive action information to the user according to the inference result.
In an embodiment, the performing model conversion on a preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model includes:
carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
In a third aspect, an embodiment of the present application provides an intelligent robot, including a processor and a memory, where the memory is used to store a computer program, and the processor, when executing the computer program, implements the human-computer interaction method according to the second aspect.
It should be noted that, please refer to the relevant description of the first aspect for the beneficial effects of the second aspect and the third aspect, which are not described herein again.
Drawings
Fig. 1 is a schematic structural diagram of a human-computer interaction system according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a human-computer interaction method according to an embodiment of the present disclosure;
fig. 3 is a schematic structural diagram of an intelligent robot according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
As described in the related art, most intelligent machines, such as smart speakers and robotic vacuum cleaners, only provide voice recognition and networking functions, so their human-computer interaction capability is limited. Deep convolutional neural networks, moreover, have high complexity and heavy computation, which places high demands on edge devices; considering the deployment cost of edge devices, a deep convolutional neural network actually deployed on intelligent-robot hardware generally struggles to achieve real-time, efficient and accurate user identification and fast interaction based on the user's facial features.
Therefore, the embodiments of the present application provide a human-computer interaction system, a human-computer interaction method and an intelligent robot. The image acquisition module initializes the parameters of the camera of the intelligent robot based on a preset initialization strategy, so that images are acquired in the same manner as during model testing, which ensures the quality of the data set and therefore the accuracy of the face recognition result; the model acceleration module performs model conversion on a preset face recognition model to reduce model complexity, improve face recognition speed and lower the hardware cost of deploying the model; and the model inference module and the interactive function module realize the interaction between the intelligent robot and the user.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a human-computer interaction system applied to an intelligent robot according to an embodiment of the present application, where the human-computer interaction system includes an image acquisition module 101, a model acceleration module 102, a model inference module 103, and an interaction function module 104;
the image acquisition module 101 is configured to perform parameter initialization on a camera of the intelligent robot based on a preset initialization strategy, and control the camera to acquire a user face image when interacting with the intelligent robot.
The performance of a deep learning model depends on large amounts of data, and the quality of the data set must be ensured. In the field of robots, because software/hardware design and installation differ, the equipment that collects data before model training often differs from the equipment that collects data when the model is used and tested, and this difference in data acquisition degrades model performance. In this embodiment, the intelligent robot is therefore initialized through the initialization strategy to reduce the data difference introduced during data acquisition.
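As an illustration only, the following is a minimal sketch, assuming OpenCV, of how the camera parameters might be initialized with a preset strategy so that deployed acquisition matches the conditions used during model testing; the specific property values are assumptions, not values from this application.

```python
import cv2

# Hypothetical preset initialization strategy: fixed resolution, frame rate and exposure mode
# so that deployed images resemble the images used when the model was tested.
PRESET_CAMERA_PARAMS = {
    cv2.CAP_PROP_FRAME_WIDTH: 640,
    cv2.CAP_PROP_FRAME_HEIGHT: 480,
    cv2.CAP_PROP_FPS: 30,
    cv2.CAP_PROP_AUTO_EXPOSURE: 1,   # keep exposure fixed to reduce lighting variation
}

def init_camera(device_index=0):
    cap = cv2.VideoCapture(device_index)
    for prop, value in PRESET_CAMERA_PARAMS.items():
        cap.set(prop, value)          # apply the preset parameter initialization
    return cap

def capture_user_face_frame(cap):
    ok, frame = cap.read()            # frame acquired while the user interacts with the robot
    return frame if ok else None
```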
The model acceleration module 102 is configured to perform model conversion on a preset face recognition model to obtain a TensorRT model, and create a model operation engine based on the TensorRT model.
When the inference module of the robot is deployed, the trained face recognition model may be very large, with many parameters, and the performance of the deployment machines varies, so inference can be slow and latency high. To reduce cost, the model acceleration module therefore converts the model with TensorRT. TensorRT is a high-performance deep learning Inference optimizer that can optimize a trained model and provide low-latency, high-throughput inference deployment for deep learning applications. TensorRT can be used to accelerate inference in very large-scale data centers, on embedded platforms or on autonomous driving platforms.
The model inference module 103 is configured to perform recognition inference on the user face image based on the model operation engine to obtain an inference result.
The recognition inference is carried out on the user face image through the model operation engine created based on the TensorRT model, which improves the recognition speed. Optionally, the recognition inference process may include face region and key point detection, face rectification and alignment, and face characterization and recognition.
The interactive function module 104 is configured to feed back interactive action information to the user according to the inference result.
The inference result includes, but is not limited to, user identity information, user expression information, and the like.
In an embodiment, based on the embodiment shown in fig. 1, the model acceleration module 102 specifically includes:
the conversion unit is used for carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
the serialization unit is used for serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
and the creating unit is used for reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
In this embodiment, an example of the process of accelerating the model with TensorRT includes the following steps (a code sketch follows the list):
installing TensorRT and confirming the CUDA version of the device; converting the trained model from the PyTorch format into the universal ONNX format;
converting the ONNX model into a TensorRT model for acceleration and deployment, where inter-layer fusion and precision calibration are completed during the optimization of the model conversion; the output of this step is an optimized TensorRT model for a specific GPU platform and network model, and the TensorRT model can be serialized and stored to disk or memory;
and testing the engine model: deserializing the model file from the previous step, creating a runtime engine, feeding in data (such as pictures outside the test set or data set), and outputting a classification vector result or a detection result.
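The following is a minimal sketch of this conversion flow, assuming a PyTorch face recognition model, the TensorRT 8+ Python API and an ONNX intermediate file; the file names, input resolution and FP16 precision flag are illustrative assumptions rather than details from this application.

```python
import torch
import tensorrt as trt

def export_to_onnx(model, onnx_path="face_rec.onnx"):
    """Convert the trained PyTorch model into the universal ONNX format."""
    model.eval()
    dummy = torch.randn(1, 3, 112, 112)  # assumed input resolution
    torch.onnx.export(model, dummy, onnx_path,
                      input_names=["input"], output_names=["embedding"])

def build_and_serialize_engine(onnx_path, plan_path="face_rec.plan"):
    """Build the optimized TensorRT model (layer fusion, reduced precision) and serialize it."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
    parser = trt.OnnxParser(network, logger)
    with open(onnx_path, "rb") as f:
        parser.parse(f.read())                 # inter-layer fusion happens inside the builder
    config = builder.create_builder_config()
    config.set_flag(trt.BuilderFlag.FP16)      # reduced precision; INT8 calibration is analogous
    serialized = builder.build_serialized_network(network, config)
    with open(plan_path, "wb") as f:           # store the optimized file in a preset storage space
        f.write(serialized)

def load_runtime_engine(plan_path="face_rec.plan"):
    """Read the optimized file, deserialize it and create the model operation (runtime) engine."""
    logger = trt.Logger(trt.Logger.WARNING)
    runtime = trt.Runtime(logger)
    with open(plan_path, "rb") as f:
        return runtime.deserialize_cuda_engine(f.read())
```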
In an embodiment, on the basis of the embodiment shown in fig. 1, the model inference module 103 specifically includes:
and the detection unit is used for extracting the face features of the user face image to obtain first face features and detecting a face area and face key points in the user face image according to the first face features.
The alignment unit is used for carrying out angle correction on a face area in the user face image by utilizing a Poinch analysis method and carrying out alignment transformation on face key points in the user face image to obtain a target face image;
and the determining unit is used for extracting the face features of the target face image to obtain second face features, and determining an inference result corresponding to the user face image according to the second face features.
In this embodiment, the detection unit is correspondingly applied to a face region and key point detection process, the alignment unit is correspondingly applied to a face rectification alignment process, and the determination unit is correspondingly applied to a face characterization and recognition process.
Optionally, the detection unit specifically includes:
the first extraction subunit is used for extracting the overall characteristics of the face image of the user to obtain the overall characteristics;
the second extraction subunit is used for extracting multi-scale features according to the overall features to obtain a multi-scale feature map;
and the output subunit is used for outputting the face region and the face key point of the user face image according to the multi-scale feature map based on an SSH algorithm.
In this embodiment, the neural network model used for detecting the face region and the key points may be a RetinaFace model, which is encapsulated in the detection unit to implement the face region and key point detection process. The RetinaFace model comprises: a deep convolutional neural network such as MobileNet or ResNet as the backbone network to extract the overall features of the picture; an FPN feature pyramid to extract multi-scale features; and the Context Modeling method of the SSH algorithm, with output heads for the face classification score, the bounding box (i.e., the face region) and the key-point regression, trained with a multi-task loss (classification loss, bounding-box IoU loss and face key-point loss, respectively).
Illustratively, the specific process of RetinaFace network training includes the following steps (a sketch of the loop is given after the list):
Step 1, data loading, preprocessing and initialization of model parameters. Data loading reads the pictures and their labels, converts them into the format used for PyTorch training and normalizes the data; model parameter initialization initializes the backbone network, the FPN feature pyramid network, the SSH detection module and the fully connected network layers of the model.
Step 2, the model processes the sample data and outputs embeddings, including the classification score (probability value), the bounding box coordinates and the key point coordinates;
Step 3, a loss function is calculated from the embeddings and the label data;
and Step 4, the model parameters are adjusted through back-propagation according to the loss function until the model converges, which yields the detection unit.
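The following is an illustrative sketch of Steps 1-4, assuming a RetinaFace-style PyTorch model that returns classification scores, boxes and landmarks, and a multi-task loss combining them; the model, data loader, loss implementation and hyperparameters are hypothetical placeholders.

```python
import torch

def train_detector(model, loader, multitask_loss, epochs=50, lr=1e-3):
    # Step 1: the loader yields normalized images and their labels (classes, boxes, landmarks)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9, weight_decay=5e-4)
    model.train()
    for _ in range(epochs):
        for images, targets in loader:
            scores, boxes, landmarks = model(images)                 # Step 2: model outputs
            loss = multitask_loss(scores, boxes, landmarks, targets) # Step 3: multi-task loss
            optimizer.zero_grad()
            loss.backward()                                          # Step 4: back-propagation
            optimizer.step()
    return model
```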
Illustratively, the inference process of the detection unit includes: reading a video frame and resizing the frame image to (640 × 640) or (320 × 320); feeding the resized image into the network backbone for feature extraction and outputting the key point positions, scores and a number of face anchors from the extracted features; performing NMS (non-maximum suppression) on the anchors and selecting the highest-quality face box (i.e., the face region); and restoring the face box and its corresponding key points to their positions in the original frame according to the resize ratio and displaying them.
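The following is a minimal sketch of the post-processing described above, namely non-maximum suppression over the candidate face boxes and restoring coordinates to the original frame scale; the box format and IoU threshold are assumptions for illustration.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.4):
    """boxes: (N, 4) array as [x1, y1, x2, y2]; returns indices of kept boxes."""
    order = scores.argsort()[::-1]            # sort candidates by score, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes that overlap the kept box too much
    return keep

def restore_to_original(boxes, landmarks, resize_scale):
    """Scale boxes and 5-point landmarks from the resized input back to the original frame."""
    return boxes / resize_scale, landmarks / resize_scale
```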
In an embodiment, on the basis of the embodiment shown in fig. 1, the alignment unit specifically includes:
the first determining subunit is used for determining a face inclination angle between a face area in the user face image and a preset standard area based on a least square method;
the rotation subunit is used for performing rotation transformation on the face area in the user face image according to the face inclination angle;
and the alignment subunit is used for performing alignment transformation on the face key points in the user face image after the rotation transformation to obtain the target face image.
In this embodiment, face rectification and alignment adopt a Procrustes transform algorithm, and the Procrustes analysis method is encapsulated in the alignment unit so that the alignment unit realizes the face rectification and alignment process. Procrustes analysis is a method for analyzing shape distributions: mathematically, it iterates to find a standard shape and uses the least-squares method to find the affine transformation from each sample shape to that standard shape. In this embodiment, a tilted face is corrected and cropped into a face image of uniform size according to the iterated standard template. The objects used to obtain the affine transformation parameters are the set of detected face key points and the set of key points of the labelled template. Procrustes analysis thus pre-processes the raw data and provides a better local deformation model as the basis for subsequent model learning; after images are processed in this way, the normalization makes the facial structure increasingly evident, i.e., the positions of the facial feature clusters move closer and closer to their mean positions.
Illustratively, the specific implementation of the Procrustes analysis method in the alignment unit includes the following steps (a sketch follows the list):
Step 1, averaging each sample point i (i = 1, 2, ..., N) (i.e., the five face key-point positions: left eye, right eye, nose and the two mouth corners) over the N images;
Step 2, normalizing the sizes of all shapes and subtracting the corresponding mean from each sample point;
Step 3, calculating the center of gravity of the shape in each image from the de-centered data;
Step 4, aligning the standard shape and the sample shapes based on the center of gravity and angle so that the Procrustes distance between the two shapes is minimal.
Specifically, the standard shape is obtained by averaging all normalized sample points across the images; the rotation angle from the sample shape in each image to the standard shape is calculated with the least-squares method; the sample shape is rotated accordingly to obtain a new shape aligned with the standard shape; and the above steps are repeated until a specified number of iterations is reached or the change of the standard shape between two iterations falls below a threshold.
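The following is a minimal sketch of aligning the five detected face key points to a standard template by Procrustes analysis (centering, scale normalization and a least-squares rotation); the template itself is assumed to come from the iterative averaging described above.

```python
import numpy as np

def procrustes_align(points, template):
    """points, template: (5, 2) arrays of key-point coordinates; returns points aligned to template."""
    # remove translation (center of gravity) and scale from both shapes
    p = points - points.mean(axis=0)
    t = template - template.mean(axis=0)
    p = p / np.linalg.norm(p)
    t = t / np.linalg.norm(t)
    # least-squares rotation minimizing the Procrustes distance (orthogonal Procrustes via SVD)
    u, _, vt = np.linalg.svd(p.T @ t)
    rotation = u @ vt
    return p @ rotation
```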
In an embodiment, the determining unit specifically includes:
the enhancement subunit is used for horizontally flipping the target face image and splicing the target face images before and after flipping to obtain a target spliced image;
the third extraction subunit is used for extracting the face features of the target spliced image to obtain the second face features;
the query subunit is configured to traverse a preset face database, and query a target feature ID with the highest similarity to the second face feature and the similarity greater than a preset threshold;
and the second determining subunit is used for determining an inference result corresponding to the user face image according to the target feature ID.
In this embodiment, an ArcFace face recognition model is selected for the face characterization and recognition process, and the ArcFace model is encapsulated in the determining unit so that the determining unit realizes the face characterization and recognition process. The ArcFace model adopts the Additive Angular Margin Loss as its metric function to compute the angular (cosine) distance between two features, thereby enlarging the differences between classes and improving the recognition effect. It mainly comprises: extracting features from the aligned and corrected face image through a deep convolutional neural network such as a MobileNet or ResNet-IR network; and computing the cosine similarity between the extracted features and the known face features in the database so as to match the closest known user ID (the target feature ID).
Illustratively, the processing flow of the ArcFace face recognition model includes the following steps (a sketch follows the list):
Step 1, face image enhancement: horizontally flipping the picture and splicing the original and flipped images to obtain a spliced image;
Step 2, extracting 512-dimensional features of the spliced image as the face representation;
Step 3, face feature matching: traversing the feature information in the database, obtaining the target feature ID whose similarity to the feature to be identified is the highest (and greater than a preset threshold), and outputting and displaying the matched ID.
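The following is an illustrative sketch of this matching flow, assuming an embedding model that maps an image to a 512-dimensional feature and a face database stored as a dictionary of normalized features; the splicing layout, threshold and interfaces are assumptions, not details from this application.

```python
import numpy as np

def extract_embedding(embed_model, face_img):
    """face_img: aligned face as an (H, W, 3) array; returns an L2-normalized 512-d representation."""
    flipped = face_img[:, ::-1, :]                          # Step 1: horizontal flip
    spliced = np.concatenate([face_img, flipped], axis=1)   # splice original and flipped views
    emb = embed_model(spliced)                              # Step 2: 512-d face representation
    return emb / np.linalg.norm(emb)

def match_identity(embedding, face_database, threshold=0.5):
    """face_database: dict of user_id -> normalized 512-d feature; returns the best ID or None."""
    best_id, best_sim = None, -1.0
    for user_id, feature in face_database.items():          # Step 3: traverse the database
        sim = float(np.dot(embedding, feature))             # cosine similarity of normalized features
        if sim > best_sim:
            best_id, best_sim = user_id, sim
    return best_id if best_sim > threshold else None
```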
In an embodiment, the interactive function module 104 specifically includes:
the motion control unit is used for controlling the intelligent robot to move to a target position close to the user according to the user position information in the inference result;
the voice unit is used for combining preset voice information with the user identity information in the inference result to obtain target voice information and broadcasting the target voice information to a user;
and the expression control unit is used for simulating the expression of the user according to the user expression information in the inference result.
In this embodiment, the interactive function module includes the motion control unit, the voice unit, the expression control unit and the like, which perform the corresponding interactive actions for the user according to the inference result, for example approaching the user through the motion control unit, imitating the user's expression through the expression control unit, and announcing the user's name when the voice unit plays the voice.
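As a purely hypothetical sketch, the following shows how such a module might dispatch one inference result to the three units above; the InferenceResult fields and unit interfaces are illustrative assumptions only.

```python
from dataclasses import dataclass

@dataclass
class InferenceResult:
    user_id: str        # user identity information
    position: tuple     # user position information, e.g. (x, y)
    expression: str     # user expression information, e.g. "smile"

class InteractiveFunctionModule:
    def __init__(self, motion_unit, voice_unit, expression_unit):
        self.motion_unit = motion_unit
        self.voice_unit = voice_unit
        self.expression_unit = expression_unit

    def feed_back(self, result: InferenceResult):
        self.motion_unit.move_towards(result.position)   # approach the user
        greeting = f"Hello, {result.user_id}!"           # combine preset speech with identity info
        self.voice_unit.broadcast(greeting)
        self.expression_unit.imitate(result.expression)  # simulate the user's expression
```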
Referring to fig. 2, fig. 2 shows a flowchart illustrating a human-computer interaction method, which may be applied to an intelligent robot according to an embodiment of the present application, and as shown in fig. 2, the method includes steps S201 to S204.
Step S201, initializing parameters of a camera based on a preset initialization strategy, and controlling the camera to acquire a user face image when a user interacts with the intelligent robot;
step S202, performing model conversion on a preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model;
Step S203, performing recognition inference on the user face image based on the model operation engine to obtain an inference result;
and step S204, feeding back interactive action information to the user according to the inference result.
In an embodiment, based on the embodiment shown in fig. 2, the step S202 includes:
carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
It should be understood that, for the explanation of the steps of the human-computer interaction method of the embodiment, reference may be made to the description of the human-computer interaction system in fig. 1, and details are not described herein again.
Fig. 3 is a schematic structural diagram of an intelligent robot according to an embodiment of the present application. As shown in fig. 3, the intelligent robot 3 of this embodiment includes: at least one processor 30 (only one shown in fig. 3), a memory 31, and a computer program 32 stored in the memory 31 and executable on the at least one processor 30, the processor 30 implementing the steps of any of the above-described method embodiments when executing the computer program 32.
The intelligent robot may include, but is not limited to, a processor 30, a memory 31. Those skilled in the art will appreciate that fig. 3 is merely an example of the intelligent robot 3, and does not constitute a limitation of the intelligent robot 3, and may include more or less components than those shown, or combine some components, or different components, such as input and output devices, network access devices, and the like.
The Processor 30 may be a Central Processing Unit (CPU), and may also be another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.
The memory 31 may in some embodiments be an internal storage unit of the intelligent robot 3, such as a hard disk or a memory of the intelligent robot 3. The memory 31 may also be an external storage device of the Smart robot 3 in other embodiments, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, which are provided on the Smart robot 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the intelligent robot 3. The memory 31 is used for storing an operating system, an application program, a BootLoader (BootLoader), data, and other programs, such as program codes of the computer program. The memory 31 may also be used to temporarily store data that has been output or is to be output.
In addition, an embodiment of the present application further provides a computer-readable storage medium, where a computer program is stored, and when the computer program is executed by a processor, the computer program implements the steps in any of the method embodiments described above.
The embodiment of the present application provides a computer program product, which when running on an intelligent robot, enables the intelligent robot to implement the steps in the above method embodiments when executed.
In several embodiments provided herein, it will be understood that each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application or portions thereof that substantially contribute to the prior art may be embodied in the form of a software product stored in a storage medium and including instructions for causing a terminal device to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk or an optical disk, and other various media capable of storing program codes.
The above-mentioned embodiments are further detailed to explain the objects, technical solutions and advantages of the present application, and it should be understood that the above-mentioned embodiments are only examples of the present application and are not intended to limit the scope of the present application. It should be understood that any modifications, equivalents, improvements and the like, which come within the spirit and principle of the present application, may occur to those skilled in the art and are intended to be included within the scope of the present application.

Claims (10)

1. A human-computer interaction system, characterized by comprising an image acquisition module, a model acceleration module, a model inference module and an interactive function module;
the image acquisition module is used for initializing parameters of a camera of the intelligent robot based on a preset initialization strategy and controlling the camera to acquire a user face image when a user interacts with the intelligent robot;
the model acceleration module is used for carrying out model conversion on a preset face recognition model to obtain a TensorRT model and establishing a model operation engine based on the TensorRT model;
the model inference module is used for performing recognition inference on the user face image based on the model operation engine to obtain an inference result;
and the interactive function module is used for feeding back interactive action information to the user according to the inference result.
2. The human-computer interaction system of claim 1, wherein the model acceleration module specifically comprises:
the conversion unit is used for carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
the serialization unit is used for serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
and the creating unit is used for reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
3. The human-computer interaction system of claim 1, wherein the model inference module specifically comprises:
the detection unit is used for extracting the face features of the face image of the user to obtain first face features and detecting a face area and face key points in the face image of the user according to the first face features;
the alignment unit is used for carrying out angle correction on a face area in the user face image by utilizing a Procrustes analysis method and carrying out alignment transformation on face key points in the user face image to obtain a target face image;
and the determining unit is used for extracting the face features of the target face image to obtain second face features, and determining the inference result corresponding to the user face image according to the second face features.
4. The human-computer interaction system of claim 3, wherein the detection unit specifically comprises:
the first extraction subunit is used for extracting the overall characteristics of the face image of the user to obtain the overall characteristics;
the second extraction subunit is used for extracting the multi-scale features of the user face image according to the overall features to obtain a multi-scale feature map;
and the output subunit is used for outputting the face region and the face key point in the user face image according to the multi-scale feature map by using a preset SSH algorithm.
5. The human-computer interaction system of claim 3, wherein the alignment unit specifically comprises:
the first determining subunit is used for determining a face inclination angle between a face area in the user face image and a preset standard area by using a least square method;
the rotation subunit is used for performing rotation transformation on the face area in the user face image according to the face inclination angle;
and the alignment subunit is used for performing alignment transformation on the face key points in the user face image after the rotation transformation to obtain the target face image.
6. The human-computer interaction system of claim 3, wherein the determining unit specifically comprises:
the enhancement subunit is used for horizontally flipping the target face image and splicing the target face images before and after flipping to obtain a target spliced image;
the third extraction subunit is used for extracting the face features of the target spliced image to obtain the second face features;
the query subunit is configured to traverse a preset face database, and query a target feature ID with the highest similarity to the second face feature and the similarity greater than a preset threshold;
and the second determining subunit is used for determining an inference result corresponding to the user face image according to the target feature ID.
7. The human-computer interaction system of claim 1, wherein the interaction function module specifically comprises:
the motion control unit is used for controlling the intelligent robot to move to a target position close to the user according to the user position information in the inference result;
the voice unit is used for combining preset voice information with the user identity information in the inference result to obtain target voice information and broadcasting the target voice information to a user;
and the expression control unit is used for simulating the expression of the user according to the user expression information in the inference result.
8. A human-computer interaction method is applied to an intelligent robot, and comprises the following steps:
initializing parameters of a camera based on a preset initialization strategy, and controlling the camera to acquire a user face image when a user interacts with the intelligent robot;
performing model conversion on a preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model;
performing recognition inference on the user face image based on the model operation engine to obtain an inference result;
and feeding back interactive action information to the user according to the inference result.
9. The human-computer interaction method of claim 8, wherein the performing model conversion on the preset face recognition model to obtain a TensorRT model, and creating a model operation engine based on the TensorRT model comprises:
carrying out interlayer fusion and precision calibration on the face recognition model to obtain the TensorRT model;
serializing the TensorRT model to obtain an optimized file, and storing the optimized file to a preset storage space;
reading the optimized file in the preset storage space, performing deserialization on the optimized file, and creating the model operation engine according to the deserialized optimized file.
10. An intelligent robot, characterized by comprising a processor and a memory for storing a computer program which, when executed by the processor, implements the human-computer interaction method of claim 8 or 9.
CN202111358580.8A 2021-11-12 2021-11-12 Human-computer interaction system and method and intelligent robot Pending CN114185430A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111358580.8A CN114185430A (en) 2021-11-12 2021-11-12 Human-computer interaction system and method and intelligent robot

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111358580.8A CN114185430A (en) 2021-11-12 2021-11-12 Human-computer interaction system and method and intelligent robot

Publications (1)

Publication Number Publication Date
CN114185430A true CN114185430A (en) 2022-03-15

Family

ID=80602145

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111358580.8A Pending CN114185430A (en) 2021-11-12 2021-11-12 Human-computer interaction system and method and intelligent robot

Country Status (1)

Country Link
CN (1) CN114185430A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114918935A (en) * 2022-05-17 2022-08-19 上海理工大学 Expression recognition and simulation system based on network reasoning and motor drive

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564036A (en) * 2018-04-13 2018-09-21 上海思依暄机器人科技股份有限公司 A kind of method for judging identity, device and Cloud Server based on recognition of face
CN109815804A (en) * 2018-12-19 2019-05-28 平安普惠企业管理有限公司 Exchange method, device, computer equipment and storage medium based on artificial intelligence
CN112364744A (en) * 2020-11-03 2021-02-12 珠海市卓轩科技有限公司 TensorRT-based accelerated deep learning image recognition method, device and medium
CN112487922A (en) * 2020-11-25 2021-03-12 奥比中光科技集团股份有限公司 Multi-mode face in-vivo detection method and system
CN112989875A (en) * 2019-12-13 2021-06-18 海信集团有限公司 Face recognition method, face recognition device and storage medium

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108564036A (en) * 2018-04-13 2018-09-21 上海思依暄机器人科技股份有限公司 A kind of method for judging identity, device and Cloud Server based on recognition of face
CN109815804A (en) * 2018-12-19 2019-05-28 平安普惠企业管理有限公司 Exchange method, device, computer equipment and storage medium based on artificial intelligence
CN112989875A (en) * 2019-12-13 2021-06-18 海信集团有限公司 Face recognition method, face recognition device and storage medium
CN112364744A (en) * 2020-11-03 2021-02-12 珠海市卓轩科技有限公司 TensorRT-based accelerated deep learning image recognition method, device and medium
CN112487922A (en) * 2020-11-25 2021-03-12 奥比中光科技集团股份有限公司 Multi-mode face in-vivo detection method and system

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Ye Rui, "Software Design of the Vision System of an Intelligent Service Robot Based on an Embedded GPU", China Master's Theses Full-text Database, Information Science and Technology, page 3 *
Xue Chen et al., "Face Detection Based on MTCNN under Complex Illumination Scenes", Journal of University of South China (Natural Science Edition), pages 1-4 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114918935A (en) * 2022-05-17 2022-08-19 上海理工大学 Expression recognition and simulation system based on network reasoning and motor drive
CN114918935B (en) * 2022-05-17 2024-04-02 上海理工大学 Expression recognition and simulation system based on network reasoning and motor driving


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination