CN116931733A - Information interaction method, device and system

Info

Publication number
CN116931733A
Authority
CN
China
Prior art keywords
target object
information
dimensional
model
image
Prior art date
Legal status
Pending
Application number
CN202310916000.5A
Other languages
Chinese (zh)
Inventor
丁先
Current Assignee
Bank of China Ltd
Original Assignee
Bank of China Ltd
Priority date
Filing date
Publication date
Application filed by Bank of China Ltd
Priority to CN202310916000.5A
Publication of CN116931733A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 Manipulating 3D models or images for computer graphics
    • G06T 19/006 Mixed reality

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses an information interaction method, device and system. The method provides a reference model set, in which each reference model is a model obtained by three-dimensional object modeling of a reference object in a different posture. During information interaction, the current interaction end selects from the set a target reference model matching the current gesture of the target object, builds a three-dimensional object model of the target object in the current gesture from real-time deformation information of the target object relative to the target reference model or a variant thereof, and projects a three-dimensional image of the target object in the current gesture into the physical space where the interaction opposite end is located. By parameterizing, with the reference model or its variant, the deformation of the target object caused by gesture changes, the estimate of the object surface is continuously reconstructed over time and three-dimensional reconstruction is accelerated to a real-time level, realizing a cross-space "face-to-face" communication experience between the two parties and improving interaction efficiency and quality.

Description

Information interaction method, device and system
Technical Field
The application belongs to the technical field of remote interaction, and particularly relates to an information interaction method, device and system.
Background
Currently, customer service systems in various industries mainly rely on text chat, voice calls and video calls to communicate with customers. Text communication can, because of ambiguity, cause the customer to misunderstand the information the customer service end intends to convey; voice calls sometimes prevent communication from being completed in time because of delayed sound reception; and video calls sometimes suffer from blurry image quality. These problems reduce the efficiency and quality of communication between customer service and customers, and none of these modes supports a more realistic, perceptual communication between the two interacting parties.
Disclosure of Invention
In view of the above, the present application provides an information interaction method, device and system, which overcome at least some of the defects of existing interaction methods and improve interaction efficiency and quality, enabling each interacting party to feel in an immersive manner that the opposite end is in the same physical space as itself, thereby realizing a cross-space-time "face-to-face" interaction experience between the two parties.
The specific scheme is as follows:
an information interaction method, comprising:
acquiring three-dimensional object information corresponding to a target object in a current gesture;
acquiring a target reference model which is determined from a preset reference model set and is matched with the current gesture of the target object, or acquiring a variant of the target reference model; the reference models in the reference model set are models obtained by respectively carrying out three-dimensional object modeling on reference objects in different postures in advance;
According to the three-dimensional object information, determining deformation information of the target object which is deformed compared with the target reference model or the variant thereof under the current gesture;
and generating a three-dimensional object model corresponding to the target object in the current gesture according to the target reference model or the variant thereof and the deformation information, so as to project a three-dimensional image of the target object in the current gesture to a physical space where an interaction opposite end is positioned based on the three-dimensional object model.
Optionally, the obtaining three-dimensional object information corresponding to the target object in the current gesture includes:
acquiring three-dimensional object information of the target object in the current gesture acquired from a scene where the target object is located by adopting multi-type acquisition equipment;
wherein the three-dimensional object information includes: at least part of the color image, depth image and texture information respectively corresponding to the target object under different view angles of the scene in the current gesture.
Optionally, the determining, according to the three-dimensional object information, deformation information of the target object deformed compared with the target reference model or the variant thereof in the current pose includes:
According to the color image and the depth image which correspond to the target object under the current gesture under different visual angles of the scene, determining the spatial position information of each pixel point on the image acquired under the current gesture;
according to the spatial position information of each pixel point, determining deformation parameter information of non-rigid deformation of the target object compared with the target reference model or the variant thereof under the current gesture;
wherein the deformation parameter information of the non-rigid deformation includes: position increment information for each position point on the target object whose position has changed relative to the corresponding position point of the target reference model or its variant; the variant of the target reference model is: a model frame obtained by historical three-dimensional modeling of the target object based on a historical gesture of the target object within the current period of the current gesture and on the target reference model.
Optionally, the determining spatial position information of each pixel point on the image collected in the current gesture according to the color image and the depth image respectively corresponding to the target object in different view angles of the scene in the current gesture includes:
determining depth data of different pixel points on each color image under different view angles according to the depth image and texture information respectively corresponding to the target object under different view angles of the scene under the current gesture;
And determining the spatial position information corresponding to each pixel point on each color image under the current gesture according to the depth data and the plane position information of different pixel points on each color image.
Optionally, the generating, according to the target reference model or the variant thereof and the deformation information, a three-dimensional object model corresponding to the target object in the current pose includes:
generating a three-dimensional grid model corresponding to the target object under the current gesture according to the target reference model or the variant thereof and position increment information corresponding to each position point on the target object when the position of each position point is changed compared with the corresponding position point of the reference model or the variant thereof;
and performing color and/or texture mapping and rendering processing on the three-dimensional model framework corresponding to the three-dimensional grid model according to the color image and/or texture information under the multi-view angle to obtain a three-dimensional object model corresponding to the target object under the current gesture.
Optionally, before generating the three-dimensional object model corresponding to the target object in the current pose, the method further includes:
acquiring an environment background image corresponding to a target object in a current gesture and a fusion image containing the environment background and the target object;
And determining a foreground image of the target object according to the environment background image and the fusion image, so as to generate a three-dimensional object model matched with the foreground view angle of the target object for the target object based on the foreground image, and correspondingly enabling the foreground part of the target object to face the interaction opposite end when the three-dimensional image of the target object is projected to the interaction opposite end.
Optionally, the method further comprises:
acquiring audio data of a synchronously acquired target object in a current gesture;
when the three-dimensional image of the target object is projected, synchronizing sound information of the target object to an interaction opposite terminal based on the audio data;
the audio data comprises an audio stream and a gesture stream, wherein the gesture stream is used for representing the head gesture of the target object so as to synchronously output sound information with spatial sound effect of the target object at the opposite interaction end based on the audio stream and the gesture stream.
Optionally, the target object is a customer service person, and the information interaction method is executed through customer service processing equipment and a server;
generating a three-dimensional grid model of the target object at the customer service processing equipment, and transmitting data to be transmitted to the server for processing; the data to be transmitted comprises: the color image, texture information and at least part of information in a foreground image of the target object corresponding to the target object under multiple viewing angles, and the three-dimensional grid model and synchronously acquired audio data;
The server renders a three-dimensional model frame corresponding to the three-dimensional grid model based on the received color image, texture information and at least part of information in the foreground image, projects a three-dimensional image corresponding to the target object in the current gesture to a physical space where the interaction opposite terminal is located based on the three-dimensional object model obtained by rendering, and synchronizes sound information of the target object to the interaction opposite terminal based on the received audio data.
Optionally, before transmitting data to the server, the customer service processing device further renders a foreground part of a three-dimensional model frame corresponding to the three-dimensional grid model based on a foreground image of a target object, and/or compresses the data to be transmitted;
wherein the compression process at least includes: determining repeated data of each site of the three-dimensional grid model of the target object compared with the historical time sequence point at the current time sequence point, and eliminating the repeated data;
the server acquires the data of the missing sites from the three-dimensional grid model of the corresponding historical time sequence points aiming at the missing site data generated by eliminating the repeated data.
Optionally, the projecting the three-dimensional image corresponding to the target object in the current gesture to the physical space where the interaction opposite end is located includes:
Acquiring physical space information of an interactive object;
and positioning a space anchor point according to the physical space information so as to project a three-dimensional image of the target object corresponding to the three-dimensional object model under the current gesture according to the anchor point obtained by positioning.
An information interaction device, comprising:
the first acquisition unit is used for acquiring three-dimensional object information corresponding to the target object in the current gesture;
the second acquisition unit is used for acquiring a target reference model which is determined from a preset reference model set and is matched with the current gesture of the target object or acquiring a variant of the target reference model; the reference models in the reference model set are models obtained by respectively carrying out three-dimensional object modeling on reference objects in different postures in advance;
a determining unit configured to determine deformation information of the target object that is deformed in a current pose compared to the target reference model or a variant thereof, based on the three-dimensional object information;
the model generation and projection processing unit is used for generating a three-dimensional object model corresponding to the target object in the current gesture according to the target reference model or the variant thereof and the deformation information so as to project a three-dimensional image of the target object in the current gesture to a physical space where the interaction opposite end is positioned based on the three-dimensional object model.
An information interaction system, comprising:
a client;
the server can be used for carrying out information interaction with the client;
in the process of information interaction with the client, the server projects a three-dimensional image corresponding to the target object in the current gesture to the physical space where the client is located by executing the information interaction method.
In summary, the information interaction method, device and system provided by the application provide a reference model set, in which the reference models are models obtained in advance by three-dimensional object modeling of reference objects in different postures. On this basis, during information interaction, the current interaction end selects from the reference model set a target reference model matching the current gesture of the target object, builds a three-dimensional object model of the target object in the current gesture from real-time deformation information of the target object relative to the target reference model or a variant thereof, and projects, according to the generated three-dimensional object model, a three-dimensional image of the target object in the current gesture into the physical space where the interaction opposite end is located.
By parameterizing, with the reference model or its variant, the deformation of the target object caused by real-time gesture changes, the application realizes an estimate of the continuously reconstructed target object surface that improves over time and accelerates three-dimensional reconstruction to a real-time level, so that each interacting party can immersively perceive the opposite end as being in the same physical space as itself, realizing a cross-space-time "face-to-face" communication experience between the two parties and improving interaction efficiency and quality.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings required in the embodiments or in the description of the prior art are briefly introduced below. It is apparent that the drawings in the following description show only embodiments of the present application, and a person skilled in the art can obtain other drawings from them without inventive effort.
FIG. 1 is a schematic flow chart of an information interaction method provided by the application;
FIG. 2 is a schematic flow chart of another information interaction method provided by the application;
FIG. 3 is a schematic flow chart of another information interaction method provided by the application;
FIG. 4 shows real-time remote interaction processing logic provided by the present application;
FIG. 5 is a structural diagram of the information interaction device provided by the application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
The application discloses an information interaction method, device and system which, for scenarios such as remote service communication between customer service and customers, remote teaching and remote conferencing, use 3D reconstruction technology to let each party feel in an immersive manner that the opposite end is in the same physical space as itself, realizing a cross-space "face-to-face" communication experience between the two parties. The embodiments of the application mainly take remote service communication between customer service (such as bank customer service) and customers as an example to explain the scheme.
Referring to a flowchart of an information interaction method shown in fig. 1, the information interaction method provided by the application includes the following processing flows:
Step 101, obtaining three-dimensional object information corresponding to the target object in the current gesture.
In the information interaction between the two interacting parties, the target object is the object whose real-time image is to be projected to the interaction opposite end, for example a customer service person, a teacher or a meeting participant, depending on the actual application scenario.
Taking a target object as a customer service person as an example, the application adopts a real-time 3D reconstruction technology to project a real-time 3D customer service person image to the physical world space of a customer in real time, so that the customer and the customer service can interact face to face in the same space. Correspondingly, in the step 101, three-dimensional object information corresponding to the target object in the current gesture is obtained, so as to be used for performing 3D reconstruction on the target object.
The 3D reconstruction here is vision-based: cameras acquire images of the scene object, the images are analyzed and processed, and the three-dimensional information of the object in the real environment is then inferred using computer vision techniques.
Three-dimensional object information of the target object in the current gesture can be acquired from the scene where the target object is located using multiple types of acquisition devices. The acquired three-dimensional object information includes, but is not limited to: at least part of the color images, depth images and texture information respectively corresponding to the target object under different view angles of the scene in the current gesture. The multiple types of acquisition devices correspondingly include, but are not limited to, RGB cameras, depth cameras, infrared acquisition devices (such as random infrared dot-array cameras) and the like.
Current mainstream 3D reconstruction technology includes static 3D reconstruction and multi-view point-cloud 3D reconstruction. The former can reconstruct a relatively complete surface but is limited to offline, non-real-time processing: processing a single depth image takes several minutes, so reconstructing a hundred-frame motion sequence can take hours. The latter improves performance but yields a poor stereoscopic effect because of holes and offset pixels.
To address these problems, the method acquires information about the target object from multiple view angles of its scene, obtaining as complete a 360-degree global view of the target object as possible, so as to minimize holes or offset pixels in the 3D reconstruction caused by limited view angles or occlusion of the acquisition equipment.
Taking customer service personnel as an example, a customer-service-end scene acquisition environment can be built in advance, optionally with professional acquisition equipment to collect information about the target object (the real customer service person) in the scene. The acquisition devices in the scene can collect, among other things, depth information, RGB color information and texture information of the target object. By configuring multiple groups of acquisition devices in the scene, the target object is captured over 360 degrees without blind angles: for example, 8 groups of devices are placed at different upper/lower and left/right view angles in the space, each group comprising 1 RGB camera, 2 NIR depth cameras and 1 structured-light sensor arranged at the same view angle, with an infrared acquisition device (such as a random infrared dot-array camera) added beside the camera device (such as the RGB camera) of each group; preferably, the RGB camera has more than 20 million pixels. The RGB cameras simultaneously collect RGB plane images of the target object's global view, the NIR depth cameras/structured-light sensors simultaneously collect depth images of the global view, and the infrared acquisition devices simultaneously collect texture information of the global view. For each moment in time, multi-angle shooting thus yields a group of plane image sequences of the target object, and these image sequences are the data input of the reconstruction system. Each acquisition camera in the scene performs a focusing operation on the target object, such as the customer service person, so that information such as appearance and limb movements can be collected from multiple angles.
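For illustration only, the following sketch encodes the capture rig just described as data: 8 groups around 360 degrees, each with 1 RGB camera, 2 NIR depth cameras, 1 structured-light sensor and an infrared dot-array camera. All class and field names are hypothetical assumptions; the patent specifies no software interface.

```python
# Hypothetical description of the multi-view capture rig outlined above.
from dataclasses import dataclass
from typing import List

@dataclass
class CaptureGroup:
    group_id: int
    yaw_deg: float                    # viewing angle around the subject
    elevation: str                    # "upper" or "lower" ring
    rgb_megapixels: float = 20.0      # RGB camera configured with >20 MP
    nir_depth_cameras: int = 2
    structured_light_sensors: int = 1
    ir_dot_array_camera: bool = True  # texture-assist infrared device

def build_rig(num_groups: int = 8) -> List[CaptureGroup]:
    """Distribute capture groups around 360 degrees, alternating rings,
    so the target is covered without blind angles."""
    return [
        CaptureGroup(
            group_id=i,
            yaw_deg=i * 360.0 / num_groups,
            elevation="upper" if i % 2 == 0 else "lower",
        )
        for i in range(num_groups)
    ]

if __name__ == "__main__":
    for g in build_rig():
        print(g)
```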
Step 102, acquiring a target reference model, determined from a preset reference model set, that matches the current gesture of the target object, or acquiring a variant of the target reference model; the reference models in the reference model set are models obtained in advance by three-dimensional object modeling of reference objects in different postures.
An embodiment of the application provides a reference model set, in which the reference models are models obtained in advance by three-dimensional object modeling of reference objects in different postures. For example, human-body 3D models of a reference object (such as a designated customer service person) in various postures (such as squatting, sitting, standing, twisting, bending and rolling) are stored in a model library in advance, providing a model basis for the subsequent real-time 3D reconstruction of the target object and correspondingly accelerating the 3D reconstruction process and reducing its time consumption.
In information interaction scenarios such as real-time communication between customer service and customers, real-time 3D reconstruction of a target object such as a customer service person first identifies the current gesture of the target object based on gesture-perception technology; the current gesture is the real-time gesture of the target object, such as a sitting or standing posture. A target reference model matching the current gesture is then screened from the reference model set, or a variant of the target reference model is obtained directly, to serve as the model base for real-time 3D reconstruction of the target object.
The variant of the target reference model is: a model frame obtained by historical three-dimensional modeling of the target object based on a historical gesture of the target object within the current period of the current gesture and on the target reference model.
Step 103, determining, from the three-dimensional object information, the deformation information by which the target object in the current gesture is deformed relative to the target reference model or its variant.
The application takes the matched target reference model, or the obtained variant thereof, as the model base and, combined with the real-time three-dimensional object information of the target object in the corresponding gesture, parameterizes the deformation of the target object caused by real-time gesture changes, specifically its non-rigid deformation, so as to realize an estimate of the reconstructed object surface that improves over time.
Concretely, spatial position information of each pixel point on the images acquired in the current gesture is determined from the color images and depth images respectively corresponding to the target object at different view angles of the scene in the current gesture, and the deformation parameter information of the target object's non-rigid deformation relative to the target reference model or its variant in the current gesture is then determined from the spatial position information of the pixel points.
The deformation parameter information of the non-rigid deformation at least includes: position increment information for each position on the target object whose position has changed relative to the corresponding position of the target reference model or its variant; the position increment information further includes the corresponding spatial position coordinate change values and change directions.
Specifically, the spatial position information corresponding to each pixel point on each color image (RGB image) in the current gesture can be determined from the depth data and plane position information of the pixel points on the color images at the different view angles. When determining the depth data of the pixel points on each color image, this embodiment uses the depth images and texture information respectively corresponding to the target object at different view angles of the scene in the current gesture. That is, texture information for the corresponding view angle, acquired with devices such as random infrared dot-array cameras, assists the depth estimation based on the depth images. This further improves the depth-estimation accuracy for each pixel on the RGB images at different view angles, improves the accuracy of real-time 3D modeling of the target object, and better mitigates the poor stereoscopic effect caused by holes and offset pixels.
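As an aid to the back-projection step described above, here is a minimal sketch, assuming a standard pinhole camera model with known intrinsics (fx, fy, cx, cy), of how a pixel's plane position and its depth combine into a camera-space 3D point; the patent does not fix a particular camera model, so the model choice is an assumption.

```python
# Pinhole back-projection sketch (assumed camera model, illustrative only).
import numpy as np

def unproject(depth: np.ndarray, fx: float, fy: float,
              cx: float, cy: float) -> np.ndarray:
    """Return an (H, W, 3) array of camera-space 3D points for a depth map."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))  # pixel plane positions
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return np.stack([x, y, depth], axis=-1)

# Example: a 4x4 depth map at 1 m everywhere.
points = unproject(np.ones((4, 4)), fx=500.0, fy=500.0, cx=2.0, cy=2.0)
print(points.shape)  # (4, 4, 3)
```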
The application guarantees real-time communication between the two interacting parties through real-time depth estimation. Depth estimation infers the depth information of three-dimensional scene objects from plane images, facilitating scene understanding, object depth perception and three-dimensional reconstruction of objects in the scene; to guarantee depth estimation from multiple viewpoints and angles with consistent result data, it is performed in real time. In the application, an infrared acquisition device (such as a random infrared dot-array camera) is added beside the camera device (such as the RGB camera) of each acquisition group as a texture acquisition device; several infrared acquisition devices surrounding the scene through 360 degrees illuminate the whole scene and collect multi-view texture information of the target object to assist the depth estimation based on the depth images. Estimating depth for the plane images by combining texture assistance with multi-angle depth estimation improves the accuracy of the depth estimation.
In addition, to achieve real-time depth estimation, this embodiment may optionally initialize each pixel in each frame independently (for example, initializing the depth of each pixel on each view-angle RGB image to the depth of the corresponding matching point on the target reference model), divide each image (RGB image, depth image) into several equal-size regions, and process the different regions in parallel using the concurrency of a high-performance, multi-threaded GPU. The depth data of the pixels in different regions is obtained by parallel computation, for example by sweeping the equal-size regions of whole rows and whole columns of the image in four directions: left to right, top to bottom, right to left and bottom to top. Preferably, when the regions of the same image are processed in parallel from multiple directions, processed regions are marked during processing, and a region reached from another direction that is already marked is skipped rather than processed again. Finally, noise-outlier removal and edge-preserving smoothing are applied to the depth data, achieving high-quality, real-time dense depth mapping on the GPU's CUDA parallel computing architecture.
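The following is a simplified, sequential sketch of the tiling control flow just described: equal-size regions, four sweep directions, and skipping of regions already marked as processed. The per-region "refinement" is a placeholder; in the described system the regions would be processed concurrently by GPU/CUDA kernels, so this sequential version is an assumption for illustration only.

```python
# Sequential stand-in for the multi-direction, region-parallel depth sweep.
import numpy as np

def refine_depth(init_depth: np.ndarray, tile: int = 64) -> np.ndarray:
    h, w = init_depth.shape
    depth = init_depth.copy()
    rows, cols = h // tile, w // tile
    processed = np.zeros((rows, cols), dtype=bool)
    # Four sweep orders over the tile grid: L->R, T->B, R->L, B->T.
    sweeps = [
        [(r, c) for r in range(rows) for c in range(cols)],
        [(r, c) for c in range(cols) for r in range(rows)],
        [(r, c) for r in range(rows) for c in reversed(range(cols))],
        [(r, c) for c in range(cols) for r in reversed(range(rows))],
    ]
    for order in sweeps:
        for r, c in order:
            if processed[r, c]:   # already handled from another direction: skip
                continue
            sl = (slice(r * tile, (r + 1) * tile),
                  slice(c * tile, (c + 1) * tile))
            # Placeholder per-region refinement (stand-in for the
            # texture-assisted multi-angle depth estimate).
            depth[sl] = np.median(depth[sl])
            processed[r, c] = True
    return depth
```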
Step 104, generating a three-dimensional object model of the target object in the current gesture from the target reference model or its variant and the deformation information, so as to project, based on that three-dimensional object model, a three-dimensional image of the target object in the current gesture into the physical space where the interaction opposite end is located.
Then, deformation information of the target object corresponding to the current posture may be further superimposed on the target reference model, or deformation information of the target object corresponding to the current posture may be superimposed on a variant of the target reference model, and a three-dimensional object model of the target object corresponding to the current posture may be generated based on the result of the superimposition.
Specifically, if the pose type of the target object's current gesture differs from that at the previous moment (for example, the target object changes from sitting to standing), so that the reference model matching the current gesture differs from the one used at the previous moment, the target reference model matching the current gesture is obtained from the reference model set, and the three-dimensional object model of the target object in the current gesture is generated from that target reference model and the deformation information of the target object's current gesture relative to it.
Otherwise, if the current gesture of the target object has the same pose type as at the previous moment (i.e. the real-time gesture change is too small to induce a change of pose type), the three-dimensional object model frame of the previous moment can preferably be used as the model base; that model frame is in essence the variant (variant model) of the target reference model matching the current gesture. Correspondingly, the three-dimensional object model of the target object in the current gesture is generated from this variant model and the deformation information by which the current gesture of the target object is deformed relative to it. In this case the target reference model itself may instead be used as the model base, performing the real-time 3D reconstruction in the current gesture from the target reference model and the deformation information of the current gesture relative to it; this is not limited.
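The model-base selection logic of the two preceding paragraphs can be summarized in a short sketch; the class and function names below are illustrative assumptions, not part of the patent.

```python
# Sketch: choose a fresh reference model on pose-type change,
# otherwise reuse the previous frame's model frame (the "variant").
from typing import Dict, Optional

class ReferenceModelSet:
    def __init__(self, models: Dict[str, object]):
        self._models = models            # pose type -> pre-built 3D model

    def match(self, pose_type: str) -> object:
        return self._models[pose_type]

def select_model_base(pose_type: str,
                      prev_pose_type: Optional[str],
                      prev_frame_model: Optional[object],
                      ref_set: ReferenceModelSet) -> object:
    if prev_pose_type == pose_type and prev_frame_model is not None:
        return prev_frame_model          # variant: last frame's model frame
    return ref_set.match(pose_type)      # pose type changed: new reference model
```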
That is, when real-time three-dimensional modeling of the target object fuses the RGB image data, depth-map data and texture data on top of the base model, the non-rigid motion (i.e. non-rigid deformation) between frames can be estimated, and the real-time depth data of each frame is fused incrementally into the target reference model based on that inter-frame non-rigid motion, so that the three-dimensional model of the target object generated online deforms continuously. This realizes continuous parameterization of the target object's real-time non-rigid deformation on the basis of the target reference model, and an estimate of the reconstructed object surface that improves continuously over time. Meanwhile, further combining the high-concurrency processing capability of the GPU accelerates three-dimensional reconstruction to a real-time level even more effectively and supports real-time reconstruction of three-dimensional objects in more complex scenes.
Current dynamic three-dimensional reconstruction technology is likewise limited to offline, non-real-time processing, with slow processing speed and long reconstruction periods, and cannot realize 3D reconstruction with synchronized audio and video. Correspondingly, the customer service systems of commercial banks and the like currently rely mainly on text, voice calls and video calls for communication between customer service and clients, lacking face-to-face perceptual communication. By continuously parameterizing the real-time non-rigid deformation of the target object with the reference model or its variant as the model base, the application obtains an estimate of the continuously reconstructed object surface that improves over time (optionally also drawing on the GPU's highly concurrent processing capability), accelerates three-dimensional reconstruction to a real-time level, effectively solves the problems in the prior art, and enables the two interacting parties to communicate "face to face" across space and time.
Optionally, when the three-dimensional object model of the target object in the current posture is generated from the target reference model or its variant and the deformation information, a three-dimensional grid model of the target object in the current posture is first generated from the target reference model or its variant and the position increment information of each pixel point on the acquired images whose position has changed relative to the corresponding position on the target reference model or its variant. Then, color and/or texture mapping is performed on the three-dimensional model frame corresponding to the grid model according to the multi-view color images and/or texture information, combined with color and/or texture rendering, so that the surface of the three-dimensional model of the target object in the current posture carries textures and colors; this yields the three-dimensional object model of the target object in the current posture. The three-dimensional object model reflects the current real-time gesture and appearance of the target object, such as a customer service person, and even finer object detail such as body surface color and texture.
The three-dimensional grid model contains the spatial position coordinates of each point on the target object's three-dimensional model frame. Alternatively, the grid model can implicitly represent the model surface of the three-dimensional model to be reconstructed by storing, for each point on the target reference model or its variant model frame, the distance vector from that point to the corresponding point on the surface of the model being reconstructed (i.e. the position increment of the point on the target object relative to the corresponding point of the target reference model or its variant).
In addition, a corresponding index may optionally be established for each point, so that the stereo grid model contains the target object's stereo model structure together with the correspondence between each point's index and its point value (such as a spatial position coordinate or a distance vector); the index of a point includes, but is not limited to, the identifier/number of the region to which the point belongs (such as head or leg) and the number of the point.
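A minimal data-structure sketch of the indexed grid model just described, in which each point's index pairs a region identifier with a point number and each point value is either a spatial coordinate or a distance vector (offset); the names below are illustrative assumptions.

```python
# Sketch: indexed point store for the stereo grid model.
from dataclasses import dataclass
from typing import Dict, Tuple

PointIndex = Tuple[str, int]           # (region id, e.g. "head", point number)

@dataclass
class PointValue:
    kind: str                          # "coordinate" or "offset" (distance vector)
    xyz: Tuple[float, float, float]

class StereoMeshModel:
    def __init__(self) -> None:
        self.points: Dict[PointIndex, PointValue] = {}

    def set_point(self, region: str, number: int, value: PointValue) -> None:
        self.points[(region, number)] = value

mesh = StereoMeshModel()
mesh.set_point("head", 42, PointValue("offset", (0.001, -0.004, 0.002)))
```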
Then, based on the three-dimensional object model obtained by the real-time three-dimensional modeling, a three-dimensional image corresponding to the target object in the current gesture can be projected to the physical space where the interaction opposite end is located, for example, a real-time three-dimensional image corresponding to the customer service real person is projected to the physical space where the customer is located.
Alternatively, projection of the target object's three-dimensional image can be realized with spatial-anchor positioning technology. Specifically, physical space information of the interacting object can be acquired, for example multi-view environment images of the scene where the interacting object is located; spatial-anchor positioning is then performed from the acquired physical space information, and on that basis the three-dimensional image of the target object corresponding to the real-time three-dimensional object model in the current gesture is projected according to the located anchor.
In addition, when the 3D stereoscopic image of the target object (such as customer service) is presented in the physical space where the interaction opposite end (such as the client) is positioned based on the projection of the three-dimensional image, the presentation result can be dynamically adjusted in real time by combining different gestures of the customer, so that the stability of the presentation result is improved.
The client can use special equipment as the receiving end and, combined with spatial-anchor positioning processing running on the corresponding processing device, display the three-dimensionally reconstructed remote customer service person in the client's physical space.
The special equipment may be, but is not limited to, MR (mixed reality), VR (virtual reality) or AR (augmented reality) devices. Preferably, the client wears MR glasses (such as HoloLens) as the receiving end, which give a better stereoscopic experience, so that the customer can immersively experience the customer service person as sharing the same physical world. The customer service end may also wear MR glasses to perceive a 3D stereoscopic image of the client (though it need not), and the two parties achieve cross-space-time "face-to-face" communication by means of special devices such as MR glasses.
It should be noted that if the processing performance of the processing device of the interaction opposite end (such as the client) is strong enough, the generated stereoscopic grid model and the matched color information and texture information thereof can be transmitted to the processing device of the interaction opposite end for performing mapping and rendering of the color/texture, and the three-dimensional image projection processing based on the spatial anchor point positioning technology, without limitation.
In summary, the information interaction method provided by the application provides a reference model set, in which the reference models are models obtained in advance by three-dimensional object modeling of reference objects in different postures. On this basis, during information interaction, the current interaction end selects from the reference model set a target reference model matching the current gesture of the target object, builds a three-dimensional object model of the target object in the current gesture from real-time deformation information of the target object relative to the target reference model or a variant thereof, and projects, according to the generated three-dimensional object model, a three-dimensional image of the target object in the current gesture into the physical space where the interaction opposite end is located.
By parameterizing, with the reference model or its variant, the deformation of the target object caused by real-time gesture changes, the application realizes an estimate of the continuously reconstructed target object surface that improves over time and accelerates three-dimensional reconstruction to a real-time level, so that each interacting party can immersively perceive the opposite end as being in the same physical space as itself, realizing a cross-space-time "face-to-face" communication experience between the two parties and improving interaction efficiency and quality.
Optionally, in an embodiment, referring to FIG. 2, before generating the three-dimensional object model corresponding to the target object in the current pose, the information interaction method provided by the present application may further include the following processing:
step 201, acquiring an environment background image corresponding to the target object in the current gesture and a fusion image containing the environment background and the target object.
Step 202, determining a foreground image of the target object according to the environmental background image and the fusion image, so that a three-dimensional object model matched with the foreground view angle of the target object is generated for the target object in step 104, and correspondingly, when the three-dimensional image of the target object is projected to the interaction opposite end, the foreground part of the target object faces the interaction opposite end.
The foreground of the target object is defined relative to the scene background where the target object is located; a green screen may be used as the background, or no green screen, keeping the real scene background of the natural environment. The foreground image of the target object is correspondingly the image of the target object's foreground part: for example, if the target object's back is to the scene background, the target object's front image is its foreground image; more generally, the image of the side of the target object facing away from the scene background is its foreground image.
The foreground image of the target object includes an RGB image and a depth image. Optionally, the scene's background information (the depth map and RGB map of the empty scene) can be extracted and recorded in advance from the images (RGB images and depth images) collected by the cameras; the data of the whole scene (scene background plus target object) is then obtained from the collected images, and foreground-map extraction of the target object is realized by combining the GPU with a mean-field inference algorithm.
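A sketch of the seed step of this foreground extraction, assuming a pre-recorded empty-scene background: pixels whose depth or color deviates from the background beyond illustrative thresholds are marked foreground. The mean-field refinement mentioned above is omitted, and the threshold values are assumptions.

```python
# Background-subtraction seed mask (illustrative thresholds; refine with
# a mean-field inference pass in the full pipeline).
import numpy as np

def foreground_mask(live_depth: np.ndarray, bg_depth: np.ndarray,
                    live_rgb: np.ndarray, bg_rgb: np.ndarray,
                    depth_tol: float = 0.05,      # metres
                    color_tol: float = 30.0) -> np.ndarray:
    depth_diff = np.abs(live_depth - bg_depth) > depth_tol
    color_diff = np.linalg.norm(live_rgb.astype(float) -
                                bg_rgb.astype(float), axis=-1) > color_tol
    return depth_diff | color_diff    # boolean (H, W) seed mask
```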
Subsequently, when the three-dimensional object model of the target object in the current gesture is generated in step 104, a three-dimensional object model matching the target object's foreground view angle is generated from the target reference model or its variant, the deformation information of the target object in the current gesture, and the generated foreground map. Correspondingly, when the three-dimensional image of the target object is projected into the physical space where the interaction opposite end is located, the foreground part of the target object is always made to face the interaction opposite end, giving both interacting parties a better interaction experience.
Optionally, in an embodiment, referring to FIG. 3, the information interaction method provided by the present application may further include the following processing:
Step 301, acquiring synchronously collected audio data of the target object in the current gesture.
Audio is the basic medium of mutual communication, and audio and visuals need to be synchronized and matched for optimal immersion. In the application, a series of audio acquisition devices such as microphones are arranged at different positions and angles around the scene where the target object is located to collect the voice information of the customer service person.
Further, the acquired audio data includes an audio stream and a gesture stream for characterizing a head gesture of the target object, wherein the audio stream may contain one or more audio frames of the target object, and the gesture stream may include one or more sets of head-region spatial position coordinates or orientation information of the target object.
In the case where the collected audio data includes interfering sounds, for example when the target object is in a shared space with an interfering sound source, interference filtering may further be performed according to the target object's voiceprint features and/or sound-intensity features (the target object's sound is generally taken to be the loudest of all collected sounds).
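The following sketch illustrates one possible shape for the synchronized audio data described above (an audio stream plus a head-pose stream), together with a loudest-source heuristic standing in for the intensity-based interference filtering; all structures and names are assumptions for illustration.

```python
# Hypothetical containers for the audio stream and pose (gesture) stream.
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class AudioFrame:
    timestamp_ms: int
    samples: bytes                 # PCM payload
    rms_level: float               # measured intensity

@dataclass
class HeadPose:
    timestamp_ms: int
    position: Tuple[float, float, float]
    orientation: Tuple[float, float, float]   # e.g. yaw, pitch, roll

def filter_interference(frames_per_source: List[List[AudioFrame]]) -> List[AudioFrame]:
    """Keep the source with the highest average intensity; the target
    object's voice is assumed to be the loudest of all captured sources."""
    def avg(frames: List[AudioFrame]) -> float:
        return sum(f.rms_level for f in frames) / max(len(frames), 1)
    return max(frames_per_source, key=avg)
```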
Step 302, synchronizing sound information of the target object to the interaction opposite terminal based on the audio data when the three-dimensional image of the target object is projected.
Building on the synchronous collection of audio data, when the three-dimensional image of the target object is projected into the physical space where the interaction opposite end is located, the target object's sound information is synchronized to the interaction opposite end based on the collected audio data.
Specifically, the position, direction and related information of the sound emitted by the target object can be analyzed from the gesture stream; combined with the spatial-anchor positioning technology, and taking the located anchor as the reference, the target object's sound is output at the interaction opposite end using the recognized sound position and direction information, so that the output sound information carries position-related spatial effects. The interaction opposite end correspondingly experiences lifelike realism in left/right audio, spatial effects and audio direction.
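As a small illustration of the anchor-relative spatialization just described, the sketch below derives the unit direction from the located anchor toward the speaker's head, which a spatial-audio renderer could then use to place the voice; actual rendering (HRTF and the like) is outside the scope of this sketch and the computation shown is an assumption.

```python
# Direction from the projection anchor toward the speaker's head pose.
import numpy as np

def sound_direction(anchor_pos: np.ndarray, head_pos: np.ndarray) -> np.ndarray:
    """Unit vector used to position the voice in the listener's space."""
    d = head_pos - anchor_pos
    n = np.linalg.norm(d)
    return d / n if n > 0 else np.array([0.0, 0.0, 1.0])

direction = sound_direction(np.array([0.0, 0.0, 0.0]),
                            np.array([0.2, 1.6, 0.5]))
```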
Optionally, the information interaction method provided by the embodiment of the application can be executed through the cooperation of the customer service side processing equipment and the server aiming at the target object of customer service personnel (such as bank customer service personnel). The customer service side processing device can be a customer service side PC or any other device with processing functions.
The customer service processing equipment can generate a three-dimensional grid model of the target object by executing the corresponding steps of the method, and transmit data to be transmitted to a server for processing; the data to be transmitted here may include: the three-dimensional grid model comprises a color image, texture information, at least part of information in a foreground image of a target object, a three-dimensional grid model and synchronously acquired audio data, wherein the color image and the texture information correspond to the target object under multiple viewing angles.
The server renders a three-dimensional model frame corresponding to the three-dimensional grid model based on the received color image, texture information and at least part of information in the foreground image, projects a three-dimensional image corresponding to the target object in the current gesture to the physical space where the interaction opposite terminal is located based on the three-dimensional object model obtained by rendering, and synchronizes sound information of the target object to the interaction opposite terminal based on the received audio data.
Further, optionally, rendering of the model frame corresponding to the target object's three-dimensional grid model is split into two parts: local rendering on the customer-service-side processing device and remote rendering on the server. Local rendering completes the foreground part of the model frame based on the target object's foreground map and sets the regions of the other parts of the model to a constant background color value; the server side then completes the subsequent rendering of those other regions based on the received RGB images and texture information for each view angle of the target object.
Before transmitting the data to the server, the customer service processing device may further perform compression processing on the data to be transmitted, where the compression processing at least includes: and determining repeated data of each site of the three-dimensional grid model of the target object at the current time sequence point compared with the historical time sequence point, and eliminating the repeated data.
Specifically, the three-dimensional grid model of the target object contains the correspondence between each point's index and its point value (spatial position coordinate or distance vector), and may contain corresponding rendering information if its foreground part has been rendered. If some points of the grid model to be transmitted to the server at the current moment have the same point values as in the grid model transmitted at the previous moment, i.e. they are duplicate points, only the indexes of those duplicate points are transmitted this time, not their values. For the missing point data produced by eliminating duplicates, the server retrieves the data from the grid model of the corresponding historical time-sequence point (the previous moment), reducing the transmitted data volume as much as possible through this compression and improving transmission efficiency.
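The duplicate-elimination compression described above can be sketched as follows: unchanged points travel as index-only entries (value None), and the receiving side restores them from its copy of the previous time-step mesh. The wire format shown is an assumption, reusing the index/value shape of the earlier StereoMeshModel sketch.

```python
# Sketch: per-time-step delta compression of the grid model.
from typing import Dict, Optional, Tuple

PointIndex = Tuple[str, int]
Value = Tuple[float, float, float]

def compress(current: Dict[PointIndex, Value],
             previous: Dict[PointIndex, Value]) -> Dict[PointIndex, Optional[Value]]:
    """None marks a repeated point: only the index travels over the wire."""
    return {idx: (None if previous.get(idx) == val else val)
            for idx, val in current.items()}

def decompress(received: Dict[PointIndex, Optional[Value]],
               history: Dict[PointIndex, Value]) -> Dict[PointIndex, Value]:
    """Server side: fill missing point values from the historical mesh."""
    return {idx: (history[idx] if val is None else val)
            for idx, val in received.items()}
```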
It should be noted that, in actual application, the method is not limited to cooperation between the customer-service-side processing device and the server; it can also be implemented by just one of them, for example using only the customer-service-side PC to run the entire processing flow of the method, or transmitting all the raw collected data to the server and letting the server perform all the processing.
An application example of the method of the application is provided below by taking a real-time remote communication scene of a bank customer service person and a customer as an example.
In this example, the technician is responsible for the following preparatory work:
a. setting up a customer service end scene acquisition environment;
b. developing a scene object information acquisition and real-time 4D reconstruction system;
c. developing a server system;
d. the development client receives the system.
In the scene acquisition environment, professional acquisition equipment collects information on objects in the scene (in this example, the scene object is a live customer service agent). The scene object information acquisition and real-time 4D reconstruction system processes the collected depth information, RGB color information, texture information, spatial audio and the like of the agent and performs real-time 4D reconstruction; the reconstructed 4D data is compressed and sent to the server; the server remotely renders the 3D data combined with the color, texture and other data; and the client uses dedicated MR glasses as the receiving end, combined with spatial anchor point positioning, so that the reconstructed remote customer service agent is presented on the client side.
The acquisition environment that was built and the systems that were developed are described in detail as follows:
(I) Customer-service-end scene acquisition environment
The system comprises a scene object acquisition device and a data processing device.
The scene object acquisition devices mainly collect depth information, RGB color information, texture information, spatial audio and the like of the target object in the scene. Each group of acquisition devices comprises 2 NIR depth cameras, 1 RGB camera and 1 structured-light sensor; 8 such groups are configured at the customer service end around the customer service agent, so that high-quality 3D images and sound of the agent can be presented and the subject can be captured through 360 degrees without dead angles; each RGB camera provides more than 20 million pixels. Several (e.g. 4) high-performance PCs with high-end dual-GPU graphics cards may be deployed correspondingly, with every two adjacent groups of acquisition devices connected to one PC.
(II) Scene object information acquisition and real-time 4D reconstruction system
It comprises: an information acquisition module, a real-time depth estimation module, a foreground image extraction module for the scene target object, a real-time 3D model reconstruction module, a spatial audio processing module, and a mesh/RGB/audio data compression and transmission module.
The information acquisition module: this module collects depth, RGB color, spatial audio and other information of the reconstruction object, i.e. the customer service agent, through 360 degrees without dead angles. Since it mainly captures the agent's appearance, limb movements, voice and the like, the acquisition cameras in the scene focus on the agent; multi-angle shooting yields a group of plane image sequences, and these image sequences are the data input of the reconstruction system.
The real-time depth estimation module: depth estimation infers the depth information of three-dimensional scene objects from plane images, which facilitates scene understanding, object depth perception and three-dimensional reconstruction of objects in the scene. To keep the depth estimation results consistent across multiple viewpoints and angles, the estimation is performed in real time. In this example, an infrared acquisition device (a random infrared dot-array camera) is added beside the camera of each group of acquisition devices; the infrared devices surrounding the scene through 360 degrees illuminate the whole scene and capture texture information of the agent at each view angle to assist depth estimation, which is performed on the plane images by combining texture assistance with multi-angle depth estimation.
To achieve real-time depth estimation, each pixel in each image frame is initialized independently; the image is then divided into equal-size regions, and depth data is computed in parallel for whole rows and columns of regions in four directions (left to right, top to bottom, right to left and bottom to top) using the high concurrency of a high-performance GPU. Processed regions are marked during processing; if a region reached from another direction has already been processed, it is skipped rather than processed again. Finally, noise removal and edge-preserving smoothing are applied to the depth data, so that the GPU-based CUDA parallel computing architecture achieves high-quality, real-time dense depth mapping.
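The following toy Python sketch illustrates the four-direction traversal with processed-region marking described above (the tile size, the placeholder refinement rule and the interleaving scheme are our illustrative assumptions; the real system runs whole rows and columns of regions concurrently on the GPU):

```python
import numpy as np

def sweep_depth(init_depth, tile=32):
    """init_depth: (H, W) per-pixel independent initialization (H, W assumed
    divisible by `tile`). The image is divided into equal-size regions and
    refined along four directions; a region already processed when reached
    from another direction is skipped. The sweeps are interleaved here to
    mimic their concurrent advance on the GPU."""
    d = init_depth.astype(np.float32).copy()
    h, w = d.shape
    ty, tx = h // tile, w // tile
    done = np.zeros((ty, tx), dtype=bool)

    def refine(y, x):
        if done[y, x]:                    # processed region: skip, do not repeat
            return
        sl = np.s_[y * tile:(y + 1) * tile, x * tile:(x + 1) * tile]
        d[sl] = 0.5 * d[sl] + 0.5 * np.median(d[sl])   # placeholder refinement
        done[y, x] = True

    for step in range(max(tx, ty)):
        if step < tx:
            for y in range(ty):
                refine(y, step)           # left -> right
                refine(y, tx - 1 - step)  # right -> left
        if step < ty:
            for x in range(tx):
                refine(step, x)           # top -> bottom
                refine(ty - 1 - step, x)  # bottom -> top
    return d
```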
The foreground image extraction module for the scene target object: extracting the foreground of the scene target object provides a two-dimensional contour for object reconstruction, which makes it convenient to generate for the client a three-dimensional object model corresponding to the foreground view angle of the scene target object; this benefits real-time three-dimensional object reconstruction and facilitates compression of the transmitted data.
The foreground map of the scene target object includes two parts: an RGB image and a depth image. Specifically, the background information of the scene (i.e. a depth map and an RGB map of the empty scene) is first acquired and recorded from each camera; data of the whole scene (the scene background plus the scene target object) is then acquired from each camera; and the GPU, combined with a mean-field inference algorithm, extracts the foreground map of the scene target object.
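A minimal sketch of the foreground extraction idea, assuming simple per-pixel thresholds in place of the GPU mean-field inference stage (the thresholds and function name are illustrative assumptions):

```python
import numpy as np

def extract_foreground(rgb, depth, bg_rgb, bg_depth,
                       depth_tol=0.05, rgb_tol=30):
    """rgb/depth: live scene images; bg_rgb/bg_depth: the recorded empty-scene
    background. A pixel is foreground if it differs from the background in
    depth or colour; the returned foreground map keeps both an RGB part and a
    depth part, as described above."""
    depth_fg = np.abs(depth - bg_depth) > depth_tol
    rgb_fg = np.linalg.norm(rgb.astype(np.int16) - bg_rgb, axis=-1) > rgb_tol
    mask = depth_fg | rgb_fg
    fg_rgb = np.where(mask[..., None], rgb, 0)    # background set to a constant value
    fg_depth = np.where(mask, depth, 0.0)
    return fg_rgb, fg_depth, mask
```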
The real-time 3D model reconstruction module: based on the multi-view RGB images of the 8 cameras, the depth data of each frame from all cameras is fused to generate the three-dimensional mesh model of the target object, which alleviates to some extent the holes and stray pixels caused by time offsets, noise interference and pixel blurring of the target object during reconstruction, realizing 3D model reconstruction of the scene target object. While the depth-map data of each frame is fused, the non-rigid scene motion between frames is estimated, the template generated online is deformed, and the depth data is fused incrementally into the reference model; the non-rigid deformation of the agent is parameterized on the basis of the reference model or its variant, so that the estimate of the reconstructed object's surface (the customer service agent) advances over time. Combined with the high-concurrency processing capability of the GPU, three-dimensional reconstruction is accelerated to real time, enabling real-time three-dimensional object reconstruction even in relatively complex scenes.
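In spirit, and only as a schematic sketch under our own naming (the patent gives no code; the per-vertex increments correspond to the deformation parameter information described elsewhere in the application, and the running-average weight is an assumed choice), the reference-model deformation and incremental depth fusion can be pictured as:

```python
import numpy as np

def deform_reference(ref_vertices, increments):
    """ref_vertices: (N, 3) vertices of the reference model (or of its variant,
    i.e. the model frame produced for an earlier pose in the same period);
    increments: (N, 3) per-vertex position deltas estimated from the current
    frame's non-rigid motion. The result is the mesh for the current pose and
    becomes the variant used when the next frame arrives."""
    return ref_vertices + increments

def fuse_depth_volume(ref_volume, frame_volume, weight=0.1):
    """Incrementally blend the current frame's depth observations into the
    running reference volume with a simple running average, so the surface
    estimate is refined continuously over time."""
    return (1.0 - weight) * ref_volume + weight * frame_volume
```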
Spatial audio processing module: audio is the basic medium of mutual communication, and to achieve optimal immersion, hearing and vision must be synchronized and matched. In this example, a microphone array deployed in the customer service scene captures the agent's audio; besides the audio frames of the target object, the audio data includes the agent's head pose information (head space coordinates, head orientation, etc.). At the remote client's receiving end, combined with the spatial anchor point positioning function, the customer can therefore perceive left/right audio, spatial sound effects, audio direction and other realistic cues of a person present in the scene.
The mesh, RGB and audio data compression and transmission module: the foregoing steps and modules generate a large amount of frame data, and to achieve optimal image quality, real-time rendering and ordered data at the receiving end in a network environment, the mesh, RGB, audio and other data need to be compressed, converted and transmitted.
To compress the reconstructed three-dimensional mesh model, repeated data in the mesh model is first eliminated; data such as mesh positions and normals is reduced to 16 bits on the GPU and transferred to host memory for serialization; and finally, while serializing, the mesh index data is compressed with LZ4 (a fast lossless compression algorithm).
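A hedged sketch of this mesh compression path, assuming the python-lz4 binding (`lz4.frame`) and float16 as the 16-bit reduction (the real pipeline performs the reduction on the GPU before copying to host memory):

```python
import numpy as np
import lz4.frame  # python-lz4 binding, assumed available

def pack_mesh(positions, normals, indices):
    """positions, normals: (N, 3) float32 arrays; indices: (M,) vertex indexes.
    Reduce position/normal data to 16 bits, serialize it, and LZ4-compress
    the mesh index data, mirroring the steps described above."""
    q_pos = positions.astype(np.float16)             # 32-bit -> 16-bit
    q_norm = normals.astype(np.float16)
    serialized = q_pos.tobytes() + q_norm.tobytes()  # host-side serialization
    packed_idx = lz4.frame.compress(indices.astype(np.uint32).tobytes())
    return serialized, packed_idx
```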
The RGB data is used for local and remote rendering of the model frame corresponding to the reconstructed three-dimensional mesh model. On the customer service PC, the frame of the mesh model only needs to be rendered according to the foreground image data of the target object in the scene; before the data is transmitted to the server, the region outside the foreground portion of the mesh model can be set to a constant background color value and then compressed with LZ4, which effectively reduces the average storage space of the color frame data.
Besides the audio stream of the target object, the audio data also includes a pose stream, so that position-dependent spatial sound effects can be produced at the receiving end; the audio data can likewise be compressed with LZ4. During transmission, the audio data and the other data are transmitted independently in both directions, and a buffer caches the audio data at the receiving end.
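A minimal receiving-end buffering sketch (the class name and frame thresholds are illustrative assumptions; a real system would also handle reordering and timestamp alignment with the rendered frames):

```python
from collections import deque

class AudioJitterBuffer:
    """Receiving-end cache for the independently transmitted audio stream:
    frames are queued as they arrive and drained at a steady rate so playback
    stays smooth despite network jitter."""
    def __init__(self, min_frames=3, max_frames=20):
        self.frames = deque(maxlen=max_frames)  # oldest frames dropped if full
        self.min_frames = min_frames

    def push(self, frame):
        """Called from the network thread as audio frames arrive."""
        self.frames.append(frame)

    def pop(self):
        """Called from the playback thread; returns None (silence) until
        enough frames are buffered."""
        if len(self.frames) < self.min_frames:
            return None
        return self.frames.popleft()
```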
(III) Server-side system
It comprises: a streaming media data access and distribution module, a color and texture mapping module, and a remote rendering module.
Streaming media data access and distribution module: this module receives the compressed mesh, RGB, spatial audio and other streaming media data of the acquired object from the customer-service-side acquisition and reconstruction system and, combined with operations such as channel management and video acceleration, distributes the streaming media information to the receiving end of the designated requesting client.
Color and texture mapping module: after the depth data is fused, a polygonal three-dimensional model is extracted from the TSDF model (the TSDF model divides the whole three-dimensional space to be reconstructed into geometric grid cells, each storing a numerical value whose magnitude represents the distance from the cell to the reconstructed surface, so the TSDF represents the reconstructed surface implicitly); texture and color mapping is then applied to the model using the RGB images collected from the 8 cameras, so that the surface of the reconstructed object model carries texture and color.
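To make the parenthetical concrete, here is a compact TSDF integration sketch (the variable names, uniform weighting and truncation distance are illustrative; production systems run this per voxel on the GPU and extract the polygonal surface with an algorithm such as marching cubes):

```python
import numpy as np

def integrate_tsdf(tsdf, weights, depth, K, T_cam_world, voxel_size, trunc=0.04):
    """tsdf, weights: (X, Y, Z) volumes updated in place; depth: (H, W) depth
    map (0 = missing); K: 3x3 intrinsics; T_cam_world: 4x4 camera-from-world
    pose. Each voxel stores a truncated signed distance to the surface; the
    surface is the zero level set."""
    dims = tsdf.shape
    vox = np.indices(dims).reshape(3, -1).astype(np.float32) * voxel_size
    cam = T_cam_world[:3, :3] @ vox + T_cam_world[:3, 3:4]   # camera coords
    z = cam[2]
    uv = (K @ cam)[:2] / np.clip(z, 1e-6, None)              # project to pixels
    u = np.round(uv[0]).astype(int)
    v = np.round(uv[1]).astype(int)
    h, w = depth.shape
    ok = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    d = np.zeros_like(z)
    d[ok] = depth[v[ok], u[ok]]
    valid = ok & (d > 0)
    sdf = np.clip((d - z) / trunc, -1.0, 1.0)                # truncated distance
    upd = valid & (sdf > -1.0)          # skip voxels far behind the surface
    ft, fw = tsdf.reshape(-1), weights.reshape(-1)           # views on the volumes
    ft[upd] = (ft[upd] * fw[upd] + sdf[upd]) / (fw[upd] + 1.0)
    fw[upd] += 1.0
```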
Remote rendering module: the client uses an MR glasses device as the receiving end, and because the GPU processing capability of current MR glasses is limited, this module remotely renders the target to be displayed on the server side before transmitting it to the receiving end, so that every detail of the customer service agent can be displayed in the client's field of view in real time and at high quality.
(IV) Client receiving system
It comprises an information presentation and spatial anchor point positioning module.
Information presentation and spatial anchor point positioning module: to make the image presented on the MR device more realistic, the customer should experience the rendered customer service agent as if the two were in the same physical space. The module captures the client's 6-degree-of-freedom pose, headset pose and eye pose in real time, while the front camera and depth camera of the receiving MR device perform spatial scanning and depth perception of the client's real world, so that the rendered customer service object can be positioned into the client's physical space through the spatial anchor point positioning function; the presentation is then adjusted dynamically in real time according to the client's changing pose, improving the stability of the presented result. In practical applications, the spatial anchor point positioning function may also be implemented on the server side, without limitation.
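A small sketch of anchor-based placement (the matrix names and the identity model-to-anchor offset are assumptions for illustration):

```python
import numpy as np

def place_with_anchor(model_vertices, T_world_anchor, T_anchor_model=None):
    """model_vertices: (N, 3) vertices in model coordinates; T_world_anchor:
    4x4 pose of the spatial anchor recovered by the MR device's scan of the
    client's room; T_anchor_model: optional 4x4 offset of the model relative
    to the anchor. Because the anchor is fixed in the room rather than to the
    headset, the projected image stays stable as the client's pose changes."""
    if T_anchor_model is None:
        T_anchor_model = np.eye(4)
    T = T_world_anchor @ T_anchor_model
    homo = np.hstack([model_vertices, np.ones((len(model_vertices), 1))])
    return (T @ homo.T).T[:, :3]   # vertices in the client's world space
```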
In this example, the real-time remote communication scene between the bank customer service agent and the customer involves a client side, a customer service side and a server side.
With reference to the real-time remote interaction processing logic shown in FIG. 4, the workflow of each end is as follows:
Customer-service-side workflow:
11) Each group of acquisition devices collects information on the target object (the customer service agent), including a depth map, an RGB map, spatial audio, the scene background and the like;
12) After the information is collected, the customer-service-side PC performs real-time depth estimation and extracts the foreground image of the target object from the scene, facilitating targeted 3D reconstruction of the target object and compressed data transmission;
13) Depth data fusion and real-time 3D reconstruction of the target object are performed according to the depth estimation result and the foreground image of the target object, generating a mesh model;
14) Audio is collected; the collected data includes the audio and the head pose (spatial coordinates and orientation) of the target object, facilitating audio synchronization with the restored 3D model of the target object;
15) The reconstructed mesh data, RGB data and audio data are compressed and then sent to the server.
Server-side workflow:
21) The server's data access module receives the data sent by the customer service end;
22) For the reconstructed 3D model mesh, color and texture mapping is performed using the corresponding RGB data, so that the surface of the reconstructed model carries texture and color;
23) Based on the server's high-performance concurrent GPU processing capability, the reconstructed target object is remotely rendered and sent to the designated receiving end through the distribution module.
Client-side (receiving end) workflow:
The client uses MR glasses such as HoloLens as the receiving end. Wearing the HoloLens, the customer sees the remotely rendered target object model and spatial audio projected, through spatial anchor point positioning, into the customer's own physical space, so that a real-time 3D customer service image appears in the customer's real world. With both parties wearing HoloLens, they can communicate with each other, and the customer feels as if the customer service agent were present in the same physical space in person.
Corresponding to the above information interaction method, the application further provides an information interaction device, whose composition is shown in FIG. 5. The device includes:
a first acquisition unit 10, configured to acquire three-dimensional object information corresponding to a target object in a current pose;
a second acquisition unit 20, configured to acquire a target reference model determined from a preset reference model set and matching the current pose of the target object, or to acquire a variant of the target reference model, wherein the reference models in the reference model set are models obtained in advance by three-dimensional object modeling of reference objects in different poses;
a determining unit 30, configured to determine, according to the three-dimensional object information, deformation information of the target object in the current pose relative to the target reference model or the variant thereof;
and a model generation and projection processing unit 40, configured to generate a three-dimensional object model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the deformation information, so as to project, based on the three-dimensional object model, a three-dimensional image of the target object in the current pose into the physical space where the interaction peer is located.
In an embodiment, the first acquisition unit 10 is specifically configured to:
acquire three-dimensional object information of the target object in the current pose, collected from the scene where the target object is located using multiple types of acquisition devices;
wherein the three-dimensional object information includes: at least part of the color images, depth images and texture information respectively corresponding to the target object at different view angles of the scene in the current pose.
In an embodiment, the determining unit 30 is specifically configured to:
determine, according to the color images and depth images respectively corresponding to the target object at different view angles of the scene in the current pose, spatial position information of each pixel point on the images acquired in the current pose;
determine, according to the spatial position information of each pixel point, deformation parameter information of the non-rigid deformation of the target object in the current pose relative to the target reference model or the variant thereof;
wherein the deformation parameter information of the non-rigid deformation includes: position increment information corresponding to each point on the target object whose position has changed relative to the corresponding point of the target reference model or the variant thereof; and the variant of the target reference model is a model frame obtained by historical three-dimensional modeling of the target object based on the target reference model and a historical pose of the target object within the same period as the current pose.
In an embodiment, when determining the spatial position information of each pixel point on the images acquired in the current pose according to the color images and depth images respectively corresponding to the target object at different view angles of the scene, the determining unit 30 is specifically configured to:
determine depth data of different pixel points on each color image at different view angles according to the depth images and texture information respectively corresponding to the target object at the different view angles of the scene in the current pose;
and determine, according to the depth data and the plane position information of the different pixel points on each color image, the spatial position information corresponding to each pixel point on each color image in the current pose.
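As an illustrative aid (the helper below is a standard pinhole-model sketch under our own naming, with an assumed intrinsic matrix K, not code from the application), the spatial position of every pixel can be recovered from its plane position and depth value as follows:

```python
import numpy as np

def backproject(depth, K):
    """depth: (H, W) depth map; K: 3x3 camera intrinsic matrix.
    Returns an (H, W, 3) map of spatial positions in camera coordinates,
    combining each pixel's plane position (u, v) with its depth value."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    x = (u - K[0, 2]) * depth / K[0, 0]
    y = (v - K[1, 2]) * depth / K[1, 1]
    return np.stack([x, y, depth], axis=-1)
```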
In an embodiment, when generating the three-dimensional object model, the model generation and projection processing unit 40 is specifically configured to:
generate a three-dimensional mesh model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the position increment information corresponding to each point on the target object whose position has changed relative to the corresponding point of the reference model or the variant thereof;
and perform color and/or texture mapping and rendering on the three-dimensional model frame corresponding to the three-dimensional mesh model according to the multi-view color images and/or texture information, to obtain the three-dimensional object model corresponding to the target object in the current pose.
In an embodiment, before generating the three-dimensional object model corresponding to the target object in the current pose, the model generation and projection processing unit 40 is further configured to:
acquire an environment background image corresponding to the target object in the current pose and a fused image containing the environment background and the target object;
and determine a foreground image of the target object according to the environment background image and the fused image, so as to generate for the target object, based on the foreground image, a three-dimensional object model matching the foreground view angle of the target object, and correspondingly orient the foreground portion of the target object toward the interaction peer when the three-dimensional image of the target object is projected to the interaction peer.
In an embodiment, the device further includes:
an audio synchronization unit, configured to: acquire synchronously collected audio data of the target object in the current pose, and, when the three-dimensional image of the target object is projected, synchronize the sound information of the target object to the interaction peer based on the audio data;
wherein the audio data includes an audio stream and a pose stream, the pose stream representing the head pose of the target object, so that sound information of the target object with spatial sound effects is output synchronously at the interaction peer based on the audio stream and the pose stream.
In an embodiment, the target object is a customer service agent, and the processing logic of the information interaction device is executed by a customer-service-side processing device and a server;
the three-dimensional mesh model of the target object is generated at the customer-service-side processing device, and the data to be transmitted is sent to the server for processing; the data to be transmitted includes: at least part of the color images, texture information and foreground image corresponding to the target object at multiple view angles, together with the three-dimensional mesh model and the synchronously collected audio data;
and the server renders the three-dimensional model frame corresponding to the three-dimensional mesh model based on at least part of the received color images, texture information and foreground image, projects the three-dimensional image corresponding to the target object in the current pose into the physical space where the interaction peer is located based on the rendered three-dimensional object model, and synchronizes the sound information of the target object to the interaction peer based on the received audio data.
In an embodiment, before transmitting data to the server, the customer-service-side processing device further renders the foreground portion of the three-dimensional model frame corresponding to the three-dimensional mesh model based on the foreground map of the target object, and/or compresses the data to be transmitted;
wherein the compression at least includes: determining, for each vertex of the three-dimensional mesh model of the target object, the data repeated at the current time-series point relative to the historical time-series point, and eliminating the repeated data;
and the server, for the missing vertex data produced by eliminating the repeated data, retrieves the data from the three-dimensional mesh model of the corresponding historical time-series point.
In an embodiment, when projecting the three-dimensional image corresponding to the target object in the current pose into the physical space where the interaction peer is located, the model generation and projection processing unit 40 is specifically configured to:
acquire physical space information of the interacting object;
and perform spatial anchor point positioning according to the physical space information, so as to project the three-dimensional image of the target object corresponding to the three-dimensional object model in the current pose according to the anchor point obtained by positioning.
Since the information interaction device provided by the embodiments of the application corresponds to the information interaction method provided by the method embodiments, its description is relatively brief; for related similarities, refer to the descriptions of the method embodiments, which are not repeated here.
In addition, an embodiment of the application further provides an information interaction system, including:
a client; and
a server capable of information interaction with the client;
wherein, during information interaction with the client, the server projects the three-dimensional image corresponding to the target object in the current pose into the physical space where the client is located by executing the information interaction method described above.
It should be noted that the embodiments in this specification are described in a progressive manner, each embodiment focusing on its differences from the other embodiments; for identical or similar parts, the embodiments may be referred to one another.
For convenience of description, the above system or device is described with its functions divided into various modules or units. Of course, when implementing the application, the functions of the units may be realized in one or more pieces of software and/or hardware.
From the above description of the embodiments, it will be apparent to those skilled in the art that the application may be implemented by software plus a necessary general hardware platform. Based on this understanding, the technical solution of the application, in essence or in the part contributing to the prior art, may be embodied in the form of a software product stored in a storage medium such as ROM/RAM, a magnetic disk or an optical disc, including several instructions that cause a computer device (which may be a personal computer, a server, a network device, etc.) to execute the methods described in the embodiments, or in certain parts of the embodiments, of the application.
Finally, it is further noted that relational terms such as first, second, third and fourth are used solely to distinguish one entity or action from another and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, article or apparatus comprising a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to it. Without further limitation, an element introduced by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article or apparatus that comprises the element.
The foregoing is merely a preferred embodiment of the application. It should be noted that those skilled in the art may make several improvements and modifications without departing from the principles of the application, and such improvements and modifications shall also be regarded as falling within the protection scope of the application.

Claims (12)

1. An information interaction method, comprising:
acquiring three-dimensional object information corresponding to a target object in a current pose;
acquiring a target reference model determined from a preset reference model set and matching the current pose of the target object, or acquiring a variant of the target reference model, wherein the reference models in the reference model set are models obtained in advance by three-dimensional object modeling of reference objects in different poses;
determining, according to the three-dimensional object information, deformation information of the target object in the current pose relative to the target reference model or the variant thereof;
and generating a three-dimensional object model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the deformation information, so as to project, based on the three-dimensional object model, a three-dimensional image of the target object in the current pose into a physical space where an interaction peer is located.
2. The method according to claim 1, wherein the acquiring of the three-dimensional object information corresponding to the target object in the current pose comprises:
acquiring three-dimensional object information of the target object in the current pose, collected from the scene where the target object is located using multiple types of acquisition devices;
wherein the three-dimensional object information comprises: at least part of the color images, depth images and texture information respectively corresponding to the target object at different view angles of the scene in the current pose.
3. The method according to claim 2, wherein the determining, according to the three-dimensional object information, of the deformation information of the target object in the current pose relative to the target reference model or the variant thereof comprises:
determining, according to the color images and depth images respectively corresponding to the target object at different view angles of the scene in the current pose, spatial position information of each pixel point on the images acquired in the current pose;
determining, according to the spatial position information of each pixel point, deformation parameter information of the non-rigid deformation of the target object in the current pose relative to the target reference model or the variant thereof;
wherein the deformation parameter information of the non-rigid deformation comprises: position increment information corresponding to each point on the target object whose position has changed relative to the corresponding point of the target reference model or the variant thereof; and the variant of the target reference model is a model frame obtained by historical three-dimensional modeling of the target object based on the target reference model and a historical pose of the target object within the same period as the current pose.
4. The method according to claim 3, wherein the determining of the spatial position information of each pixel point on the images acquired in the current pose according to the color images and depth images respectively corresponding to the target object at different view angles of the scene comprises:
determining depth data of different pixel points on each color image at different view angles according to the depth images and texture information respectively corresponding to the target object at the different view angles of the scene in the current pose;
and determining, according to the depth data and the plane position information of the different pixel points on each color image, the spatial position information corresponding to each pixel point on each color image in the current pose.
5. The method according to claim 3, wherein the generating of the three-dimensional object model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the deformation information comprises:
generating a three-dimensional mesh model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the position increment information corresponding to each point on the target object whose position has changed relative to the corresponding point of the reference model or the variant thereof;
and performing color and/or texture mapping and rendering on the three-dimensional model frame corresponding to the three-dimensional mesh model according to the multi-view color images and/or texture information, to obtain the three-dimensional object model corresponding to the target object in the current pose.
6. The method according to claim 1, further comprising, before generating the three-dimensional object model corresponding to the target object in the current pose:
acquiring an environment background image corresponding to the target object in the current pose and a fused image containing the environment background and the target object;
and determining a foreground image of the target object according to the environment background image and the fused image, so as to generate for the target object, based on the foreground image, a three-dimensional object model matching the foreground view angle of the target object, and correspondingly orient the foreground portion of the target object toward the interaction peer when the three-dimensional image of the target object is projected to the interaction peer.
7. The method according to claim 1, further comprising:
acquiring synchronously collected audio data of the target object in the current pose;
when the three-dimensional image of the target object is projected, synchronizing sound information of the target object to the interaction peer based on the audio data;
wherein the audio data comprises an audio stream and a pose stream, the pose stream representing the head pose of the target object, so that sound information of the target object with spatial sound effects is output synchronously at the interaction peer based on the audio stream and the pose stream.
8. The method according to claim 1, wherein the target object is a customer service agent, and the information interaction method is executed by a customer-service-side processing device and a server;
the three-dimensional mesh model of the target object is generated at the customer-service-side processing device, and the data to be transmitted is sent to the server for processing; the data to be transmitted comprises: at least part of the color images, texture information and foreground image corresponding to the target object at multiple view angles, together with the three-dimensional mesh model and the synchronously collected audio data;
and the server renders the three-dimensional model frame corresponding to the three-dimensional mesh model based on at least part of the received color images, texture information and foreground image, projects the three-dimensional image corresponding to the target object in the current pose into the physical space where the interaction peer is located based on the rendered three-dimensional object model, and synchronizes the sound information of the target object to the interaction peer based on the received audio data.
9. The method according to claim 8, wherein before transmitting data to the server, the customer-service-side processing device further renders the foreground portion of the three-dimensional model frame corresponding to the three-dimensional mesh model based on the foreground map of the target object, and/or compresses the data to be transmitted;
wherein the compression at least comprises: determining, for each vertex of the three-dimensional mesh model of the target object, the data repeated at the current time-series point relative to the historical time-series point, and eliminating the repeated data;
and the server, for the missing vertex data produced by eliminating the repeated data, retrieves the data from the three-dimensional mesh model of the corresponding historical time-series point.
10. The method according to claim 1, wherein the projecting of the three-dimensional image corresponding to the target object in the current pose into the physical space where the interaction peer is located comprises:
acquiring physical space information of the interacting object;
and performing spatial anchor point positioning according to the physical space information, so as to project the three-dimensional image of the target object corresponding to the three-dimensional object model in the current pose according to the anchor point obtained by positioning.
11. An information interaction device, comprising:
a first acquisition unit, configured to acquire three-dimensional object information corresponding to a target object in a current pose;
a second acquisition unit, configured to acquire a target reference model determined from a preset reference model set and matching the current pose of the target object, or to acquire a variant of the target reference model, wherein the reference models in the reference model set are models obtained in advance by three-dimensional object modeling of reference objects in different poses;
a determining unit, configured to determine, according to the three-dimensional object information, deformation information of the target object in the current pose relative to the target reference model or the variant thereof;
and a model generation and projection processing unit, configured to generate a three-dimensional object model corresponding to the target object in the current pose according to the target reference model or the variant thereof and the deformation information, so as to project, based on the three-dimensional object model, a three-dimensional image of the target object in the current pose into the physical space where the interaction peer is located.
12. An information interaction system, comprising:
a client; and
a server capable of information interaction with the client;
wherein, during information interaction with the client, the server projects the three-dimensional image corresponding to the target object in the current pose into the physical space where the client is located by executing the method according to any one of claims 1 to 10.
CN202310916000.5A 2023-07-25 2023-07-25 Information interaction method, device and system Pending CN116931733A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310916000.5A CN116931733A (en) 2023-07-25 2023-07-25 Information interaction method, device and system


Publications (1)

Publication Number Publication Date
CN116931733A true CN116931733A (en) 2023-10-24

Family

ID=88390573

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310916000.5A Pending CN116931733A (en) 2023-07-25 2023-07-25 Information interaction method, device and system

Country Status (1)

Country Link
CN (1) CN116931733A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination