CN112379812B - Simulation 3D digital human interaction method and device, electronic equipment and storage medium - Google Patents


Publication number
CN112379812B
CN112379812B (application CN202110019675.0A)
Authority
CN
China
Prior art keywords
target
image
digital human
user
display screen
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110019675.0A
Other languages
Chinese (zh)
Other versions
CN112379812A (en)
Inventor
杨国基
陈泷翔
王鑫宇
刘云峰
吴悦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhuiyi Technology Co Ltd
Original Assignee
Shenzhen Zhuiyi Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhuiyi Technology Co Ltd
Priority to CN202110019675.0A
Publication of CN112379812A
Application granted
Publication of CN112379812B
Priority to PCT/CN2021/123815 (published as WO2022148083A1)
Legal status: Active
Anticipated expiration

Classifications

    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 — Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 — Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/048 — Interaction techniques based on graphical user interfaces [GUI]
    • G06F 3/0481 — Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F 3/04815 — Interaction with a metaphor-based environment or interaction object displayed as three-dimensional, e.g. changing the user viewpoint with respect to the environment or object
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 19/00 — Manipulating 3D models or images for computer graphics
    • G06T 19/006 — Mixed reality
    • G — PHYSICS
    • G06 — COMPUTING; CALCULATING OR COUNTING
    • G06T — IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 — Image analysis
    • G06T 7/70 — Determining position or orientation of objects or cameras

Abstract

The embodiment of the application provides a method and a device for simulating 3D digital human interaction, an electronic device and a storage medium, relating to the technical field of human-computer interaction. The method comprises the following steps: acquiring scene data collected by an acquisition device; if it is determined from the scene data that a target user exists in the scene, processing the scene data to acquire a first relative position between the target user and the display screen; if the target user is located in a preset area, acquiring a target simulated digital human image corresponding to the first relative position based on a preset simulated digital human model, wherein the target simulated digital human image contains a simulated digital human whose face is oriented toward the target user, and the preset area is the area whose distance from the display screen is smaller than a preset value; and displaying the target simulated digital human image on the display screen. According to the embodiment of the application, a vivid digital human image with its face oriented toward the target user can be generated according to the position of the target user, thereby realizing a 3D stereoscopic visual effect and improving the user's interaction experience.

Description

Simulation 3D digital human interaction method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of human-computer interaction technologies, and in particular, to a method and an apparatus for simulating 3D digital human interaction, an electronic device, and a storage medium.
Background
In recent years, with the progress of science and technology, intelligent human-computer interaction has gradually become a research focus both domestically and abroad. Some intelligent devices or applications are provided with an avatar so as to interact with the user visually through the avatar, thereby improving the user's human-computer interaction experience. However, in most current scenes the avatar-based human-computer interaction mode is limited: the displayed avatar cannot change with the user's position and cannot simulate the interaction state between people in a real environment, so the user's interaction experience is poor.
Disclosure of Invention
In view of the above problems, the present application provides a method, an apparatus, an electronic device and a storage medium for simulating 3D digital human interaction, which can solve the above problems.
In a first aspect, an embodiment of the present application provides a method for simulating 3D digital human interaction, the method comprising: acquiring scene data collected by an acquisition device; if it is determined from the scene data that a target user exists in the scene, processing the scene data to acquire the relative position of the target user and the display screen; if the target user is located in a preset area, acquiring a target simulated digital person image corresponding to the relative position based on a preset simulated digital person model, wherein the target simulated digital person image contains a simulated digital person whose face is oriented toward the target user, and the preset area is the area whose distance from the display screen is smaller than a preset value; and displaying the target simulated digital person image on the display screen.
Optionally, the preset simulated digital human model is a model obtained by training in advance according to a plurality of sample images including a real model and reference parameters corresponding to each of the sample images, the simulated digital human model is configured to output simulated digital human images corresponding to the sample images according to input reference parameters, and the target simulated digital human images corresponding to the relative positions are obtained based on the preset simulated digital human model, including: according to the relative position, determining a target reference parameter in a plurality of preset reference parameters, wherein the reference parameter is used for representing the pose of the real human model contained in the sample image relative to an image acquisition device for acquiring the sample image; and inputting the target reference parameter into the preset simulation digital human model, and taking the output simulation digital human image as the target simulation digital human image.
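The selection of a target reference parameter among the preset parameters can be illustrated with a minimal sketch. The patent does not fix a concrete parameterization, so the assumption below — that each preset reference parameter records the yaw angle (in degrees) at which the real-person model was captured relative to the image acquisition device — is purely hypothetical:

```python
# Hypothetical preset reference parameters: the yaw angle (degrees) at
# which the real-person model was captured for each group of sample images.
PRESET_REFERENCE_PARAMS = [-60, -45, -30, -15, 0, 15, 30, 45, 60]

def select_target_reference(relative_angle):
    """Pick the preset reference parameter whose capture angle is
    closest to the user's relative angle to the display screen."""
    return min(PRESET_REFERENCE_PARAMS, key=lambda p: abs(p - relative_angle))
```

Under this assumption, a user standing about 20 degrees to one side would be served the image generated from the 15-degree sample pose, so the simulated digital person appears to turn its face toward the user.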
Optionally, the determining a target reference parameter in a plurality of preset reference parameters according to the relative position includes: determining a user visual angle parameter according to the relative position, wherein the user visual angle parameter is used for representing the visual angle of the target user towards the preset position of the display screen; and determining the target reference parameter according to the user view angle parameter in the preset multiple reference parameters.
Optionally, the determining the user perspective parameter according to the relative position includes: determining a target display position of the display screen according to the relative position, wherein the target display position is the display position of the target simulation digital human image on the display screen; and determining the user view angle parameter according to the relative position and the target display position.
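The viewing-angle computation described above can be sketched with simple geometry. Representing the "user view angle parameter" as a horizontal/vertical angle pair, and the coordinate conventions below, are illustrative assumptions, as the patent leaves the exact parameterization open:

```python
import math

def user_view_angle(user_pos, display_pos):
    """Viewing angles (degrees) from the user's position to the point on
    the screen where the simulated digital person is displayed.

    user_pos: (x, y, z) in the screen coordinate system, where z is the
    distance from the screen plane; display_pos: (x, y) on the screen
    surface (z = 0). Returns (horizontal, vertical) angles.
    """
    dx = user_pos[0] - display_pos[0]
    dy = user_pos[1] - display_pos[1]
    dz = user_pos[2]
    horizontal = math.degrees(math.atan2(dx, dz))
    vertical = math.degrees(math.atan2(dy, dz))
    return horizontal, vertical
```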
Optionally, before the acquiring the target simulated digital human image corresponding to the relative position based on the preset simulated digital human model, the method further includes: acquiring interactive information; processing the interactive information to obtain response voice information; the inputting the target reference parameter into the preset simulation digital human model and the taking the output simulation digital human image as the target simulation digital human image comprise: inputting the target reference parameters and the response voice information into the preset simulation digital human model to obtain an output image sequence, wherein the image sequence is composed of a plurality of continuous target simulation digital human images; the displaying the target simulated digital human image on the display screen includes: and generating and outputting a video of the simulated digital person according to the image sequence, and synchronously playing the response voice information.
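One simple consistency condition behind synchronizing the image sequence with the response voice is that the number of generated frames must cover the voice's duration at the video's frame rate. A hedged sketch (the 25 fps default is an assumption, not a figure from the patent):

```python
import math

def frames_for_audio(audio_duration_s, fps=25):
    """Number of simulated digital person frames needed so that the
    generated video spans the same duration as the response voice."""
    return math.ceil(audio_duration_s * fps)
```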
Optionally, the simulating digital human model includes a feature generation model and an image generation model, and the inputting the target reference parameter and the response voice information into the preset simulating digital human model to obtain an output image sequence includes: inputting the target reference parameters into the feature generation model to obtain initial feature parameters, wherein the initial feature parameters are used for representing the form of the real human model corresponding to the sample image; adjusting at least one parameter of expression parameters, action parameters and mouth shape parameters of the initial characteristic parameters according to the response voice information to obtain a parameter sequence, wherein the parameter sequence comprises a plurality of target characteristic parameters; and acquiring a target simulation digital human image corresponding to each target characteristic parameter based on the image generation model so as to obtain the image sequence corresponding to the parameter sequence.
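The two-stage pipeline above (feature generation model followed by image generation model) can be sketched as follows. The trained networks are stubbed as plain callables, and driving only the mouth-shape parameter from the audio is a simplifying assumption — the claim also allows expression and action parameters to be adjusted:

```python
def generate_image_sequence(target_ref_param, response_audio,
                            feature_model, audio_to_mouth, image_model):
    """Stage 1: the feature model maps the target reference parameter to
    initial feature parameters describing the real-person model's form.
    The mouth-shape parameter is then adjusted frame by frame from the
    response speech, and stage 2 renders one image per parameter set."""
    initial = feature_model(target_ref_param)

    parameter_sequence = []
    for audio_frame in response_audio:
        params = dict(initial)                    # copy initial features
        params["mouth"] = audio_to_mouth(audio_frame)
        parameter_sequence.append(params)

    return [image_model(p) for p in parameter_sequence]
```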
Optionally, the orientation angle of the simulated digital person in the target simulated digital person image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
Optionally, the physical features of the simulated digital person in the target simulated digital person image are the same as the physical features of the real person model in the sample image corresponding to the target reference parameter.
Optionally, the scene data includes a scene image, and if it is determined that a target user exists in the scene according to the scene data, processing the scene data to obtain the relative position between the target user and the display screen includes: judging whether the target user exists in the scene image; if so, identifying the scene image to acquire the three-dimensional coordinates of the target user in a camera coordinate system, wherein the camera coordinate system takes the position of the acquisition device as its origin; acquiring the positional relation between the acquisition device and the display screen, and determining the conversion relation between the camera coordinate system and a spatial coordinate system according to the positional relation, wherein the spatial coordinate system takes the position of the display screen as its origin; and determining the relative position of the target user and the display screen in the spatial coordinate system based on the conversion relation and the three-dimensional coordinates, the relative position including at least one of a relative distance and a relative angle.
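The coordinate conversion described above can be sketched as a rigid transform from the camera coordinate system into a screen-centered coordinate system. The rotation R and translation t encoding the positional relation between the acquisition device and the display screen are assumed to be known from calibration, and the axis conventions (screen normal along +z, horizontal axis along +x) are illustrative assumptions:

```python
import numpy as np

def to_screen_coords(p_cam, R, t):
    """Convert a 3D point from the camera coordinate system to the
    screen-centered spatial coordinate system via a rigid transform."""
    return R @ np.asarray(p_cam, dtype=float) + t

def relative_position(p_screen):
    """Relative distance and horizontal angle (degrees) of the user with
    respect to the screen origin, as at least one of the two quantities
    the claim names."""
    x, _, z = p_screen
    distance = float(np.linalg.norm(p_screen))
    angle = float(np.degrees(np.arctan2(x, z)))
    return distance, angle

# Example: camera mounted 0.1 m above the screen center, axes aligned.
R = np.eye(3)
t = np.array([0.0, -0.1, 0.0])
user_cam = [0.5, 0.1, 1.0]   # user detected 1 m in front, 0.5 m to the side
dist, ang = relative_position(to_screen_coords(user_cam, R, t))
```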
Optionally, the scene data includes a scene image, and if it is determined that a target user exists in the scene according to the scene data, the processing the scene data to obtain a relative position between the target user and the display screen includes: identifying head information in the scene image; acquiring the number of users in the scene image according to the head information; if the number of the users is one, taking the identified user as the target user; and processing the scene image to acquire the relative position of the target user and the display screen.
Optionally, if the number of the users is multiple, monitoring whether an interactive instruction input by the user is acquired; and if the interactive instruction input by the user is acquired, taking the user corresponding to the interactive instruction as the target user.
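The target-user selection logic in the two claims above can be sketched as a single function. The structure of the head detections and of the interaction instruction is hypothetical:

```python
def select_target_user(detected_heads, interaction_instruction=None):
    """Choose the target user from the heads detected in the scene image.

    With exactly one user, that user is the target. With multiple users,
    wait for an interaction instruction and pick the user who issued it.
    Returns None while no target can be decided.
    """
    if len(detected_heads) == 1:
        return detected_heads[0]
    if len(detected_heads) > 1 and interaction_instruction is not None:
        return interaction_instruction["user"]
    return None
```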
Optionally, the scene data is data collected in real time, and after the target simulated digital human image is displayed on the display screen, the method further includes: if the relative position is detected to change, generating a new target simulation digital human image according to the changed relative position; displaying the new target simulated digital human image on the display screen.
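The real-time update behavior above can be sketched as a loop over a stream of relative positions that re-renders the simulated digital person only when the position changes beyond a threshold (the threshold value is an illustrative assumption):

```python
def track_and_render(position_stream, render_digital_human, threshold=0.05):
    """Consume (distance, angle) positions derived from real-time scene
    data; re-render whenever the relative position changes beyond
    `threshold`. Returns the number of renders performed."""
    last = None
    renders = 0
    for pos in position_stream:
        if last is None or max(abs(c - l) for c, l in zip(pos, last)) > threshold:
            render_digital_human(pos)   # generate and display a new image
            last = pos
            renders += 1
    return renders
```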
In a second aspect, an embodiment of the present application provides a simulated 3D digital human interaction device, which includes a data acquisition module, a position acquisition module, an image acquisition module, and a display module, wherein the data acquisition module is configured to acquire scene data acquired by an acquisition device; the position acquisition module is used for processing the scene data to acquire the relative position of the target user and the display screen if the target user exists in the scene according to the scene data; the image acquisition module is used for acquiring a target simulation digital person image corresponding to the relative position based on a preset simulation digital person model if the target user is located in a preset area, wherein the target simulation digital person image comprises a simulation digital person with the face facing the target user, and the preset area is an area with the distance from the display screen smaller than a preset value; and the display module is used for displaying the target simulation digital human image on the display screen.
Optionally, the preset simulated digital human model is a model obtained by training in advance according to a plurality of sample images containing a real-person model and the reference parameters corresponding to each sample image, and is used for outputting a simulated digital human image corresponding to a sample image according to the input reference parameters. The image acquisition module comprises a parameter determination submodule and a parameter input submodule, wherein the parameter determination submodule is used for determining a target reference parameter among a plurality of preset reference parameters according to the relative position, the reference parameters being used to characterize the pose of the real-person model contained in the sample image relative to the image acquisition device that captured the sample image, and the parameter input submodule is used for inputting the target reference parameter into the preset simulated digital human model and taking the output simulated digital human image as the target simulated digital human image.
Optionally, the parameter determining submodule includes a first parameter determining unit and a second parameter determining unit, where the first parameter determining unit is configured to determine a user viewing angle parameter according to the relative position, and the user viewing angle parameter is used to represent a viewing angle of the target user toward a preset position of the display screen; the second parameter determining unit is configured to determine the target reference parameter according to the user perspective parameter in the preset multiple reference parameters.
Optionally, the first parameter determining unit includes a position determining subunit and a viewing angle parameter determining subunit, where the position determining subunit is configured to determine a target display position of the display screen according to the relative position, and the target display position is a display position of the target simulated digital human image on the display screen; and the visual angle parameter determining subunit is configured to determine the user visual angle parameter according to the relative position and the target display position.
Optionally, the simulation 3D digital human interaction device further includes an interaction information obtaining module and a voice information obtaining module, the interaction information obtaining module is configured to obtain interaction information, the voice information obtaining module is configured to process the interaction information to obtain response voice information, the parameter input sub-module includes an image sequence obtaining unit, the image sequence obtaining unit is configured to input the target reference parameter and the response voice information into the preset simulation digital human model to obtain an output image sequence, the image sequence is formed by multiple continuous frames of the target simulation digital human images, the display module includes a video output unit, and the video output unit is configured to generate and output a video of a simulation digital human according to the image sequence, and synchronously play the response voice information.
Optionally, the simulated digital human model includes a feature generation model and an image generation model, and the image sequence obtaining unit includes an initial feature parameter obtaining subunit, a parameter sequence obtaining subunit and an image sequence obtaining subunit, where the initial feature parameter obtaining subunit is configured to input the target reference parameter into the feature generation model to obtain an initial feature parameter, and the initial feature parameter is used to characterize a form of the real human model corresponding to the sample image; the parameter sequence acquiring subunit is configured to adjust at least one of an expression parameter, an action parameter, and a mouth shape parameter of the initial feature parameter according to the response voice information to obtain a parameter sequence, where the parameter sequence includes a plurality of target feature parameters; the image sequence obtaining subunit is configured to obtain, based on the image generation model, a target simulated digital human image corresponding to each target characteristic parameter, so as to obtain the image sequence corresponding to the parameter sequence.
Optionally, the orientation angle of the simulated digital person in the target simulated digital person image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
Optionally, the physical features of the simulated digital person in the target simulated digital person image are the same as the physical features of the real person model in the sample image corresponding to the target reference parameter.
Optionally, the position obtaining module includes a judging submodule, a coordinate obtaining submodule, a conversion relation determining submodule and a position determining submodule, the judging submodule is configured to judge whether the target user exists in the scene image, and the coordinate obtaining submodule is configured to identify the scene image to obtain a three-dimensional coordinate of the target user in a camera coordinate system if the target user exists in the scene image, where the camera coordinate system uses the position of the collecting device as an origin; the conversion relation determining submodule is used for acquiring the position relation between the acquisition device and the display screen and determining the conversion relation between the camera coordinate system and a space coordinate system according to the position relation, wherein the space coordinate system takes the position of the display screen as an origin; the position determining submodule is configured to determine the relative position of the target user and the display screen in the spatial coordinate system based on the conversion relation and the three-dimensional coordinate, where the relative position includes at least one of a relative distance and a relative angle.
Optionally, the position obtaining module includes an image recognition sub-module, a user number obtaining sub-module, and a first processing sub-module, where the image recognition sub-module is configured to identify head information in the scene image, the user number obtaining sub-module is configured to obtain the number of users in the scene image according to the head information, and the first processing sub-module is configured to, if the number of users is one, take the identified user as the target user.
Optionally, the simulation 3D digital human interaction device further includes an instruction monitoring sub-module and a second processing sub-module, where the instruction monitoring sub-module is configured to monitor whether an interaction instruction input by a user is acquired if the number of users is multiple; and the second processing sub-module is used for taking the user corresponding to the interactive instruction as the target user if the interactive instruction input by the user is obtained.
Optionally, the scene data is data acquired in real time, and after the target simulated digital human image is displayed on the display screen, the simulated 3D digital human interaction device further includes a position detection module and a display update module, where the position detection module is configured to generate a new target simulated digital human image according to a changed relative position if the change of the relative position is detected; the display updating module is used for displaying the new target simulation digital human image on the display screen.
In a third aspect, an embodiment of the present application provides an electronic device, including: one or more processors; a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of the first aspect described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, in which program codes are stored, and the program codes can be called by a processor to execute the method according to the first aspect.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; those skilled in the art can obtain other drawings based on these drawings without creative effort.
FIG. 1 is a schematic diagram of an application environment suitable for use in embodiments of the present application;
FIG. 2 is a flow diagram illustrating a method for simulating 3D digital human interaction provided by an embodiment of the present application;
FIG. 3 illustrates a flow diagram of a method for simulating 3D digital human interaction provided by yet another embodiment of the present application;
FIG. 4 shows a schematic flow diagram of a method for simulating 3D digital human interaction provided by another embodiment of the present application;
FIG. 5 shows a schematic flow diagram of a method for simulating 3D digital human interaction provided by yet another embodiment of the present application;
FIG. 6 is a flow chart illustrating a method for simulating 3D digital human interaction according to still another embodiment of the present application;
FIG. 7 shows a schematic flow diagram of a method of simulating 3D digital human interaction provided by yet another embodiment of the present application;
FIG. 8 shows a schematic flow diagram of a method for simulating 3D digital human interaction provided by yet another embodiment of the present application;
FIG. 9 is a block diagram illustrating the structure of an emulated 3D digital human interaction device provided by an embodiment of the present application;
FIG. 10 is a block diagram illustrating an electronic device for performing a method of simulating 3D digital human interaction according to an embodiment of the present application;
FIG. 11 illustrates a storage unit for storing or carrying program code implementing the method for simulating 3D digital human interaction according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Definition of terms
3D digital human: a digital human realized by computer graphics technologies such as 3D modeling and rendering.
Simulated digital person: a digital person generated by a deep learning model, in which the image quality of each frame is similar to that of camera footage, so that the digital person looks like a real person shot by a camera. Optionally, a video digital person may be generated from a sequence of coherent realistic images.
Simulated 3D digital person: a digital person generated by the simulated digital person technique that achieves a stereoscopic, lifelike effect by taking into account the spatial positions and display viewing angles of the digital person and the audience. Optionally, a stereoscopically realistic video digital person may be generated from a sequence of multiple simulated digital person images.
At present, in most interactive scenes using avatars, avatars generated from real-person models, namely digital persons, are generally used for interaction in order to improve the fidelity of the rendered picture. Moreover, some actions can be designed in advance for the digital person and matched with interactive voice or text content to improve the user's viewing experience. Although associating actions with interactive content enables the digital person's action pose to approximate the real-person model, this approach only coordinates content with actions and establishes no connection between the user's position and the digital person. In practice, if the user stands off to one side of the screen playing the digital person's picture, that is, at a relatively large angle with respect to the screen center, the digital person on the screen still remains in a fixed position facing directly out of the screen. In real life, however, people usually communicate face to face, so such an orientation of the digital person is clearly inconsistent with the interaction state between people in a real environment. Therefore, the prior-art presentation of digital persons does not deeply consider the user's behavior, which results in low fidelity of the presented picture, unnatural interaction, and poor user interaction experience.
Although a 3D digital person can present a stereoscopic visual effect, the 3D digital person is a digital person that is realized by computer graphics techniques such as 3D modeling and rendering, and the presented digital person effect is usually an animation effect and cannot achieve an effect like a camera shooting a real person.
To address the above problems, the present inventors studied how to take the user's behavior into greater account during interaction with the digital human, so as to achieve a natural, anthropomorphic interaction effect. On this basis, the inventors propose a method, an apparatus, an electronic device and a medium for simulating 3D digital human interaction, so that during human-computer interaction a simulated digital human whose face is oriented toward the target user can be displayed according to the user's position. The simulated digital human is not only as vivid as a real person shot by a camera, but can also simulate the effect of face-to-face communication between the user and a real-person model, thereby realizing anthropomorphic natural interaction and improving the user's interaction experience.
Referring to fig. 1, fig. 1 is a schematic diagram illustrating an application environment suitable for the embodiment of the present application. The method for simulating 3D digital human interaction provided by the embodiment of the application can be applied to an interactive system 10 shown in FIG. 1. The interactive system 10 comprises a terminal device 101 and a server 102. The server 102 and the terminal device 101 are connected through a wireless or wired network, so as to realize data transmission between the terminal device 101 and the server 102 based on the network connection, wherein the transmitted data includes but is not limited to audio, video, text, images and the like.
The server 102 may be an individual server, a server cluster, a server center formed by a plurality of servers, a local server, or a cloud server. Server 102 may be configured to provide a background service for a user, which may include, but is not limited to, an emulated 3D digital human interaction service, and the like.
In some embodiments, the terminal device 101 may be any of a variety of electronic devices having a display screen and supporting data input, including but not limited to smartphones, tablets, laptop computers, desktop computers, wearable electronic devices, and the like. Specifically, the data input may be voice input via a voice module, character input via a character input module, image input via an image input module, or video input via a video input module provided on the terminal device 101, or may be based on a gesture recognition module installed on the terminal device 101, allowing the user to employ interaction modes such as gesture input.
In some embodiments, a client application may be installed on the terminal device 101, and the user may communicate with the server 102 based on the client application (e.g., an APP). Specifically, the terminal device 101 may obtain the user's input information and communicate with the server 102 through the client application; the server 102 may process the received input information, return corresponding output information to the terminal device 101, and the terminal device 101 may then perform the operation corresponding to that output information. The user's input information may be voice information, screen-based touch operation information, gesture information, motion information, and the like, and the output information may be images, video, text, audio, and the like, which is not limited herein.
In some embodiments, the client application may provide human-computer interaction services based on simulated digital humans, which may vary with different scene requirements. For example, the client application may be used to provide product display information or service guidance to users in public areas such as shopping malls, banks, and exhibition halls, with different interactive services provided for different application scenarios.
In some embodiments, after acquiring the reply information corresponding to the information input by the user, the terminal device 101 may display the simulated digital human corresponding to the reply information on its display screen or on another image output device connected to it. The simulated digital human may be an image modeled on the appearance of the user or another person so as to resemble a real human, or a robot image with an animation effect, such as an animal-shaped or cartoon-character-shaped robot. As one approach, while the simulated digital human image is displayed, the audio corresponding to the image may be played through the speaker of the terminal device 101 or another audio output device connected to it, and the text or graphics corresponding to the reply information may also be displayed on the display screen of the terminal device 101, thereby realizing multi-modal interaction with the user across image, voice, text, and other channels.
In some embodiments, the means for processing the user's input information may also be disposed on the terminal device 101, so that the terminal device 101 can interact with the user and realize digital-human-based human-computer interaction without relying on communication with the server 102; in this case, the interactive system 10 may include only the terminal device 101.
The above application environments are only examples for facilitating understanding, and it is to be understood that the embodiments of the present application are not limited to the above application environments.
The following describes in detail a method, an apparatus, an electronic device, and a medium for simulating 3D digital human interaction provided by the embodiments of the present application with specific embodiments.
Referring to fig. 2, fig. 2 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S110 to S140.
Step S110: acquiring the scene data collected by the acquisition device.
The acquisition device can be a device arranged in the terminal equipment, and can also be a device connected with the terminal equipment. Wherein, the acquisition device can be an image acquisition device, an infrared sensor, a microphone, a laser ranging sensor and the like. Specifically, the image acquisition device may be a common camera, or may be a camera capable of acquiring spatial depth information, such as a binocular camera, a structured light camera, and a TOF camera. The infrared sensor may be a distance sensor or the like having an infrared function. In some embodiments, the image capture device may also automatically change the lens angle to capture images at different angles.
The acquisition device is used for acquiring scene data in a current scene, wherein the current scene is a scene where the terminal equipment is located currently. The scene data can be at least one of visual data, infrared data and sound data according to different types of the acquisition devices.
Step S120: if it is determined from the scene data that a target user exists in the scene, processing the scene data to acquire the relative position of the target user with respect to the display screen.
After the scene data is acquired, whether a target user exists in the scene or not can be judged by analyzing the scene data, and the target user is a real user in the scene. If the scene data is analyzed to determine that the target user exists in the scene, the scene data can be processed to obtain the relative position of the target user and the display screen.
The relative position is used for representing the position relationship between the target user and the display screen, and may include information such as a relative distance and a relative angle between the target user and the display screen. As a mode, the relative position may be a position relationship between a key point of the target user and a preset position on the display screen, where the key point may be an eye, a face center point, a limb part, and the like of the target user, and the position of the key point may be determined by image detection, sensor data processing, and the like, which is not limited herein; the preset position may be a center point of the display screen, a frame of the display screen, a display position for displaying the simulated digital human image, and the like, which is not limited herein.
Specifically, when the target user is determined to exist in the scene according to the scene data, the relative position information of the target user and the acquisition device can be acquired, and further the relative position of the target user and the display screen is determined according to the relative position information.
As one approach, the position relationship between the acquisition device and the display screen may be acquired, and the conversion relationship between the camera coordinate system and a spatial coordinate system determined from it, where the spatial coordinate system takes the position of the display screen as its origin; the relative position of the target user with respect to the display screen in the spatial coordinate system is then determined based on this conversion relationship and the target user's three-dimensional coordinates in the camera coordinate system. In this way, a more accurate relative position of the user with respect to the display screen can be obtained.
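As a concrete illustration of this coordinate conversion, the sketch below maps a user's three-dimensional coordinates from the camera coordinate system into a screen-centered spatial coordinate system and reduces them to a distance and a horizontal angle. The rotation `R`, translation `t`, and axis conventions are assumptions for illustration; in practice they would come from calibrating the acquisition device against the display screen.

```python
import numpy as np

# Sketch only: R and t describe the camera's pose relative to the display
# screen and would come from a one-time calibration (assumed here).

def camera_to_screen(p_cam, R, t):
    """Map a point from camera coordinates to screen coordinates."""
    return R @ np.asarray(p_cam, dtype=float) + t

def relative_position(p_screen):
    """Reduce screen-space coordinates to a distance and horizontal angle."""
    distance = float(np.linalg.norm(p_screen))
    # Angle in the horizontal (x-z) plane; 0 means directly in front.
    angle_deg = float(np.degrees(np.arctan2(p_screen[0], p_screen[2])))
    return distance, angle_deg

# Example: camera at the screen center with aligned axes (identity pose).
R = np.eye(3)
t = np.zeros(3)
p = camera_to_screen([1.0, 0.0, 1.0], R, t)  # user position in metres
dist, ang = relative_position(p)
```

With a non-trivial `R` and `t` (e.g. a camera mounted above the screen), the same two calls yield the screen-relative distance and angle used in the following steps.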
Alternatively, the relative position information of the target user and the acquisition device may be used as the relative position of the target user and the display screen. It can be understood that, when the acquisition device is a device built in the terminal device, or the acquisition device is connected with the terminal device and has a short distance, the relative position information and the relative position have a small difference, and the relative position information of the target user and the acquisition device can be used as the relative position of the target user and the display screen, so that the position relation between the acquisition device and the display screen does not need to be acquired in advance before use, the position of the acquisition device can be changed, and the flexibility is better.
It is understood that the acquired scene data is different according to the acquisition device.
As one way, when the scene data is the visual data collected by the image collecting device, the visual data may be analyzed to determine whether the target user exists in the scene. For example, whether the target user exists in the scene may be determined by means of face detection, pedestrian recognition, and the like. When the target user is determined to exist in the scene, the relative position information of the target user and the acquisition device can be further acquired through image ranging or depth image data analysis.
Alternatively, when the scene data is infrared data collected by an infrared sensor, whether a target user exists in the scene may be determined by analyzing the infrared data. Specifically, the infrared sensor emits infrared light, which is reflected when it encounters an obstacle; the sensor measures the intensity of the reflected light, which decreases as the distance to the obstacle increases. Therefore, whether a target user exists in the scene can be determined by analyzing the infrared data, and when a target user exists, the relative position information of the target user with respect to the acquisition device is further determined.
As still another way, when the scene data is sound data collected by a sound collection device such as a microphone, the sound data may be analyzed to determine whether a target user exists in the scene. Specifically, whether a target user exists in the current scene or not can be determined through modes such as human voice detection, and if the target user exists, the relative position information of the target user and the acquisition device can be further acquired through modes such as voice distance measurement.
In some embodiments, when it is determined from the scene data that no target user exists in the scene, a preset simulated digital human image in a to-be-awakened state may be displayed on the display screen. As one approach, the preset simulated digital human image in the to-be-awakened state may be a simulated digital human image whose face is directed straight ahead. As another approach, it may be a dynamically turning sequence of simulated digital human images, i.e., a dynamic simulated digital human video, to show that the simulated digital human can present different angles to the user; for example, a simulated digital human that turns dynamically from 15 degrees toward the left to 15 degrees toward the right. Optionally, the preset simulated digital human image or image sequence may also show the simulated digital human greeting passers-by to prompt interaction.
Step S130: if the target user is located in the preset area, acquiring the target simulated digital human image corresponding to the relative position based on the preset simulated digital human model.
The preset area is a preset area for interacting with a target user in the area. The preset area may be an area having a distance from the display screen smaller than a preset value. In some embodiments, the preset region may also be a region which is at a distance less than a preset value from the display screen and at an angle less than a preset angle from the display screen. Specifically, whether the target user exists in the preset area can be judged by comparing the relative position with a preset numerical value.
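The preset-area check described above can be sketched as a simple threshold test; the distance and angle thresholds below are illustrative assumptions, not values from this application.

```python
# Hypothetical thresholds: an area closer than 5 m to the display screen
# and within 60 degrees of the screen normal.
MAX_DISTANCE_M = 5.0
MAX_ANGLE_DEG = 60.0

def in_preset_area(distance_m, angle_deg,
                   max_distance=MAX_DISTANCE_M, max_angle=MAX_ANGLE_DEG):
    """True if the relative position lies inside the interaction area."""
    return distance_m < max_distance and abs(angle_deg) < max_angle

in_preset_area(3.2, -20.0)  # close and roughly facing the screen
in_preset_area(8.0, 10.0)   # too far away
```

Comparing the relative position against such preset values is all that is needed to decide whether the target user is inside the interaction area.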
As one approach, the region of the scene corresponding to the scene data may be used as the preset area; that is, the preset area is the region within which the acquisition device can collect scene data, and if it is determined from the scene data that a target user exists in the scene, the target user is deemed to be located in the preset area. In this way, active interaction can be initiated whenever a target user is detected, which suits areas with low foot traffic and few users.
As another way, the preset area is a smaller area than the area of the scene data. For example, when the acquisition device is a sensor that can acquire scene data within 10 meters from the display screen, the preset area may be an area less than 5 meters away from the display screen. In this way, when the target user is detected in the area with high people stream density and more users, the interaction intention of the user can be determined according to the distance between the user and the display screen.
By setting a preset area for interaction, whether to interact with a target user can be decided according to whether the user is located in the preset area. On the one hand, interaction can take place without the user being aware of any setup, making it more natural; on the other hand, in a multi-user scene, the user's interaction intention can be further determined from the preset area: a user inside the preset area is regarded as a user with interaction intention, so interaction is targeted accurately. For example, when the terminal device is a large-screen device in a company lobby and the preset area is the reception desk, there may be many users in the lobby during busy periods and the terminal device cannot know which user to interact with; when a user stands at the reception desk, that user can be taken as the target user for interaction.
In some embodiments, the preset simulated digital human model is a model obtained by training in advance according to a plurality of sample images including a real human model and a reference parameter corresponding to each sample image, and the simulated digital human model is used for outputting a simulated digital human image corresponding to a sample image according to an input reference parameter. Determining a target reference parameter in a plurality of preset reference parameters according to the relative position; and inputting the target reference parameters into a preset simulation digital human model, and taking the output simulation digital human image as a target simulation digital human image. The reference parameter can be used for characterizing the relative position of the real-person model contained in the sample image and the image acquisition device for acquiring the sample image, and the relative position can be a relative angle or a relative distance. Specifically, please refer to the following embodiments.
It can be understood that obtaining a stereoscopic 3D digital person through 3D modeling depends heavily on the modeler's prior experience: a 3D digital person close to a real person is achieved through a great deal of manual adjustment, and obtaining 3D digital persons for different models requires repeating the modeling process, consuming considerable labor cost. In contrast, the preset simulated digital human model is a deep learning model obtained through training; no 3D modeling is needed when obtaining the target simulated digital human image from it, the resulting simulated digital human is closer to the real-person model and more vivid, and the approach suits practical applications where simulated digital humans may need to be obtained for different real-person models.
In some embodiments, the scene data may include a scene image, and the head information in the scene image may be identified; acquiring the number of users in a scene image according to the head information; if the number of the users is one, the identified users are taken as target users; the scene image is processed to obtain the relative position of the target user and the display screen. Optionally, if the number of the users is multiple, monitoring whether an interactive instruction input by the user is acquired; and if the interactive instruction input by the user is acquired, taking the user corresponding to the interactive instruction as a target user. Specifically, please refer to the following embodiments.
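The selection logic just outlined — one detected head means that user is the target, while several heads wait for an interaction instruction — can be sketched as follows; the head detection itself is assumed to happen upstream, and the dictionary representation of a user is an illustrative assumption.

```python
# Hypothetical sketch of target-user selection from head detections.
# `heads` is assumed to be a list of users detected via head information;
# `interaction_user` is the user who issued an interaction instruction,
# if any was monitored.

def select_target_user(heads, interaction_user=None):
    """Return the target user, or None while the target is still ambiguous."""
    if len(heads) == 1:
        return heads[0]          # a single user is the target user
    if interaction_user is not None:
        return interaction_user  # instruction resolves a multi-user scene
    return None                  # keep monitoring for an instruction

select_target_user([{"id": 1}])                        # single user
select_target_user([{"id": 1}, {"id": 2}])             # ambiguous
select_target_user([{"id": 1}, {"id": 2}], {"id": 2})  # instruction given
```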
In some embodiments, interaction information may also be obtained; the interaction information is processed to obtain response voice information, and the target reference parameter and the response voice information are input into the preset simulated digital human model to obtain an output image sequence composed of multiple consecutive target simulated digital human images. Specifically, please refer to the following embodiments.
Step S140: displaying the target simulated digital human image on the display screen.
After the target simulated digital human image is acquired, it may be displayed at a display position on the display screen. The display screen may be the display screen of the terminal device or another image display device connected to the terminal device, and the display position may be a preset position for displaying the simulated digital human, or a display position determined from the relative position. Optionally, after the target simulated digital human is displayed, a human-computer interaction prompt may be given by voice or text to guide the user to interact further. For example, in a banking scenario, upon wake-up the interface may display: "May I ask what help you need? You can try asking me how to make a deposit."
In some embodiments, when an image sequence consisting of multiple consecutive frames of the target simulated digital human is acquired, a video containing the target simulated digital human may be generated from the image sequence and displayed on the display screen. For example, before the target user is detected, a preset simulated digital human whose face faces straight ahead may be displayed on the display screen; the target simulated digital human image acquired from the preset simulated digital human model for the relative position may then be a sequence of images in which the front-facing simulated digital human turns toward the orientation corresponding to the relative position. Synthesizing the corresponding video from this image sequence achieves a natural turning effect for the simulated digital human.
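The turning effect described above can be sketched by interpolating orientation angles between the idle front-facing pose and the target orientation, each angle then being rendered as one frame by the simulated digital human model; linear interpolation and the frame count are assumptions for illustration.

```python
# Hypothetical sketch: orientation angles for a natural turn from the
# front-facing idle pose (0 degrees) to the target angle. Each angle
# would drive one frame of the synthesized video.

def turn_sequence(target_angle_deg, frames=5, start_angle_deg=0.0):
    """Evenly interpolated orientation angles, start and end inclusive."""
    if frames < 2:
        return [target_angle_deg]
    step = (target_angle_deg - start_angle_deg) / (frames - 1)
    return [start_angle_deg + step * i for i in range(frames)]

turn_sequence(30.0, frames=4)  # front pose turning to 30 degrees
```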
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. Specifically, please refer to the following embodiments.
In some embodiments, it may be detected whether the target user leaves a preset area, and if the target user has left, a preset simulated digital human image in a state to be woken up is displayed.
It is understood that steps S120 and S130 may be performed locally by the terminal device, may also be performed in the server, and may also be performed by the terminal device and the server separately, and according to different practical application scenarios, tasks may be allocated according to requirements, which is not limited herein.
The method for simulated 3D digital human interaction provided by this embodiment acquires the scene data collected by the acquisition device; if it is determined from the scene data that a target user exists in the scene, processes the scene data to acquire the relative position of the target user with respect to the display screen; if the target user is located in the preset area, acquires the target simulated digital human image corresponding to the relative position based on the preset simulated digital human model; and displays the target simulated digital human image on the display screen. This simulates the interactive effect of face-to-face communication between the user and the simulated digital human, realizes anthropomorphic simulated digital human interaction, and improves the user's interactive experience.
Referring to fig. 3, fig. 3 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S210 to S250.
Step S210: acquiring the scene data collected by the acquisition device.
Step S220: if it is determined from the scene data that a target user exists in the scene, processing the scene data to acquire the relative position of the target user with respect to the display screen.
In some embodiments, the scene data may include a scene image, and the head information in the scene image may be identified; acquiring the number of users in a scene image according to the head information; if the number of the users is one, the identified users are taken as target users; the scene image is processed to obtain the relative position of the target user and the display screen. Optionally, if the number of the users is multiple, monitoring whether an interactive instruction input by the user is acquired; and if the interactive instruction input by the user is acquired, taking the user corresponding to the interactive instruction as a target user. Specifically, please refer to the following embodiments.
Step S230: if the target user is located in the preset area, determining a target reference parameter among a plurality of preset reference parameters according to the relative position.
The reference parameters are used for representing the pose of the real-person model included in a sample image adopted for training a preset simulated digital human model relative to an image acquisition device for acquiring the sample image, wherein the pose can comprise position information and pose information, and correspondingly, the reference parameters can comprise at least one of distance parameters and angle parameters of the real-person model and the image acquisition device. The distance parameter is used for representing the relative distance between the real-person model and the image acquisition device, namely position information, and the angle parameter is used for representing the relative angle between the real-person model and the image acquisition device, namely posture information.
It can be understood that, in practical application, the image acquisition device can be regarded as the eyes of the target user: the target simulated digital human image is generated from sample images of the real-person model acquired by the image acquisition device, so the effect is as if the real-person model were being watched through the image acquisition device. Although sample images corresponding to as many reference parameters as possible can be collected when training the simulated digital human model, so as to obtain simulated digital human images for different target user positions, in practical application the relative position of the target user and the display screen may not match any reference parameter exactly. Therefore, by determining the target reference parameter among the preset reference parameters according to the relative position, a simulated digital human pose closest to the current target user's position can be generated.
In some embodiments, a mapping relationship between the relative position and the plurality of preset reference parameters may be set, and the target reference parameter corresponding to the relative position determined according to this mapping relationship. On the one hand, this reduces the precision required of the relative position: as long as the approximate range of the relative position is determined, the 3D effect of the simulated digital human can be realized, which lowers the requirements on the acquisition device and the power consumed in processing the scene data to acquire the relative position. On the other hand, it reduces the number of sample images with different reference parameters required for training the preset simulated digital human model.
As one way, when the reference parameter includes a relative angle, an angle mapping relationship may be set in advance, and the target reference parameter corresponding to the relative position may be determined based on the angle mapping relationship. Specifically, the angle mapping relationship includes a plurality of angle intervals and an angle parameter corresponding to each angle interval, and the angle interval to which the relative position belongs may be determined by the angle mapping relationship, and the angle parameter corresponding to the angle interval is used as the target reference parameter.
As another mode, when the reference parameter includes a relative distance, a distance mapping relationship may be set in advance, and the target reference parameter corresponding to the relative position may be determined based on the distance mapping relationship. Specifically, the distance mapping relationship includes a plurality of distance intervals and a distance parameter corresponding to each distance interval, and the distance interval to which the relative position belongs may be determined by the distance mapping relationship, and then the distance parameter corresponding to the distance interval is used as the target reference parameter.
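Both the angle mapping and the distance mapping reduce to an interval lookup: find the interval the relative position falls into and return that interval's preset parameter. The sketch below shows the angle case; the interval boundaries and angle parameters are illustrative assumptions.

```python
import bisect

# Hypothetical angle mapping relationship: interval edges (degrees) and
# the preset angle parameter for each of the five resulting intervals.
ANGLE_BOUNDARIES = [-45.0, -15.0, 15.0, 45.0]
ANGLE_PARAMETERS = [-60.0, -30.0, 0.0, 30.0, 60.0]

def target_angle_parameter(relative_angle_deg):
    """Map the user's relative angle to the preset angle parameter of the
    interval it falls into."""
    i = bisect.bisect_right(ANGLE_BOUNDARIES, relative_angle_deg)
    return ANGLE_PARAMETERS[i]

target_angle_parameter(0.0)    # frontal interval
target_angle_parameter(-20.0)  # slightly to the left
```

The distance mapping works identically, with distance interval edges and distance parameters in place of the angle values.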
In some embodiments, the target reference parameter corresponding to the relative position may be determined among the plurality of preset reference parameters based on an optimal path solving algorithm, such as the Dijkstra, A*, SPFA, Bellman-Ford, or Floyd-Warshall algorithm, which is not limited herein. Through such an algorithm, the reference parameter closest to the relative position can be selected from the preset reference parameters as the target reference parameter, so that the simulated digital human's orientation angle comes as close as possible to the user's current relative position.
Step S240: inputting the target reference parameter into the preset simulated digital human model, and taking the output simulated digital human image as the target simulated digital human image.
The preset simulated digital human model is a model trained in advance from multiple sample images containing the real-person model and the reference parameter corresponding to each sample image; the simulated digital human model outputs the simulated digital human image corresponding to a sample image according to the input reference parameter. Specifically, multiple images of the real-person model corresponding to different reference parameters may be collected by the image acquisition device as sample images, with the reference parameter corresponding to each sample image recorded. Optionally, each reference parameter may also correspond to sample images of the real-person model in several different poses; for example, four images of the real-person model with the four expressions of joy, anger, sorrow, and happiness may be collected from the same camera view as the sample images corresponding to that reference parameter.
The simulated digital human model can comprise a feature generation model and an image generation model, wherein the feature generation model and the image generation model are preset deep learning-based models. Specifically, the feature generation model is used for acquiring feature parameters of the real model in the sample image corresponding to the reference parameters according to the input reference parameters, wherein the feature parameters of the real model are features obtained by extracting face key points, posture key points, contour key points and the like of the real model in the image. And the image generation model is used for generating a corresponding simulation digital human image according to the characteristic parameters of the real human model.
After the target reference parameter is obtained, the target reference parameter can be input into a preset simulation digital human model, the characteristic parameter of the real human model in the sample image corresponding to the target reference parameter is obtained through the depth generation model, and the corresponding simulation digital human image is generated according to the characteristic parameter through the image generation model and serves as the target simulation digital human image.
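The two-stage inference described above — a feature generation model followed by an image generation model — can be sketched as a simple composition. The two stages below are stubs standing in for trained deep models; their inputs, outputs, and field names are assumptions for illustration only.

```python
# Hypothetical sketch of the two-stage pipeline. In the application these
# would be trained deep learning models, not the stubs shown here.

def feature_generation_model(reference_parameter):
    """Stub: reference parameter -> feature parameters of the real-person
    model (e.g. face, posture, and contour key points)."""
    angle = reference_parameter["angle"]
    return {"face_keypoints": [("nose", angle * 0.1)], "angle": angle}

def image_generation_model(features):
    """Stub: feature parameters -> rendered simulated digital human image."""
    return {"image": "simulated_digital_human", "angle": features["angle"]}

def simulated_digital_human_model(reference_parameter):
    """Compose the two stages to produce the target image."""
    features = feature_generation_model(reference_parameter)
    return image_generation_model(features)

frame = simulated_digital_human_model({"angle": 30.0})
```

The composition mirrors the described flow: the target reference parameter yields the feature parameters of the corresponding sample image, and those features drive the image generation.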
In one approach, the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter, where the orientation angle characterizes the rotation of the real-person model in the sample image relative to facing straight ahead. Optionally, the orientation angle may include at least one of a horizontal angle and a vertical angle. The horizontal angle characterizes the real-person model's angle in the horizontal direction; for example, sample images collected by acquisition devices placed to the left and to the right of the real-person model correspond to different horizontal angles. The vertical angle characterizes the real-person model's angle in the vertical direction; for example, sample images collected by an acquisition device placed high and shooting downward and by one placed low and shooting upward correspond to different vertical angles.
As one way, the physical features of the simulated digital person in the target simulated digital person image are the same as the physical features of the real person model in the sample image corresponding to the target reference parameter. The physical features comprise features such as expressions, body shapes, action postures and textures. In this way, the resulting simulated digital person is as realistic as a real model, visually as a real model of looking at a camera.
As one way, the orientation angle of the simulated digital person in the target simulated digital person image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter, and the physical and appearance features of the simulated digital person in the target simulated digital person image are the same as the physical and appearance features of the real-person model in the sample image corresponding to the target reference parameter. In this way, the fidelity of the simulated digital person can be further improved.
By determining the target reference parameter corresponding to the relative position in the preset multiple reference parameters, the current position of the target user relative to the simulated digital person on the display screen can be converted into the position of the image acquisition device relative to the real person model when acquiring the sample image. By acquiring the target simulation digital human image corresponding to the target reference parameter, the visual experience that a target user looks at the real human model from the position of the image acquisition device can be realized, so that the simulation digital human image has a three-dimensional vivid 3D effect.
For example, when the target user is on the left side of the display screen, the target simulated digital human image includes the left side of the digital person's face, i.e., the simulated digital person is rotated to the right from the front; when the target user is directly facing the display screen, the target simulated digital human image includes the digital person's front face; when the target user is on the right side of the display screen, the target simulated digital human image includes the right side of the digital person's face, i.e., the simulated digital person is rotated to the left from the front. By displaying, according to the user's position, a simulated digital human image whose face is turned toward the target user, the effect of face-to-face interaction between the simulated digital person and the target user is achieved. For another example, when the target user is at different distances from the display screen, the size of the simulated digital person in the target simulated digital human image may also differ.
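The left/right mapping above can be sketched as a small helper that turns the user's position into a horizontal orientation angle for the digital person. This is a minimal illustration; the coordinate convention and the function name are assumptions, not taken from the patent:

```python
import math

def orientation_angle(user_x, user_z):
    """Horizontal orientation angle (degrees) for the simulated digital
    person, given the target user's position in screen-centred
    coordinates: x positive toward the screen's right, z the distance
    out from the screen. 0 means the front face; positive values rotate
    the digital person toward a user standing to the right."""
    return math.degrees(math.atan2(user_x, user_z))
```

A user directly in front (`user_x = 0`) yields 0, i.e. the front face; a user two metres out and two metres to the right yields 45 degrees.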
Step S250: and displaying the target simulation digital human image on a display screen.
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. Specifically, please refer to the following embodiments.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
It is understood that steps S210 to S240 may be performed locally by the terminal device, may also be performed in the server, and may also be performed by the terminal device and the server separately, and according to different actual application scenarios, tasks may be allocated according to requirements, which is not limited herein.
The method for simulating 3D digital human interaction provided by this embodiment acquires scene data acquired by an acquisition device, determines a target reference parameter among a plurality of preset reference parameters according to a relative position if a target user is located in a preset area, inputs the target reference parameter into a preset simulated digital human model, takes an output simulated digital human image as a target simulated digital human image, and displays the target simulated digital human image on a display screen. The target reference parameters are determined in the preset multiple reference parameters, and the target simulation digital person corresponding to the target reference parameters is generated, so that the presentation angle of the simulation digital person can face to a target user, and the target simulation digital person image is generated according to the sample image containing the real model, so that the realistic effect similar to the real model can be realized.
Referring to fig. 4, fig. 4 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S310 to S360.
Step S310: and acquiring scene data acquired by the acquisition device.
Step S320: and if the target user exists in the scene according to the scene data, processing the scene data to acquire the relative position of the target user and the display screen.
In some embodiments, the scene data may include a scene image, and the head information in the scene image may be identified; the number of users in the scene image is acquired according to the head information; if there is exactly one user, the identified user is taken as the target user, and the scene image is processed to acquire the relative position of the target user and the display screen. Optionally, if there are multiple users, whether an interactive instruction input by a user is acquired is monitored; if an interactive instruction input by a user is acquired, the user corresponding to the interactive instruction is taken as the target user. Specifically, please refer to the following embodiments.
Step S330: and if the target user is located in a preset area, determining the user view angle parameter according to the relative position.
The user view angle parameter is used to characterize the view angle of the target user toward a preset position on the display screen. The preset position may be a fixed position such as the center point or the frame of the display screen, or may be the display position of the simulated digital human image, which is not limited herein.
Specifically, the target user may be identified by processing the scene data to determine the user view angle parameter. For example, the face of the target user may be detected through an image detection algorithm to determine the position of the target user's eyes, and the user view angle parameter is then determined according to the position of the eyes and the preset position on the display screen.
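As a rough illustration, the view angle parameter described above can be computed from the eye position and the preset screen position with plain trigonometry. This is a hedged sketch: the coordinate convention (screen in the z = 0 plane, user at z > 0) and all names are assumptions:

```python
import math

def view_angle(eye, screen_point):
    """Horizontal and vertical viewing angles (degrees) from the target
    user's eyes toward a preset point on the display screen. Both points
    are (x, y, z) in the screen's spatial coordinate system."""
    dx = screen_point[0] - eye[0]
    dy = screen_point[1] - eye[1]
    dz = eye[2] - screen_point[2]  # distance out from the screen plane
    horizontal = math.degrees(math.atan2(dx, dz))
    vertical = math.degrees(math.atan2(dy, math.hypot(dx, dz)))
    return horizontal, vertical
```

An eye directly in front of the preset point yields (0, 0); moving the eye one metre to the left at one metre from the screen yields a horizontal angle of 45 degrees.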
By determining the user view angle parameter only when the target user is located in the preset area, the power consumption otherwise required to identify users outside the preset area and acquire their view angle parameters can be reduced, improving resource utilization.
In some embodiments, a target display position of the display screen may be determined according to the relative position, wherein the target display position is a display position of the target simulated digital human image on the display screen; and determining the user view angle parameters according to the relative position and the target display position.
Specifically, a corresponding relationship between a preset display position and a relative position of the user may be obtained, and after the relative position is obtained, the display position corresponding to the relative position is used as the target display position according to the corresponding relationship. For example, when the target user is located on the right side of the display screen, the target display location is a right area of the display screen, and when the target user is located on the left side of the display screen, the target display location is a left area of the display screen. Alternatively, when the target user moves from left to right, the simulated digital person may also move from the left area of the display screen to the right area of the display screen as if the target user were walking side-by-side with the simulated digital person. Alternatively, the target display position may also be a display position of the simulated digital human eye in the target simulated digital human image. Therefore, the visual angle parameter of the target user looking at the eyes of the simulated digital person can be acquired, and the effect that the simulated digital person looks at the target user like a real person is achieved.
In this way, different relative positions may correspond to different target display positions to make the simulated digital person more realistic and vivid. Particularly, under the condition that the display screen is a large screen, different target display positions can be determined according to different positions of a target user, so that the distance between a digital person and the target user is shortened, and more natural interpersonal interaction can be realized.
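One simple way to realize the position-dependent display position described above is to split a wide screen into regions and pick the one nearest the user. This is an illustrative sketch only; the three-region layout and thresholds are assumptions:

```python
def target_display_region(user_x, screen_width):
    """Choose the horizontal region of the display screen where the
    simulated digital human image is drawn, keeping the avatar near the
    target user. `user_x` is the user's horizontal offset (metres) from
    the screen centre; the screen spans [-screen_width/2, screen_width/2]."""
    if user_x < -screen_width / 6:
        return "left"
    if user_x > screen_width / 6:
        return "right"
    return "center"
```

With a 3 m screen, a user standing 1 m left of centre gets the left region, matching the side-by-side walking behaviour described above.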
Step S340: and determining the target reference parameter according to the user view angle parameter in a plurality of preset reference parameters.
The target simulated digital human image is generated from a sample image of the real-person model acquired by the image acquisition device, and the pose of the real-person model relative to the image acquisition device corresponds to the pose of the simulated digital person relative to the target user, achieving the effect that the target user appears to be watching the real-person model through the image acquisition device. Specifically, please refer to step S230.
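Selecting the target reference parameter from the preset set can then amount to a nearest-neighbour lookup over the angles at which the sample images were captured. This is a sketch under the assumption that each preset reference parameter is keyed by its capture angle:

```python
def nearest_reference(presets, user_angle):
    """Return the id of the preset reference parameter whose sample image
    was captured at the horizontal angle (degrees) closest to the user's
    viewing angle. `presets` maps reference ids to capture angles."""
    return min(presets, key=lambda ref_id: abs(presets[ref_id] - user_angle))
```

With presets captured at -45, 0, and 45 degrees, a user viewing at 10 degrees maps to the frontal sample, and one at 40 degrees to the right-hand sample.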
Step S350: and inputting the target reference parameters into a preset simulation digital human model, and taking the output simulation digital human image as a target simulation digital human image.
Step S360: displaying the target simulated digital human image on the display screen.
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. Specifically, please refer to the following embodiments.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
It is understood that steps S310 to S350 may be performed locally by the terminal device, may be performed in the server, and may also be performed by the terminal device and the server separately, and according to different practical application scenarios, tasks may be allocated according to requirements, which is not limited herein.
The method for simulating 3D digital human interaction provided by this embodiment includes acquiring scene data acquired by an acquisition device, processing the scene data to acquire a relative position between a target user and a display screen if the target user is determined to be present in a scene according to the scene data, determining a user view angle parameter according to the relative position if the target user is located in a preset region, determining the target reference parameter according to the user view angle parameter among a plurality of preset reference parameters, inputting the target reference parameter into a preset simulated digital human model, taking an output simulated digital human image as a target simulated digital human image, and displaying the target simulated digital human image on the display screen. By determining the visual angle parameters of the user, the target simulation digital human image corresponding to the visual angle parameters is obtained, the fidelity of the simulation digital human is further increased, and the human-computer interaction experience is optimized.
Referring to fig. 5, fig. 5 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S410 to S470.
Step S410: and acquiring scene data acquired by the acquisition device.
Step S420: and if the target user exists in the scene according to the scene data, processing the scene data to acquire the relative position of the target user and the display screen.
In some embodiments, the scene data may include a scene image, and the head information in the scene image may be identified; acquiring the number of users in a scene image according to the head information; if the number of the users is one, the identified users are taken as target users; the scene image is processed to obtain the relative position of the target user and the display screen. Optionally, if the number of the users is multiple, monitoring whether an interactive instruction input by the user is acquired; and if the interactive instruction input by the user is acquired, taking the user corresponding to the interactive instruction as a target user. Specifically, please refer to the following embodiments.
Step S430: and if the target user is located in the preset area, acquiring the interactive information.
If the target user is located in the preset area, whether interactive information input by the target user is acquired can be monitored, where the interactive information may be multi-modal information such as voice information, action information, and touch operation information. Optionally, the interactive information may be a preset interactive instruction input by the target user, or any multi-modal information that the terminal device can recognize.
It can be understood that monitoring user-input interactive information in real time requires considerable power. By setting the preset area, whether interactive information is acquired is monitored only when a target user is detected in the preset area, reducing the power consumed monitoring for interactive information when no target user is present.
Step S440: the interactive information is processed to obtain responsive voice information.
In some embodiments, when the interactive information is a preset interactive instruction, a correspondence between interactive instructions and response voice information may be preset, and the response voice information acquired based on that correspondence. For example, when the interactive information is a preset wake-up word, the corresponding response voice information may be "Hello, how can I help you?".
In some embodiments, when the interactive information is voice information input by the target user, the voice information may be converted into text by Automatic Speech Recognition (ASR); Natural Language Understanding (NLU) may then be performed on the text to analyze the voice information, and response text information obtained according to the result of the analysis; further, response voice information corresponding to the response text information may be obtained by Text-To-Speech (TTS). The natural language understanding may be realized by an intention recognition model, which may be implemented by a Recurrent Neural Network (RNN), a Convolutional Neural Network (CNN), a Variational Autoencoder (VAE), Bidirectional Encoder Representations from Transformers (BERT), a Support Vector Machine (SVM), and the like, without limitation.
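The NLU stage can range from keyword rules to a BERT classifier; as a stand-in for the full ASR → NLU → TTS pipeline, a toy keyword matcher shows where the response text comes from. This is purely illustrative; the names and the fallback reply are assumptions:

```python
def respond(recognized_text, intents):
    """Toy intent recognition: return the canned response text for the
    first intent whose keyword appears in the ASR output. A production
    system would replace this with an RNN/CNN/BERT/SVM intent model and
    feed the chosen response text to TTS."""
    lowered = recognized_text.lower()
    for keyword, response_text in intents.items():
        if keyword in lowered:
            return response_text
    return "Sorry, could you say that again?"
```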
Step S450: and determining a target reference parameter in a plurality of preset reference parameters according to the relative position.
Specifically, please refer to step S230.
Step S460: and inputting the target reference parameters and the response voice information into a preset simulation digital human model to obtain an output image sequence.
The image sequence is composed of a plurality of continuous target simulated digital human images, in which the action posture or facial expression of the simulated digital person changes continuously. Specifically, the semantic information and phoneme information corresponding to the response text information of the response voice may be acquired. The orientation angle of the simulated digital person is determined according to the target reference parameter so as to obtain a simulated digital person whose face is turned toward the target user, and an image sequence in which the simulated digital person's action posture or facial expression corresponds to the response voice information is then acquired according to the response voice information. Each target simulated digital human image in the image sequence is thus an image whose face is turned toward the target user and whose action state corresponds to the voice information.
In some embodiments, the simulated digital human model may include a feature generation model and an image generation model, and the target reference parameter may be input into the feature generation model to obtain an initial feature parameter, where the initial feature parameter is used to characterize a form of a real human model corresponding to the sample image; adjusting at least one parameter of expression parameters, action parameters and mouth shape parameters of the initial characteristic parameters according to the response voice information to obtain a parameter sequence, wherein the parameter sequence comprises a plurality of target characteristic parameters; and acquiring a target simulation digital human image corresponding to each target characteristic parameter based on the image generation model so as to obtain an image sequence corresponding to the parameter sequence.
The form of the real-person model corresponding to the sample image can comprise at least one of an orientation angle and a physical feature, namely, the orientation angle and the physical feature of the simulated digital person obtained according to the initial characteristic parameters can be the same as those of the real-person model. The preset simulation digital human model can also comprise an audio visual prediction model, and can acquire characteristic parameters corresponding to the response voice information according to the input response voice information and the initial characteristic parameters. Through the audio visual prediction model, at least one parameter of expression parameters, action parameters and mouth shape parameters of the initial characteristic parameters can be adjusted to obtain a parameter sequence consisting of a plurality of target characteristic parameters, so that the external expression of the simulated digital person corresponds to the response voice information. And then, acquiring a target simulation digital human image corresponding to each target characteristic parameter based on the image generation model to obtain an image sequence corresponding to the parameter sequence. By the method, more accurate characteristic parameters of the simulated digital person can be obtained, so that the image of the simulated digital person is more vivid and natural.
For example, when the target user is located on the left side of the screen and the response voice information is determined to be "hello" according to the interaction information, the corresponding target reference parameter can be determined according to the relative position of the user; the target reference parameter determines a sample image and thus the initial characteristic parameters of a simulated digital person whose orientation matches the position of the target user. The action parameter in the initial characteristic parameters is then modified into that of a hand-waving greeting gesture according to the response voice information, and the mouth shape parameter is modified into the mouth shape corresponding to "hello", so that a plurality of target characteristic parameters whose action and mouth shape correspond to the response voice information are obtained, and the corresponding continuously changing image sequence is acquired. Thereby, a simulated digital person whose face is turned toward the user and who greets the user with a wave can be displayed.
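The adjustment of the initial characteristic parameters into a per-frame parameter sequence can be sketched with a simple record type. The field names and the frame-by-frame mouth-shape list are illustrative assumptions; the patent does not fix a parameter layout:

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class FeatureParams:
    orientation: float  # horizontal orientation angle, degrees
    expression: str
    action: str
    mouth_shape: str

def build_parameter_sequence(initial, mouth_shapes, action):
    """Derive target characteristic parameters frame by frame: the
    orientation from the target reference parameter is kept, while the
    action and mouth shape are adjusted to match the response voice."""
    return [replace(initial, action=action, mouth_shape=m) for m in mouth_shapes]
```

Feeding each `FeatureParams` in the list to the image generation model would then yield the continuously changing image sequence.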
Step S470: and generating and outputting a video of the simulated digital person according to the image sequence, and synchronously playing response voice information.
After the image sequence is obtained, the plurality of target simulated digital human images in the image sequence can be synthesized, according to the response voice information, into a video of the simulated digital person matching the response voice information, and the response voice information is played synchronously while the video is displayed on the display screen. Thus the simulated digital person not only displays the angle corresponding to the target user's position, realizing interaction with the face turned toward the target user, but also exhibits the action state corresponding to the response voice information. In this way, the fidelity of the simulated digital person can be improved, thereby improving the user's human-computer interaction experience.
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. That is, the simulated digital person corresponds not only to the response voice information but also to the real-time relative position of the target user. Thereby simulating a digital person more flexibly and vividly. Specifically, please refer to the following embodiments.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
It is understood that steps S410 to S470 may be performed locally by the terminal device, may also be performed in the server, and may also be performed by the terminal device and the server separately, and according to different practical application scenarios, tasks may be allocated according to requirements, which is not limited herein.
The method for simulating 3D digital human interaction provided by the embodiment acquires scene data acquired by an acquisition device; if the target user exists in the scene according to the scene data, processing the scene data to acquire the relative position of the target user and the display screen; if the target user is located in the preset area, acquiring interactive information; processing the interactive information to obtain response voice information; determining a target reference parameter in a plurality of preset reference parameters according to the relative position; inputting the target reference parameters and the response voice information into a preset simulation digital human model to obtain an output image sequence; and generating and outputting a video of the simulated digital person according to the image sequence, and synchronously playing response voice information. Therefore, the simulated digital person with the face facing the target user can be displayed according to the position of the target user, the response voice information can be played while the video is displayed according to the action state corresponding to the response voice information of the digital person, the fidelity of the simulated digital person is further improved, and the human-computer interaction experience is optimized.
Referring to fig. 6, fig. 6 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S510 to S570.
Step S510: and acquiring a scene image acquired by the acquisition device.
The acquisition device can be a common camera or an image acquisition device for acquiring space depth information. For example, the image acquisition device may be a binocular camera, a structured light camera, a TOF camera, or the like.
Accordingly, the scene image may be a normal image of the current scene, or may be a depth image including depth information and color information.
Step S520: and judging whether a target user exists in the scene image.
The scene image may be analyzed to determine whether a target user is present within the scene. For example, a detection algorithm may be adopted to identify the head information in the scene image; the detection algorithm may be the YOLO (You Only Look Once) algorithm, an RCNN, an SSD (Single Shot MultiBox Detector), or any other algorithm capable of identifying a natural person in an image. Alternatively, whether the target user exists in the scene may be determined from other types of scene data. Please refer to step S120 for the specific description of determining that a target user exists in the scene according to the scene data, which is not repeated here.
Step S530: and if so, identifying the scene image to acquire the three-dimensional coordinates of the target user in the camera coordinate system.
Wherein the camera coordinate system takes the position of the acquisition device as its origin. Depending on the acquisition device, the scene image can be processed in different ways to identify the target user and obtain the target user's three-dimensional coordinates in the camera coordinate system. When the image is acquired by a common camera, the depth information corresponding to the target user in the image can be obtained through a depth estimation algorithm or the like, thereby determining the three-dimensional coordinates. When the image is a depth image, the three-dimensional coordinates of the target user in the camera coordinate system can be calculated from the depth information. For example, when the acquisition device is a binocular camera, binocular ranging may be employed to determine the three-dimensional coordinates corresponding to the target user; when the acquisition device is a structured light camera, triangulation may be adopted; when the acquisition device is a TOF camera, the time of flight of a light pulse from the TOF camera's transmitter to the target object and back to its receiver can be calculated per pixel, thereby determining the three-dimensional coordinates corresponding to the target user.
Optionally, the camera calibration may also be performed on the acquisition device in advance to obtain the camera external parameters and the camera internal parameters of the acquisition device, and the three-dimensional coordinates of the target user are accurately obtained by combining the camera parameters.
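With calibrated intrinsics, back-projecting a pixel with known depth into the camera coordinate system follows the standard pinhole model. A minimal sketch (fx, fy are focal lengths in pixels; cx, cy is the principal point):

```python
def pixel_to_camera(u, v, depth, fx, fy, cx, cy):
    """Back-project image pixel (u, v) with measured depth (metres) into
    the camera coordinate system using pinhole camera intrinsics."""
    x = (u - cx) * depth / fx
    y = (v - cy) * depth / fy
    return (x, y, depth)
```

A pixel at the principal point maps straight onto the optical axis; pixels offset from it scale with depth, which is how the target user's head position becomes a 3-D camera-frame coordinate.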
Step S540: and acquiring the position relation between the acquisition device and the display screen, and determining the conversion relation between the camera coordinate system and the space coordinate system according to the position relation.
The spatial coordinate system takes the position of the display screen as an origin, and can be used for representing position coordinates in the real world. The position relationship between the acquisition device and the display screen can be acquired in advance, and the coordinates of the image acquisition device in the space coordinate system are determined according to the position relationship, so that the conversion relationship between the camera coordinate system and the space coordinate system is obtained.
The position of the display screen may be a fixed position such as the center point or the frame of the display screen, or the display position of the simulated digital human image, which is not limited herein. In some embodiments, the display position of the simulated digital human image may vary depending on the relative position of the target user and the display screen.
Step S550: and determining the relative position of the target user and the display screen in the space coordinate system based on the conversion relation and the three-dimensional coordinates.
Wherein the relative position includes at least one of a relative distance and a relative angle. The relative position of the target user and the display screen may be determined in the spatial coordinate system based on the conversion relationship and the three-dimensional coordinates. In this way, a relatively accurate relative position of the target user with respect to the display screen can be obtained.
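The conversion from camera coordinates to the screen-origin spatial coordinate system is a fixed rigid transform determined by how the camera is mounted relative to the screen. A minimal sketch using plain tuples; a real system would use the calibrated rotation and translation:

```python
def camera_to_screen(point_cam, rotation, translation):
    """Transform a 3-D point from the camera coordinate system into the
    spatial coordinate system whose origin is on the display screen.
    `rotation` is a 3x3 matrix (rows as tuples); `translation` is the
    camera position expressed in screen coordinates."""
    x, y, z = point_cam
    return tuple(
        rotation[i][0] * x + rotation[i][1] * y + rotation[i][2] * z + translation[i]
        for i in range(3)
    )
```

For a camera mounted half a metre above the screen origin with no rotation, the transform just shifts the y coordinate.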
In some embodiments, the eyes of the target user may be identified by a detection algorithm, with the position of the eyes taken as the three-dimensional coordinates of the target user in the camera coordinate system. The conversion relationship between the camera coordinate system and the spatial coordinate system is then determined according to the acquisition device and the display position of the simulated digital human image on the display screen, where the display position may be the position of the simulated digital person's eyes. In this way, the relative position between the target user's eyes and the simulated digital person's eyes on the display screen can be determined in the spatial coordinate system, so that a simulated digital person that is not only angularly oriented toward the target user but whose eyes also look at the target user can be acquired according to the relative position.
Step S560: and if the target user is located in the preset area, acquiring a target simulation digital human image corresponding to the relative position based on the preset simulation digital human model.
In some embodiments, interactive information may also be obtained; the interactive information is processed to obtain response voice information; and the target reference parameters and the response voice information are input into the preset simulated digital human model to obtain an output image sequence, where the image sequence is composed of a plurality of continuous target simulated digital human images. Specifically, please refer to the foregoing embodiments.
Step S570: and displaying the target simulation digital human image on a display screen.
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. Specifically, please refer to the following embodiments.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
It is understood that steps S510 to S560 may be performed locally by the terminal device, may be performed in the server, and may also be performed by the terminal device and the server separately, and according to different practical application scenarios, tasks may be allocated according to requirements, which is not limited herein.
The method for simulated 3D digital human interaction provided by this embodiment acquires a scene image acquired by the acquisition device; judges whether a target user exists in the scene image; if so, identifies the scene image to acquire the three-dimensional coordinates of the target user in the camera coordinate system, acquires the positional relationship between the acquisition device and the display screen, and determines the conversion relationship between the camera coordinate system and the spatial coordinate system according to that relationship; determines the relative position of the target user and the display screen in the spatial coordinate system based on the conversion relationship and the three-dimensional coordinates; if the target user is located in the preset area, acquires a target simulated digital human image corresponding to the relative position based on the preset simulated digital human model; and displays the target simulated digital human image on the display screen. By acquiring the positional relationship between the acquisition device and the display screen, the positions of the target user and the display screen can be accurately determined, so that a simulated digital person whose face accurately faces the target user is acquired, giving the virtual image higher fidelity.
Referring to fig. 7, fig. 7 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application, and the method is applied to the terminal device, and includes steps S610 to S670.
Step S610: and acquiring scene data acquired by the acquisition device.
Step S620: head information in a scene image is identified.
The scene data includes a scene image, and the head information in the scene image may be identified by a detection algorithm. The detection algorithm may be YOLO (You Only Look Once), R-CNN, SSD (Single Shot MultiBox Detector), or any other algorithm capable of identifying a natural person in an image.
In some embodiments, whether a target user exists in the scene may also be determined from scene data obtained by a sensor, and the head information in the scene image is identified only when it is determined that a target user exists. Determining presence from sensor data consumes little power, whereas image recognition consumes much more; performing image recognition only when a target user is present in the scene therefore reduces the power consumed by image recognition.
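The power-saving gate described above can be sketched as follows; the function names and the stub detector are illustrative, and any head detector (YOLO, R-CNN, SSD, ...) could be plugged in.

```python
def process_frame(sensor_presence, scene_image, detect_heads):
    # Run the expensive head-detection step only when the low-power
    # presence sensor (e.g. infrared or ultrasonic) already reports a
    # person nearby; otherwise skip image recognition entirely.
    if not sensor_presence:
        return []  # nobody around: no recognition, no extra power spent
    return detect_heads(scene_image)

# Stub detector standing in for a real model; returns head bounding boxes.
heads = process_frame(True, "frame", lambda img: [(10, 20, 40, 40)])
```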
Step S630: and acquiring the number of users in the scene image according to the head information.
By identifying the head information in the scene image, the number of users in the scene image, that is, the number of users in the current scene can be determined.
In some embodiments, if there are multiple users, whether an interaction instruction input by a user is acquired is monitored; if an interaction instruction input by a user is acquired, the user corresponding to the interaction instruction is taken as the target user. For example, when multiple target users are located in the preset area, the simulated digital person may be kept facing straight ahead and may greet all of the users or none of them; when a particular target user talks with the digital person, the digital person turns toward that target user to interact. When the user leaves the interaction area, a preset simulated digital human image in the to-be-awakened state may be displayed.
The interaction instruction may be preset multi-modal information. Specifically, the interaction instruction may be multi-modal information such as voice information, a motion instruction, or a touch operation. The voice information may be voice information containing preset keywords, and the interaction intention of the user can be obtained by performing intention recognition on the voice information; the motion instruction may be a preset action, gesture, or the like used for interaction, such as waving toward the screen, which is not limited in this embodiment.
As one mode, sound information in a scene may be collected by a microphone, and whether the sound information includes voice information of a user may be determined by human voice detection. Optionally, the preset keywords may also be detected by the acoustic model to further determine whether to acquire the interaction instruction input by the user. When the interactive instruction is voice information, as a mode, the direction of a sound source of the voice information can be determined in modes of sound distance measurement and the like, so that a user in the direction is used as a target user; alternatively, the scene image may be processed to recognize lip movements of a plurality of users, and a user who inputs an interactive instruction may be determined by lip recognition, and the user may be a target user.
Alternatively, motion recognition may be performed on the scene image to determine whether there is a motion instruction input by a user, and the user who input the motion instruction is taken as the target user. For example, gesture recognition may be performed on the scene image to detect whether a user is waving toward the screen.
As another mode, whether to acquire a touch operation input by the user may be detected by the screen sensor, and if so, the user who inputs the touch operation is taken as a target user.
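The target-user selection among multiple users can be sketched as a small dispatch over the instruction modality; the data schema (a `kind` field, a sound-source angle for voice, a resolved `user_id` for gesture/touch) is an assumption for illustration, not the patent's interface.

```python
from dataclasses import dataclass

@dataclass
class User:
    user_id: int
    angle_deg: float  # direction of the user as seen from the screen

def select_target_user(users, instruction):
    # Pick the target user for a multimodal interaction instruction.
    if not users:
        return None
    if len(users) == 1:
        return users[0]  # a single user is taken as the target directly
    kind = instruction.get("kind")
    if kind == "voice":
        # Voice: match the sound-source direction (e.g. from acoustic
        # ranging) to the nearest detected user.
        src = instruction["source_angle_deg"]
        return min(users, key=lambda u: abs(u.angle_deg - src))
    if kind in ("gesture", "touch"):
        # Gesture/touch: upstream recognition already identified the user.
        uid = instruction["user_id"]
        return next((u for u in users if u.user_id == uid), None)
    return None

users = [User(1, -20.0), User(2, 15.0)]
target = select_target_user(users, {"kind": "voice", "source_angle_deg": 12.0})
```

Lip-movement recognition, mentioned above as another way to attribute a voice instruction, would replace the angle-matching branch with a per-user lip-activity score.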
In still other embodiments, when there are multiple users, each user in the scene image may be taken as a target user, and multiple first relative positions of the multiple target users with respect to the display screen may be obtained; if the multiple target users are located in the preset area, multiple target avatar images corresponding to the multiple first relative positions may be obtained based on a preset avatar model, and the multiple target avatar images may be displayed on the display screen to interact with the target users respectively. In this way each target user can interact with an avatar face to face, improving interaction efficiency.
Step S640: and if the number of the users is one, taking the identified user as a target user.
When the number of users in the scene image is one, namely the number of users in the current scene is one, the users are taken as target users.
Step S650: the scene image is processed to obtain the relative position of the target user and the display screen.
Step S660: and if the target user is located in the preset area, acquiring a target simulation digital human image corresponding to the relative position based on the preset simulation digital human model.
In some embodiments, interaction information may also be acquired; the interaction information is processed to obtain response voice information; and the target reference parameters and the response voice information are input into the preset simulated digital human model to obtain an output image sequence, where the image sequence is composed of multiple continuous frames of the target simulated digital human image. For details, please refer to the foregoing embodiments.
Step S670: and displaying the target simulation digital human image on a display screen.
In some embodiments, the scene data is data collected in real time, and if a change in relative position is detected, a new target simulated digital human image is generated according to the changed relative position; displaying the new target simulated digital human image on the display screen. Specifically, please refer to the following embodiments.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
It is understood that steps S610 to S670 may be performed locally by the terminal device, in the server, or split between the terminal device and the server; tasks may be allocated as required by the actual application scenario, which is not limited herein.
The method for simulating 3D digital human interaction provided by this embodiment acquires scene data collected by the acquisition device; identifies head information in the scene image; acquires the number of users in the scene image according to the head information; if the number of users is one, takes the identified user as the target user; processes the scene image to obtain the relative position of the target user and the display screen; if the target user is located in the preset area, acquires a target simulated digital human image corresponding to the relative position based on a preset simulated digital human model; and displays the target simulated digital human image on the display screen. The number of users in the scene is determined by identifying the scene image, and the avatar with its face toward the users is displayed in different ways according to the number of target users in the preset area, which enriches the interaction modes and improves human-computer interaction efficiency.
Referring to fig. 8, fig. 8 is a schematic flowchart of a method for simulating 3D digital human interaction according to an embodiment of the present application; the method is applied to the terminal device and includes steps S710 to S760.
Step S710: and acquiring scene data acquired by the acquisition device.
The scene data is data acquired in real time.
Step S720: and if the target user exists in the scene according to the scene data, processing the scene data to acquire the relative position of the target user and the display screen.
In some embodiments, the scene data may include a scene image, and the head information in the scene image may be identified; acquiring the number of users in a scene image according to the head information; if the number of the users is one, the identified users are taken as target users; the scene image is processed to obtain the relative position of the target user and the display screen. Optionally, if the number of the users is multiple, monitoring whether an interactive instruction input by the user is acquired; and if the interactive instruction input by the user is acquired, taking the user corresponding to the interactive instruction as a target user. In particular, please refer to the foregoing embodiments.
Step S730: and if the target user is located in the preset area, acquiring a target simulation digital human image corresponding to the relative position based on the preset simulation digital human model.
In some embodiments, interaction information may also be acquired; the interaction information is processed to obtain response voice information; and the target reference parameters and the response voice information are input into the preset simulated digital human model to obtain an output image sequence, where the image sequence is composed of multiple continuous frames of the target simulated digital human image. For details, please refer to the foregoing embodiments.
Step S740: and displaying the target simulation digital human image on a display screen.
Step S750: if a change in the relative position is detected, a new target simulated digital human image is generated based on the changed relative position.
After the target simulated digital human image is displayed on the display screen, the relative position between the target user and the display screen can be detected in real time, and if the change of the relative position is detected, a new target simulated digital human image is generated according to the changed relative position. By detecting the change of the relative position, the corresponding target simulation digital person can be generated according to the real-time relative position of the user, so that the simulation digital person faces the target user at every moment, and the interaction is more natural and vivid.
In some embodiments, the target simulated digital human image displayed on the display screen is not updated if the change in relative position within a preset time is smaller than a preset threshold. The preset threshold may be at least one of a displacement threshold and a rotation angle threshold. Specifically, change parameters of the target user relative to the initial relative position within the preset time may be determined, where the change parameters include a displacement parameter and a rotation angle parameter; if the change parameters are smaller than the corresponding preset thresholds, no new target simulated digital human image is generated according to the changed relative position and the image displayed on the display screen is not updated. In this way a new target simulated digital human image is obtained only when the position change of the target user within the preset time exceeds the preset threshold, so that when the user's pose changes little within the preset time no new simulated digital person needs to be determined; the orientation of the displayed simulated digital person is still adjusted according to larger changes in the target user's relative position so that interaction remains natural, while the computing power and energy consumed by generating the simulated digital person in real time are saved.
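The threshold gate described above can be sketched as follows; the default threshold values are illustrative defaults, not taken from the embodiment.

```python
def needs_new_image(displacement_m, rotation_deg,
                    displacement_threshold=0.05, rotation_threshold=3.0):
    # Regenerate the simulated digital human image only when the user's
    # movement within the preset time exceeds either threshold; small
    # jitter leaves the displayed image untouched, saving the compute
    # of re-rendering on every frame.
    return (abs(displacement_m) >= displacement_threshold
            or abs(rotation_deg) >= rotation_threshold)

update = needs_new_image(0.01, 1.0)  # tiny jitter: keep current image
```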
In some embodiments, an image sequence including multiple target simulated digital human images may also be generated according to the target user's relative position before and after the change; that is, a time-ordered plurality of images transitioning from the previous target simulated digital human image to the target simulated digital human image corresponding to the changed relative position is acquired. A simulated digital human video can be generated from the image sequence and its timing to present a gradually changing, dynamic simulated digital person. For example, when the relative position between the user and the display screen changes, the viewing angle at which the target user looks at the display screen changes and the image of the simulated digital person seen by the target user is switched accordingly; the digital person displayed on the display screen then has the effect of a video shot by a camera circling the real mannequin, presenting the visual effect of a three-dimensional real person.
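The gradual transition described above can be sketched as a linear interpolation of the simulated digital person's orientation angle between the old and new relative positions; each interpolated angle would then select one frame of the image sequence. The frame count and angles are illustrative.

```python
def transition_angles(start_deg, end_deg, n_frames):
    # Orientation angles for a sequence of simulated-digital-human frames
    # that turn gradually from the previous relative position to the new
    # one, so the figure appears to rotate like a camera circling a real
    # mannequin rather than snapping to the new pose.
    if n_frames < 2:
        return [end_deg]
    step = (end_deg - start_deg) / (n_frames - 1)
    return [start_deg + i * step for i in range(n_frames)]

frames = transition_angles(0.0, 30.0, 4)
```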
In some embodiments, when it is detected that the target user leaves the preset area, the new target simulated digital human image may be a simulated digital human image in a preset to-be-awakened state. Meanwhile, the terminal device may be switched to the to-be-awakened state, reducing the power consumed by real-time interaction. Alternatively, when it is detected that the target user leaves the preset area, a simulated digital human image performing a preset action, such as waving goodbye, may be taken as the new target simulated digital human image.
Step S760: displaying the new target simulated digital human image on the display screen.
Displaying the new target simulated digital human image on the display screen. In some embodiments, a digital human video generated from an image sequence including a plurality of target simulated digital human images may also be displayed.
It is understood that steps S710 to S760 may be performed locally by the terminal device, in the server, or split between the terminal device and the server; tasks may be allocated as required by the actual application scenario, which is not limited herein.
It should be noted that, for parts not described in detail in this embodiment, reference may be made to the foregoing embodiments, and details are not described herein again.
The method for simulating 3D digital human interaction provided by this embodiment acquires scene data collected by the acquisition device; if it is determined from the scene data that a target user exists in the scene, processes the scene data to acquire the relative position between the target user and the display screen; if the target user is located in a preset region, acquires a target simulated digital human image corresponding to the relative position based on a preset simulated digital human model and displays it on the display screen; and if a change in the relative position is detected, generates a new target simulated digital human image according to the changed relative position and displays it on the display screen. By detecting the user's position in real time and updating the target simulated digital human image according to the user's relative position to the display screen, real-time face-to-face interaction between the target user and the simulated digital person is realized.
It should be understood that the foregoing examples are merely illustrative of the application of the method provided in the embodiments of the present application in a specific scenario, and do not limit the embodiments of the present application. The method provided by the embodiment of the application can also be used for realizing more different applications.
Referring to fig. 9, fig. 9 is a block diagram illustrating a structure of a simulated 3D digital human interaction device 800 according to an embodiment of the present application. As explained below with respect to the block diagram shown in fig. 9, the simulated 3D digital human interaction device 800 includes a data acquisition module 810, a position acquisition module 820, an image acquisition module 830, and a display module 840, wherein:
the data acquisition module 810 is used for acquiring scene data acquired by the acquisition device; a position obtaining module 820, configured to, if it is determined that a target user exists in a scene according to the scene data, process the scene data to obtain a relative position between the target user and a display screen; an image obtaining module 830, configured to obtain, based on a preset simulated digital human model, a target simulated digital human image corresponding to the relative position if the target user is located in a preset region, where the target simulated digital human image includes a simulated digital human whose face faces the target user, and the preset region is a region where a distance between the preset region and the display screen is smaller than a preset numerical value; a display module 840 for displaying the target simulated digital human image on the display screen.
Further, the preset simulation digital human model is a model obtained by training in advance according to a plurality of sample images containing a real human model and reference parameters corresponding to each sample image, the simulation digital human model is used for outputting a simulation digital human image corresponding to the sample image according to the input reference parameter, the image acquisition module 830 comprises a parameter determination sub-module and a parameter input sub-module, wherein the parameter determination submodule is used for determining a target reference parameter in a plurality of preset reference parameters according to the relative position, the reference parameters are used for characterizing the pose of the real-person model contained in the sample image relative to an image acquisition device acquiring the sample image, the parameter input submodule is used for inputting the target reference parameter into the preset simulation digital human model and taking the output simulation digital human image as the target simulation digital human image.
Further, the parameter determination submodule includes a first parameter determination unit and a second parameter determination unit, where the first parameter determination unit is configured to determine a user viewing angle parameter according to the relative position, and the user viewing angle parameter is used to represent a viewing angle of the target user toward a preset position of the display screen; the second parameter determining unit is configured to determine the target reference parameter according to the user perspective parameter in the preset multiple reference parameters.
Further, the first parameter determining unit comprises a position determining subunit and a viewing angle parameter determining subunit, wherein the position determining subunit is configured to determine a target display position of the display screen according to the relative position, and the target display position is a display position of the target simulated digital human image on the display screen; and the visual angle parameter determining subunit is configured to determine the user visual angle parameter according to the relative position and the target display position.
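A simplified sketch of the two units above: compute the user's horizontal viewing angle toward the display position of the digital person, then pick the closest preset reference parameter. It assumes a 2D geometry with the screen plane at z = 0 and reduces each reference parameter to the mannequin's orientation angle in its sample image; all names and preset values are illustrative.

```python
import math

def user_view_angle(user_pos, display_pos):
    # Horizontal viewing angle from the user toward the position on the
    # screen where the digital human is displayed (screen plane at z = 0,
    # user at positive z in front of it).
    dx = display_pos[0] - user_pos[0]  # horizontal offset to display point
    dz = user_pos[2]                   # user's distance in front of screen
    return math.degrees(math.atan2(dx, dz))

def nearest_reference_parameter(view_angle_deg, preset_angles):
    # Among the preset reference parameters (reduced here to orientation
    # angles of the real-person model in the sample images), pick the one
    # closest to the user's viewing angle as the target reference parameter.
    return min(preset_angles, key=lambda a: abs(a - view_angle_deg))

# User half a metre to the right, two metres from the screen; digital
# person displayed at the screen origin.
angle = user_view_angle((0.5, 0.0, 2.0), (0.0, 0.0, 0.0))
target = nearest_reference_parameter(angle, [-30, -15, 0, 15, 30])
```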
Further, the simulation 3D digital human interaction device 800 further includes an interaction information obtaining module and a voice information obtaining module, the interaction information obtaining module is configured to obtain interaction information, the voice information obtaining module is configured to process the interaction information to obtain response voice information, the parameter input sub-module includes an image sequence obtaining unit, the image sequence obtaining unit is configured to input the target reference parameter and the response voice information into the preset simulation digital human model to obtain an output image sequence, the image sequence is formed by multiple continuous frames of the target simulation digital human images, the display module 840 includes a video output unit, and the video output unit is configured to generate and output a video of a simulation digital human according to the image sequence and synchronously play the response voice information.
Further, the simulated digital human model comprises a feature generation model and an image generation model, the image sequence acquisition unit comprises an initial feature parameter acquisition subunit, a parameter sequence acquisition subunit and an image sequence acquisition subunit, wherein the initial feature parameter acquisition subunit is used for inputting the target reference parameter into the feature generation model to acquire an initial feature parameter, and the initial feature parameter is used for representing the form of the real human model corresponding to the sample image; the parameter sequence acquiring subunit is configured to adjust at least one of an expression parameter, an action parameter, and a mouth shape parameter of the initial feature parameter according to the response voice information to obtain a parameter sequence, where the parameter sequence includes a plurality of target feature parameters; the image sequence obtaining subunit is configured to obtain, based on the image generation model, a target simulated digital human image corresponding to each target characteristic parameter, so as to obtain the image sequence corresponding to the parameter sequence.
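A schematic sketch of how the parameter sequence might be assembled, assuming the response voice information has already been converted into per-frame mouth-openness values by some audio-to-viseme step (not specified in the embodiment); the parameter schema is invented for illustration and is not the model's actual interface.

```python
def build_parameter_sequence(initial_params, mouth_openness_per_frame):
    # Start from the initial characteristic parameters produced by the
    # feature generation model and override the mouth-shape parameter
    # frame by frame according to the response speech, yielding the
    # parameter sequence fed to the image generation model.
    sequence = []
    for openness in mouth_openness_per_frame:
        frame_params = dict(initial_params)  # copy; keep pose/expression
        frame_params["mouth"] = openness
        sequence.append(frame_params)
    return sequence

seq = build_parameter_sequence(
    {"pose": 15.0, "expression": "neutral", "mouth": 0.0},
    [0.2, 0.8, 0.4])
```

Expression and action parameters could be adjusted the same way, as the embodiment allows any of the three to vary.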
Further, the orientation angle of the simulated digital person in the target simulated digital person image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
Further, the physical and appearance features of the simulated digital person in the target simulated digital person image are the same as the physical and appearance features of the real person model in the sample image corresponding to the target reference parameter.
Further, the position obtaining module 820 includes a judging sub-module, a coordinate obtaining sub-module, a transformation relation determining sub-module, and a position determining sub-module, where the judging sub-module is configured to judge whether the target user exists in the scene image, and the coordinate obtaining sub-module is configured to, if yes, identify the scene image to obtain a three-dimensional coordinate of the target user in a camera coordinate system, where the camera coordinate system uses the position of the collecting device as an origin; the conversion relation determining submodule is used for acquiring the position relation between the acquisition device and the display screen and determining the conversion relation between the camera coordinate system and a space coordinate system according to the position relation, wherein the space coordinate system takes the position of the display screen as an origin; the position determining submodule is configured to determine the relative position of the target user and the display screen in the spatial coordinate system based on the conversion relation and the three-dimensional coordinate, where the relative position includes at least one of a relative distance and a relative angle.
Further, the position obtaining module 820 further includes an image recognition sub-module, a user number obtaining sub-module, and a first processing sub-module, where the image recognition sub-module is configured to recognize head information in the scene image, the user number obtaining sub-module is configured to obtain the number of users in the scene image according to the head information, and the first processing sub-module is configured to take the identified user as the target user if the number of users is one.
Further, the simulation 3D digital human interaction device 800 further includes an instruction monitoring sub-module and a second processing sub-module, where the instruction monitoring sub-module is configured to monitor whether to obtain an interaction instruction input by the user if the number of the users is multiple; and the second processing sub-module is used for taking the user corresponding to the interactive instruction as the target user if the interactive instruction input by the user is obtained.
Further, the scene data is data collected in real time, and after the target simulated digital human image is displayed on the display screen, the simulated 3D digital human interaction device 800 further includes a position detection module and a display update module, where the position detection module is configured to generate a new target simulated digital human image according to a changed relative position if the change of the relative position is detected; the display updating module is used for displaying the new target simulation digital human image on the display screen.
Referring to fig. 10, a block diagram of an electronic device according to an embodiment of the present application is shown. The electronic device 900 may be a smart phone, a tablet computer, an electronic book, or other electronic devices capable of running an application. The electronic device 900 in the present application may include one or more of the following components: a processor 910, a memory 920, and one or more applications, wherein the one or more applications may be stored in the memory 920 and configured to be executed by the one or more processors 910, the one or more programs configured to perform a method as described in the aforementioned method embodiments.
Processor 910 may include one or more processing cores. The processor 910 connects various components throughout the electronic device 900 using various interfaces and circuitry, and performs the various functions of the electronic device 900 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 920 and invoking data stored in the memory 920. Optionally, the processor 910 may be implemented in hardware using at least one of Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The processor 910 may integrate one or more of a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a modem, and the like, where the CPU mainly handles the operating system, user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; and the modem handles wireless communication. It is understood that the modem may also not be integrated into the processor 910 but instead be implemented by a separate communication chip.
The memory 920 may include a Random Access Memory (RAM) or a Read-Only Memory (ROM). The memory 920 may be used to store instructions, programs, code sets, or instruction sets. The memory 920 may include a program storage area and a data storage area, where the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (such as a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the electronic device 900 during use (e.g., phone book, audio and video data, chat log data), and so forth.
Referring to fig. 11, a block diagram of a computer-readable storage medium according to an embodiment of the present disclosure is shown. The computer-readable storage medium 1000 stores program code that can be called by a processor to execute the methods described in the above-described method embodiments.
The computer-readable storage medium 1000 may be an electronic memory such as a flash memory, an electrically-erasable programmable read-only memory (EEPROM), an erasable programmable read-only memory (EPROM), a hard disk, or a ROM. Alternatively, the computer-readable storage medium 1000 includes a non-volatile computer-readable storage medium. The computer readable storage medium 1000 has storage space for program code 1010 for performing any of the method steps described above. The program code can be read from or written to one or more computer program products. The program code 1010 may be compressed, for example, in a suitable form.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solutions of the present application, and not to limit the same; although the present application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; such modifications and substitutions do not necessarily depart from the spirit and scope of the corresponding technical solutions in the embodiments of the present application.

Claims (14)

1. A method for simulating 3D digital human interaction, comprising:
acquiring scene data acquired by an acquisition device;
if the target user exists in the scene according to the scene data, processing the scene data to obtain the relative position between the target user and the display screen, wherein the relative position comprises a relative distance;
if the relative distance between the target user and the display screen is smaller than a preset numerical value, the target user is located in a preset area, and a target reference parameter is determined in a plurality of preset reference parameters according to the relative position, wherein the reference parameters are used for representing the pose of a real mannequin contained in a sample image relative to an image acquisition device for acquiring the sample image;
inputting the target reference parameters into a preset simulated digital human model, wherein the preset simulated digital human model comprises a feature generation model and an image generation model, acquiring feature parameters of a real human model in a sample image corresponding to the target reference parameters based on the feature generation model, and generating a target simulated digital human image corresponding to the feature parameters based on the image generation model, wherein the target simulated digital human image comprises a simulated digital human with a face facing the target user, and the preset simulated digital human model is a deep learning model which is obtained in advance according to a plurality of sample images containing the real human model and reference parameters corresponding to each sample image;
displaying the target simulated digital human image on the display screen.
2. The method according to claim 1, wherein the determining a target reference parameter among a plurality of preset reference parameters according to the relative position comprises:
determining a user visual angle parameter according to the relative position, wherein the user visual angle parameter is used for representing the visual angle of the target user towards the preset position of the display screen;
and determining the target reference parameter according to the user view angle parameter in the preset multiple reference parameters.
3. The method of claim 2, wherein determining a user perspective parameter from the relative position comprises:
determining a target display position on the display screen according to the relative position, wherein the target display position is the display position of the target simulated digital human image on the display screen;
and determining the user view angle parameter according to the relative position and the target display position.
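The view-angle computation of claims 2-3 reduces to simple planar geometry. The coordinate convention below (screen along the x axis, z pointing out toward the user) is an assumption for illustration, not the patent's definition.

```python
import math


def user_view_angle(user_pos, display_x):
    """Angle (degrees) from the screen normal at the display position to the
    target user.

    user_pos:  (x, z) of the user in a screen coordinate system where the
               screen lies along the x axis and z points out toward the user.
    display_x: x coordinate where the digital human image is rendered.
    Both conventions are illustrative assumptions.
    """
    dx = user_pos[0] - display_x  # horizontal offset from the render position
    dz = user_pos[1]              # distance out from the screen plane
    return math.degrees(math.atan2(dx, dz))
```

A user standing directly in front of the render position yields 0 degrees; offsets to either side yield signed angles that can index into the preset reference parameters.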
4. The method of claim 1, wherein before the inputting the target reference parameter into a preset simulated digital human model, the method further comprises:
acquiring interactive information;
processing the interactive information to obtain response voice information;
the inputting the target reference parameter into a preset simulated digital human model comprises:
inputting the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence, wherein the image sequence is composed of a plurality of consecutive target simulated digital human images;
the displaying the target simulated digital human image on the display screen includes:
and generating and outputting a video of the simulated digital human according to the image sequence, and playing the response voice information synchronously.
5. The method of claim 4, wherein the inputting the target reference parameter and the response voice information into the preset simulated digital human model to obtain an output image sequence comprises:
inputting the target reference parameter into the feature generation model to obtain initial feature parameters, wherein the initial feature parameters are used for representing the form of the real-person model in the corresponding sample image;
adjusting at least one of an expression parameter, an action parameter and a mouth shape parameter of the initial feature parameters according to the response voice information to obtain a parameter sequence, wherein the parameter sequence comprises a plurality of target feature parameters;
and acquiring a target simulated digital human image corresponding to each target feature parameter based on the image generation model, so as to obtain the image sequence corresponding to the parameter sequence.
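The two-stage pipeline of claim 5 can be sketched as below. The two sub-models are represented by assumed callables, and the one-frame-per-audio-frame mouth-shape adjustment is an illustrative simplification of the claimed parameter adjustment.

```python
def build_image_sequence(target_reference, audio_frames, feature_model, image_model):
    """Sketch of claim 5: derive initial feature parameters from the target
    reference parameter, adjust them per audio frame to form a parameter
    sequence, then render one image per target feature parameter.

    feature_model / image_model are assumed stand-ins for the feature
    generation model and the image generation model.
    """
    initial = feature_model(target_reference)  # initial feature parameters
    parameter_sequence = []
    for frame in audio_frames:
        params = dict(initial)
        params["mouth_shape"] = frame          # adjust mouth shape to the audio
        parameter_sequence.append(params)
    # one target simulated digital human image per target feature parameter
    return [image_model(p) for p in parameter_sequence]
```

In the claimed method the expression and action parameters may be adjusted as well; only the mouth shape is varied here to keep the sketch short.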
6. The method according to any one of claims 1-5, wherein the orientation angle of the simulated digital human in the target simulated digital human image is the same as the orientation angle of the real-person model in the sample image corresponding to the target reference parameter.
7. The method of claim 6, wherein the physical features of the simulated digital human in the target simulated digital human image are the same as the physical features of the real-person model in the sample image corresponding to the target reference parameter.
8. The method according to any one of claims 1-5, wherein the scene data comprises a scene image, and the processing the scene data to obtain the relative position of the target user and the display screen if it is determined from the scene data that the target user is present in the scene comprises:
judging whether the target user exists in the scene image;
if so, identifying the scene image to acquire three-dimensional coordinates of the target user in a camera coordinate system, wherein the camera coordinate system takes the position of the acquisition device as its origin;
acquiring the positional relationship between the acquisition device and the display screen, and determining the conversion relationship between the camera coordinate system and a spatial coordinate system according to the positional relationship, wherein the spatial coordinate system takes the position of the display screen as its origin;
and determining the relative position of the target user and the display screen in the spatial coordinate system based on the conversion relationship and the three-dimensional coordinates, the relative position including at least one of a relative distance and a relative angle.
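The coordinate conversion of claim 8 is a rigid transform from the camera frame to the screen frame. A minimal sketch, assuming the conversion relationship is given as a 3x3 rotation matrix plus a translation vector and that the screen normal is the z axis (both conventions are assumptions):

```python
import math


def camera_to_screen(point_cam, rotation, translation):
    """Rigid transform from camera coordinates to screen (spatial) coordinates.

    rotation:    3x3 matrix as nested lists, translation: 3-vector; together
    they encode the assumed mounting of the camera relative to the screen.
    """
    return [
        sum(rotation[i][j] * point_cam[j] for j in range(3)) + translation[i]
        for i in range(3)
    ]


def relative_distance_and_angle(p):
    """Relative position per claim 8: distance from the screen origin, plus
    the horizontal angle off the screen normal (z axis, assumed convention)."""
    distance = math.sqrt(p[0] ** 2 + p[1] ** 2 + p[2] ** 2)
    angle = math.degrees(math.atan2(p[0], p[2]))
    return distance, angle
```

With the camera mounted at the screen origin and aligned with it, the rotation is the identity and the translation is zero; any offset mounting changes only these two inputs.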
9. The method of claim 1, wherein the scene data comprises a scene image, and if it is determined from the scene data that a target user is present in the scene, processing the scene data to obtain a relative position of the target user and the display screen comprises:
identifying head information in the scene image;
acquiring the number of users in the scene image according to the head information;
if the number of users is one, taking the identified user as the target user;
and processing the scene image to acquire the relative position of the target user and the display screen.
10. The method of claim 9, further comprising:
if the number of users is more than one, monitoring whether an interactive instruction input by a user is acquired;
and if an interactive instruction input by a user is acquired, taking the user corresponding to the interactive instruction as the target user.
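The target-user selection of claims 9-10 amounts to a small decision rule. The data shapes below (a list of detected users, an instruction record naming its sender) are illustrative assumptions:

```python
def choose_target_user(detected_users, interactive_instruction=None):
    """Sketch of claims 9-10: a single detected user becomes the target;
    with several users, wait for an interactive instruction and take the
    user who issued it as the target."""
    if len(detected_users) == 1:
        return detected_users[0]
    if interactive_instruction is not None:
        return interactive_instruction["user"]
    return None  # keep monitoring until an instruction arrives
```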
11. The method of claim 1, wherein the scene data is collected in real time, and after the displaying the target simulated digital human image on the display screen, the method further comprises:
if a change in the relative position is detected, generating a new target simulated digital human image according to the changed relative position;
displaying the new target simulated digital human image on the display screen.
12. An apparatus for simulated 3D digital human interaction, comprising:
the data acquisition module is used for acquiring scene data acquired by the acquisition device;
the position acquisition module is used for processing the scene data to acquire the relative position between the target user and the display screen if it is determined from the scene data that a target user is present in the scene, wherein the relative position comprises a relative distance;
the parameter determining module is used for determining a target reference parameter among a plurality of preset reference parameters according to the relative position if the relative distance between the target user and the display screen is less than a preset value, wherein the reference parameter is used for representing the pose of a real-person model contained in a sample image relative to an image acquisition device that acquired the sample image;
the image acquisition module is used for inputting the target reference parameter into a preset simulated digital human model, wherein the preset simulated digital human model comprises a feature generation model and an image generation model; the feature parameters of the real-person model in the sample image corresponding to the target reference parameter are acquired based on the feature generation model, and a target simulated digital human image corresponding to the feature parameters is generated based on the image generation model, wherein the target simulated digital human image comprises a simulated digital human whose face is oriented toward the target user, and the preset simulated digital human model is a deep learning model trained in advance on a plurality of sample images containing the real-person model and the reference parameter corresponding to each sample image;
and the display module is used for displaying the target simulated digital human image on the display screen.
13. An electronic device, comprising:
one or more processors;
a memory;
one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the method of simulating 3D digital human interaction of any of claims 1-11.
14. A computer-readable storage medium having program code stored therein, the program code being callable by a processor to perform the method of simulating 3D digital human interaction according to any of claims 1-11.
CN202110019675.0A 2021-01-07 2021-01-07 Simulation 3D digital human interaction method and device, electronic equipment and storage medium Active CN112379812B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202110019675.0A CN112379812B (en) 2021-01-07 2021-01-07 Simulation 3D digital human interaction method and device, electronic equipment and storage medium
PCT/CN2021/123815 WO2022148083A1 (en) 2021-01-07 2021-10-14 Simulation 3d digital human interaction method and apparatus, and electronic device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110019675.0A CN112379812B (en) 2021-01-07 2021-01-07 Simulation 3D digital human interaction method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN112379812A CN112379812A (en) 2021-02-19
CN112379812B true CN112379812B (en) 2021-04-23

Family

ID=74590186

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110019675.0A Active CN112379812B (en) 2021-01-07 2021-01-07 Simulation 3D digital human interaction method and device, electronic equipment and storage medium

Country Status (2)

Country Link
CN (1) CN112379812B (en)
WO (1) WO2022148083A1 (en)

Families Citing this family (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112379812B (en) * 2021-01-07 2021-04-23 深圳追一科技有限公司 Simulation 3D digital human interaction method and device, electronic equipment and storage medium
CN112927260B (en) * 2021-02-26 2024-04-16 商汤集团有限公司 Pose generation method and device, computer equipment and storage medium
CN113031768A (en) * 2021-03-16 2021-06-25 深圳追一科技有限公司 Customer service method, customer service device, electronic equipment and storage medium
CN113050791A (en) * 2021-03-16 2021-06-29 深圳追一科技有限公司 Interaction method, interaction device, electronic equipment and storage medium
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN112800206B (en) * 2021-03-24 2021-08-24 南京万得资讯科技有限公司 Crank call shielding method based on generative multi-round conversation intention recognition
CN113485633B (en) * 2021-07-30 2024-02-02 京东方智慧物联科技有限公司 Content display method, device, electronic equipment and non-transitory computer readable storage medium
CN114115527B (en) * 2021-10-29 2022-11-29 北京百度网讯科技有限公司 Augmented reality AR information display method, device, system and storage medium
CN114356092B (en) * 2022-01-05 2022-09-09 花脸数字技术(杭州)有限公司 Multi-mode-based man-machine interaction system for digital human information processing
CN115953521B (en) * 2023-03-14 2023-05-30 世优(北京)科技有限公司 Remote digital person rendering method, device and system
CN116563432B (en) * 2023-05-15 2024-02-06 摩尔线程智能科技(北京)有限责任公司 Three-dimensional digital person generating method and device, electronic equipment and storage medium
CN116796478B (en) * 2023-06-09 2023-12-26 南通大学 Data display method and device for antenna array visible area
CN117115321B (en) * 2023-10-23 2024-02-06 腾讯科技(深圳)有限公司 Method, device, equipment and storage medium for adjusting eye gestures of virtual character
CN117473880B (en) * 2023-12-27 2024-04-05 中国科学技术大学 Sample data generation method and wireless fall detection method

Citations (3)

Publication number Priority date Publication date Assignee Title
CN111443853A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human control method and device
CN111443854A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Action processing method, device and equipment based on digital person and storage medium
CN111736699A (en) * 2020-06-23 2020-10-02 上海商汤临港智能科技有限公司 Interaction method and device based on vehicle-mounted digital person and storage medium

Family Cites Families (7)

Publication number Priority date Publication date Assignee Title
US9135392B2 (en) * 2012-01-31 2015-09-15 Siemens Product Lifecycle Management Software Inc. Semi-autonomous digital human posturing
US10860752B2 (en) * 2015-08-25 2020-12-08 Dassault Systémes Americas Corp. Method and system for vision measure for digital human models
JP2018206063A (en) * 2017-06-05 2018-12-27 株式会社東海理化電機製作所 Image recognition device and image recognition method
CN111290682A (en) * 2018-12-06 2020-06-16 阿里巴巴集团控股有限公司 Interaction method and device and computer equipment
CN111309153B (en) * 2020-03-25 2024-04-09 北京百度网讯科技有限公司 Man-machine interaction control method and device, electronic equipment and storage medium
CN111880659A (en) * 2020-07-31 2020-11-03 北京市商汤科技开发有限公司 Virtual character control method and device, equipment and computer readable storage medium
CN112379812B (en) * 2021-01-07 2021-04-23 深圳追一科技有限公司 Simulation 3D digital human interaction method and device, electronic equipment and storage medium

Patent Citations (3)

Publication number Priority date Publication date Assignee Title
CN111443853A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Digital human control method and device
CN111443854A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Action processing method, device and equipment based on digital person and storage medium
CN111736699A (en) * 2020-06-23 2020-10-02 上海商汤临港智能科技有限公司 Interaction method and device based on vehicle-mounted digital person and storage medium

Also Published As

Publication number Publication date
WO2022148083A1 (en) 2022-07-14
CN112379812A (en) 2021-02-19

Similar Documents

Publication Publication Date Title
CN112379812B (en) Simulation 3D digital human interaction method and device, electronic equipment and storage medium
WO2021043053A1 (en) Animation image driving method based on artificial intelligence, and related device
CN110163054B (en) Method and device for generating human face three-dimensional image
CN110850983B (en) Virtual object control method and device in video live broadcast and storage medium
CN110390704B (en) Image processing method, image processing device, terminal equipment and storage medium
KR101894573B1 (en) Smart phone interface management system by 3D digital actor
CN110868635B (en) Video processing method and device, electronic equipment and storage medium
KR102491140B1 (en) Method and apparatus for generating virtual avatar
CN110555507B (en) Interaction method and device for virtual robot, electronic equipment and storage medium
WO2021212733A1 (en) Video adjustment method and apparatus, electronic device, and storage medium
CN111327772B (en) Method, device, equipment and storage medium for automatic voice response processing
CN110599359B (en) Social contact method, device, system, terminal equipment and storage medium
CN110737335B (en) Interaction method and device of robot, electronic equipment and storage medium
CN111538456A (en) Human-computer interaction method, device, terminal and storage medium based on virtual image
CN110794964A (en) Interaction method and device for virtual robot, electronic equipment and storage medium
US20230047858A1 (en) Method, apparatus, electronic device, computer-readable storage medium, and computer program product for video communication
CN111294665A (en) Video generation method and device, electronic equipment and readable storage medium
CN115909015B (en) Method and device for constructing deformable nerve radiation field network
WO2022143322A1 (en) Augmented reality interaction method and electronic device
CN112673400A (en) Avatar animation
CN114787759A (en) Communication support program, communication support method, communication support system, terminal device, and non-language expression program
CN113436602A (en) Virtual image voice interaction method and device, projection equipment and computer medium
CN112435316B (en) Method and device for preventing mold penetration in game, electronic equipment and storage medium
CN117370605A (en) Virtual digital person driving method, device, equipment and medium
CN114979789A (en) Video display method and device and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CB03 Change of inventor or designer information

Inventor after: Yang Guoji
Inventor after: Chang Xiangyue
Inventor after: Chen Longxiang
Inventor after: Wang Xinyu
Inventor after: Liu Yunfeng
Inventor after: Wu Yue

Inventor before: Yang Guoji
Inventor before: Chen Longxiang
Inventor before: Wang Xinyu
Inventor before: Liu Yunfeng
Inventor before: Wu Yue