CN117523088A - Personalized three-dimensional digital human holographic interaction forming system and method - Google Patents

Personalized three-dimensional digital human holographic interaction forming system and method

Info

Publication number: CN117523088A
Application number: CN202311455785.7A
Authority: CN (China)
Prior art keywords: dialogue, voice, user, interaction, module
Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Other languages: Chinese (zh)
Inventors: 惠一龙, 殷圣
Current Assignee: Xidian University (the listed assignees may be inaccurate; Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list)
Original Assignee: Xidian University
Filing date: 2023-11-03
Priority date: 2023-11-03 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Publication date: 2024-02-06
Application filed by Xidian University; priority to CN202311455785.7A; publication of CN117523088A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
        • G06T 17/00 Three-dimensional [3D] modelling, e.g. data description of 3D objects
        • G06T 13/00 Animation
            • G06T 13/20 3D [Three Dimensional] animation
                • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
        • G06T 19/00 Manipulating 3D models or images for computer graphics
            • G06T 19/006 Mixed reality
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
        • Y02D 10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Graphics (AREA)
  • Software Systems (AREA)
  • Geometry (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The invention discloses a personalized three-dimensional digital human holographic interaction forming system and method. A presentation interaction module receives the user's request information. A model generation module performs personalized three-dimensional modeling of the target person. A voice recognition module performs emotion recognition on the user and converts the voice interaction information into corresponding text. A dialogue generation module generates the corresponding dialogue interaction text. A voice synthesis module synthesizes the voice reply dialogue audio. An action generation module generates a lip-synchronized three-dimensional avatar posture model. The presentation interaction module presents the three-dimensional avatar posture through the terminal device and interacts with the user by voice. The invention further enriches and vividly reflects the traits, behavioral habits, and conversational characteristics of the virtualized object, so that the virtual digital human becomes a more personalized and intimate companion, and real-world users can immersively enjoy the sense of reality brought by virtual-real interaction.

Description

Personalized three-dimensional digital human holographic interaction forming system and method
Technical Field
The invention belongs to the field of computer technology and further relates to a personalized three-dimensional digital human holographic interaction forming system and method in the field of computer vision. The invention can be used to realize real-time voice dialogue interaction between a personalized three-dimensional avatar holographic projection system and a user, providing companionship to the user.
Background
With the development of the metaverse, holographic projection technology has been combined with AI digital human technology so that a three-dimensional avatar holographic projection system can hold real-time voice dialogue with a user, breaking down the barrier between the virtual world and the real world, strengthening the digital human's interaction and presentation capability, letting real-world users immersively enjoy the sense of reality brought by virtual-real interaction, and realizing virtual companionship.
Quick Communication (Shenzhen) Co., Ltd. discloses a method and device for realizing companionship based on mixed reality technology in its patent application (application number CN20161036528.X, publication number CN106775198A). The device comprises five modules: a model building module, a database module, a receiving module, a processing module, and a presentation module. The model building module generates a virtual model of the person. The database module establishes behavior and response data to obtain a database of the virtualized object's corresponding behavioral responses. The receiving module receives the user's call instruction and interaction instruction for the person. The processing module matches the virtual model corresponding to the called person after receiving the call instruction, and matches the person's behavior data corresponding to the interaction instruction after receiving that instruction. The presentation module updates the presentation of the person with the behavior and reaction data using laser holographic projection technology. The method uses mixed reality to let a real person interact with the virtual world, effectively improving the efficiency and effect of the interaction, but the device has the following shortcoming: because the model building module only uses the appearance, behavior, and response data of the virtualized object (that is, the real person) provided by the user, data acquisition for the virtualized object is incomplete, and language dialogue data, facial expression data, and the like are not collected. This weakens, to a certain extent, the sense of reality the real user can immersively enjoy from virtual-real interaction and thus the effect of virtual companionship.
Beijing Pinecone Electronics Co., Ltd. proposes a voice interaction method, device, and electronic apparatus in its patent application (application number CN202110760477.X, publication number CN113452853A). The method first obtains the user's physiological characteristic information and determines a three-dimensional virtual character from that information through an image decision model; it then receives the user's voice information through a receiving module and determines prediction information corresponding to the voice through a gesture decision model, the prediction information being used to determine the gesture with which the three-dimensional virtual character interacts with the user; finally, a presentation module presents the gesture of the three-dimensional virtual character on the display device of the terminal. The method models the user's image to obtain a three-dimensional virtual character, carries out language interaction with the real user, and can present the character and its actions on the display device, enriching the interaction between user and terminal and making the character more vivid. However, the method still has shortcomings: the three-dimensional virtual character is determined only from the user's physiological characteristic information, so personalized three-dimensional character modeling cannot be realized and diversified user requirements cannot be met; and the speech synthesis it uses makes the virtualized object talk mechanically and monotonously, with low similarity to the real voice, reducing the sense of reality the user can immersively enjoy from virtual-real interaction.
Disclosure of Invention
The invention aims to provide a personalized three-dimensional digital human holographic interaction forming system and method addressing the above defects of the prior art. It is intended to solve the problems that model building and data acquisition are incomplete, that personalized three-dimensional virtualized-object modeling cannot be realized, and that the similarity to the real voice in voice interaction is low.
The technical idea for realizing the purpose of the invention is as follows. The invention collects characteristic information data such as photos, videos, and dialogue audio of the target object, and extracts data such as the virtualized object's appearance data, lip movement data, facial expression data, action behavior data, dialogue timbre, and dialogue feature data. It then performs personalized three-dimensional avatar modeling of the target person according to the modeling requirements in the user's request information and the actual characteristic information data, thereby meeting diversified user requirements. Emotion recognition is performed on the user according to the voice interaction information in the user's request information, and the voice interaction information is converted into corresponding text. The dialogue scene between the avatar and the user is then simulated according to the user's emotion and the meaning of the text content, generating the avatar's dialogue interaction text. Voice reply dialogue audio with the target person's unique timbre and speaking style is synthesized from the avatar's dialogue timbre and dialogue feature data together with the dialogue interaction text. A lip-synchronized three-dimensional avatar posture model is then generated from the voice reply dialogue audio and sent to the presentation interaction module, further enriching and deepening the image and characteristics of the virtualized object and reflecting its traits and behavioral habits more accurately and vividly. Finally, the lip-synchronized three-dimensional avatar posture is presented through the terminal device and interacts with the user by voice.
The system comprises a model generation module, a voice recognition module, a dialogue generation module, a voice synthesis module, an action generation module, and a presentation interaction module; wherein:
the model generation module is used for performing personalized three-dimensional avatar modeling of the target person according to the modeling requirements in the user request information;
the voice recognition module is used for performing emotion recognition on the user according to the voice interaction information in the user request information, converting the voice interaction information into corresponding text, and sending the text to the dialogue generation module;
the dialogue generation module is used for simulating the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, generating the avatar's dialogue interaction text, and sending it to the voice synthesis module;
the voice synthesis module is used for synthesizing voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text, and sending the audio to the action generation module;
the action generation module is used for generating a lip-synchronized three-dimensional avatar posture model according to the voice reply dialogue audio and sending it to the presentation interaction module;
the presentation interaction module is used for receiving the user's request information about the target person, presenting the lip-synchronized three-dimensional avatar posture through the terminal device, and interacting with the user by voice.
The interaction forming method comprises the following specific steps:
step 1, receiving the user's request information about the target person;
step 2, the model generation module performs personalized three-dimensional avatar modeling of the target person according to the modeling requirements in the user request information;
step 3, the voice recognition module performs emotion recognition on the user according to the voice interaction information in the user request information and converts the voice interaction information into corresponding text;
step 4, the dialogue generation module simulates the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, and generates the avatar's dialogue interaction text;
step 5, the voice synthesis module synthesizes voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text;
step 6, the action generation module generates a lip-synchronized three-dimensional avatar posture model according to the voice reply dialogue audio;
step 7, the presentation interaction module presents the lip-synchronized three-dimensional avatar posture through the terminal device and interacts with the user by voice.
Compared with the prior art, the invention has the following advantages:
First, the model generation module used by the system of the invention collects the virtualized object's appearance data, lip movement data, facial expression data, action behavior data, voice dialogue timbre, and dialogue feature data from actual characteristic information data such as photos, videos, and dialogue audio of the virtualized object. This overcomes the incomplete data acquisition of the virtualized object in the prior art, further enriches and deepens the image and characteristics of the virtualized object, reflects its traits and behavioral habits more accurately and vividly, and gives the user a lifelike virtualized object with increased affinity.
Second, because the method uses three-dimensional virtualized-object modeling, a personalized model is built from the characteristic information data such as photos, videos, and audio of the virtualized object provided by the user. This overcomes the single image and simple posture of the virtualized object in the prior art, meets and adapts to the user's personalized requirements for appearance, turns the virtual digital human into a more personalized and intimate companion, improves the user's interaction experience with the virtualized object, and enhances the user's sense of participation and engagement.
Third, the speech synthesis technology used by the method simulates the target person's unique timbre and speaking habits by extracting dialogue timbre and dialogue feature data from the target person's dialogue audio, analyzes the emotion of the user's dialogue content to simulate the virtualized object's emotion, and combines this with the virtualized object's dialogue interaction text to synthesize voice reply dialogue audio with the target person's unique timbre and speaking style. This overcomes the mechanical and monotonous dialogue of the virtualized object in the prior art; the virtualized object's voice becomes more real and individual, with extremely high similarity to the real person's voice, strengthening the virtualized object's interaction and presentation capability, producing a rigorous and fine interaction effect, and letting real-world users immersively enjoy the sense of reality brought by virtual-real interaction.
Drawings
FIG. 1 is a block diagram of the apparatus of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a schematic flow chart of the three-dimensional avatar creation of the present invention;
FIG. 4 is a flow chart of the three-dimensional virtualized-object database of the invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and examples.
The device of the present invention is described in further detail with reference to fig. 1.
The device comprises a model generation module, a voice recognition module, a dialogue generation module, a voice synthesis module, an action generation module and a presentation interaction module. Wherein:
The model generation module is used for performing personalized three-dimensional avatar modeling of the target person according to the modeling requirements and actual characteristic information data in the user request information.
The voice recognition module is used for performing emotion recognition on the user according to the voice interaction information in the user request information, converting the voice interaction information into corresponding text, and sending the text to the dialogue generation module.
The dialogue generation module is used for simulating the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, generating the avatar's dialogue interaction text, and sending it to the voice synthesis module.
The voice synthesis module is used for synthesizing voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text, and sending the audio to the action generation module.
The action generation module is used for generating a lip-synchronized three-dimensional avatar posture model according to the voice reply dialogue audio and sending it to the presentation interaction module.
The presentation interaction module is used for receiving the user's request information about the target person, presenting the lip-synchronized three-dimensional avatar posture through the terminal device, and interacting with the user by voice.
The method of the present invention is described in further detail below with reference to fig. 2.
Step 1, the user enters a code to retrieve the three-dimensional avatar, dialogue timbre, and dialogue feature data from the three-dimensional virtualized-object database.
In the embodiment of the invention, a user whose target person has already been modeled in personalized three-dimensional form can directly enter the code in the three-dimensional virtualized-object database to retrieve the three-dimensional avatar model, dialogue timbre, and dialogue feature data.
Step 2, receiving the user's request information about the target person.
The request information is received using the open-source intelligent cloud wireless camera Internet of Things (IoT) platform; it comprises the user's modeling requirements for the target person, the voice interaction information, and the collected photos, videos, and dialogue audio of the target person received by the presentation interaction module. The intelligent cloud wireless camera IoT platform comprises a wireless network camera module, a microphone module, an intelligent cloud gateway, and a Web client; the Web client receives the user's voice interaction information about the target person.
Step 3, the voice recognition module performs emotion recognition on the user according to the voice interaction information in the user request information and converts the voice interaction information into corresponding text.
In the embodiment of the invention, the voice interaction information sent by the user is received at the Web client of the intelligent cloud wireless camera IoT platform. The voice recognition module then recognizes the content of the voice interaction information on the cloud server using the Baidu speech recognition tool, analyzes and understands the meaning of the interaction information, and uses the tool's Speech-to-Text function to map the speech signal of the voice interaction information to a text sequence, completing the conversion of the voice interaction information into corresponding text or commands. Here a text sequence is a set of textual representations composed of characters, words, or symbols arranged in linear left-to-right order, representing the information contained in the voice interaction information.
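For illustration only, the following is a minimal sketch of this speech-to-text step, assuming the baidu-aip Python SDK; the credentials, audio path, and language model ID (dev_pid) are placeholders rather than values from the patent.

```python
# Sketch of the speech-to-text step, assuming the baidu-aip SDK.
# APP_ID, API_KEY, SECRET_KEY and the audio file are placeholders.
from aip import AipSpeech

client = AipSpeech("APP_ID", "API_KEY", "SECRET_KEY")

def speech_to_text(pcm_path: str) -> str:
    """Map a 16 kHz mono PCM speech signal to a text sequence."""
    with open(pcm_path, "rb") as f:
        audio = f.read()
    # dev_pid 1537 selects the Mandarin model with basic English support.
    result = client.asr(audio, "pcm", 16000, {"dev_pid": 1537})
    if result.get("err_no") == 0:
        return result["result"][0]  # best recognition hypothesis
    raise RuntimeError(f"ASR failed: {result.get('err_msg')}")

# text = speech_to_text("user_utterance.pcm")
```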
In the embodiment of the invention, the voice recognition module also uses the open-source Turing Robot. For the dialogue content recognized and analyzed by the Baidu recognition tool, the Turing Robot's emotion recognition engine determines the emotion of the user's spoken dialogue by extracting the linguistic and acoustic features of the voice interaction information, where the linguistic features are the verbal information the voice interaction conveys, and the acoustic features include the mood, intonation, and emotional color in the user's voice interaction information.
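As a sketch of the acoustic side of such emotion recognition (the Turing Robot engine itself is not exposed here), the following extracts pitch, energy, and timbre features with librosa; the downstream four-class classifier is assumed, mirroring the calm/angry/sad/happy emotions used later in this text.

```python
# Sketch: extract acoustic features (intonation/pitch contour, energy,
# timbre via MFCCs) from a user utterance for an assumed emotion
# classifier over four classes (calm, angry, sad, happy).
import librosa
import numpy as np

def acoustic_features(wav_path: str) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=16000)
    f0 = librosa.yin(y, fmin=65, fmax=400, sr=sr)       # pitch contour
    rms = librosa.feature.rms(y=y)[0]                   # loudness/energy
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)  # timbre
    # Summarize each contour by mean and std for a fixed-size vector.
    feats = [f0.mean(), f0.std(), rms.mean(), rms.std()]
    feats += mfcc.mean(axis=1).tolist() + mfcc.std(axis=1).tolist()
    return np.array(feats)

# emotion = classifier.predict([acoustic_features("utterance.wav")])
```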
Step 4, the dialogue generation module simulates the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, and generates the avatar's dialogue interaction text.
In the embodiment of the invention, the dialogue generation module can generate the avatar's dialogue interaction text in the following two ways.
The first way to generate the avatar's dialogue interaction text is to use a pre-trained dialogue model (see the code sketch after these steps). Building the pre-trained dialogue model comprises the following steps:
In the first step, a large-scale corpus, QuAC, is adopted for training the dialogue model; it consists of approximately 14K crowd-sourced question-answer dialogues with a total of 98K question-answer pairs.
In the second step, a dialogue model capable of conversational question answering, T5-base, is adopted; the model can capture the statistical characteristics, grammar rules, semantic information, and embedded emotional information of the language, and can identify the user's intent to determine the operation the user requires.
In the third step, the T5-base dialogue model is trained in a supervised manner on the QuAC data set.
In the fourth step, the dialogue model is fine-tuned with labeled dialogue data so that it adapts to the corresponding dialogue task. Model performance can be further optimized by adjusting the amount of fine-tuning data, the number of fine-tuning passes, and so on.
In the fifth step, a dialogue manager is designed based on natural language processing (NLP) technology; it is mainly responsible for maintaining the context of the dialogue, ensuring that the dialogue model understands the user's intent and accordingly generates the avatar's dialogue interaction text.
In the sixth step, the trained T5-base dialogue model turns the text corresponding to the user's voice interaction information into the avatar's dialogue interaction text matched with the maximum probability.
In the seventh step, the generated dialogue interaction text of the avatar is post-processed to obtain the final dialogue interaction text.
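A minimal sketch of steps two through six follows, assuming the Hugging Face transformers library with T5-base; the single training pair stands in for the QuAC corpus, and the prompt format and hyperparameters are illustrative assumptions, not the patented configuration.

```python
# Sketch: supervised fine-tuning of T5-base on QuAC-style
# (context, question) -> answer pairs, then decoding the highest-
# probability reply with beam search.
import torch
from transformers import T5ForConditionalGeneration, T5Tokenizer

tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = T5ForConditionalGeneration.from_pretrained("t5-base")
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-5)

# One supervised step on a QuAC-style example (stand-in for the corpus).
src = "context: The user feels tired today. question: How should I reply?"
tgt = "You have worked hard today, remember to rest early."
batch = tokenizer(src, return_tensors="pt")
labels = tokenizer(tgt, return_tensors="pt").input_ids
loss = model(**batch, labels=labels).loss
loss.backward()
optimizer.step()
optimizer.zero_grad()

# Inference: beam search keeps the maximum-probability dialogue reply.
out = model.generate(**tokenizer(src, return_tensors="pt"),
                     num_beams=4, max_new_tokens=48)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```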
The second way to generate the avatar's dialogue interaction text is to use an existing open-source dialogue robot.
In the embodiment of the invention, the dialogue generation module uses the open-source Turing Robot. From the text content of the voice interaction information recognized and analyzed by the Baidu recognition tool, the Turing Robot's emotion recognition engine determines the emotion of the user's voice interaction; its scene dialogue function then builds different dialogue scenes, simulates the dialogue context, and responds emotionally, generating the avatar's dialogue interaction text. In this way, the Turing Robot imitates human emotion and thinking patterns in conversational interaction with the user according to the understood meaning and the simulated context, generating the avatar's dialogue interaction text.
Step 5, the voice synthesis module synthesizes voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text.
In the embodiment of the invention, the voice synthesis module uses the open-source Edge TTS. The server imports the target person's dialogue audio from the intelligent cloud wireless camera IoT platform into the TTS, which recognizes and extracts the target person's timbre and dialogue feature data and simulates the person's unique timbre and speaking habits. The TTS emotion simulation function analyzes the emotion of the user's voice interaction information to simulate the avatar's emotion, and finally the TTS text-to-speech function performs voice dialogue synthesis, producing voice reply dialogue audio with the target person's unique timbre, speaking style, and emotion.
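The following is a hedged sketch of the text-to-speech leg using the open-source edge-tts package. Note that edge-tts ships preset Microsoft neural voices, so reproducing the target person's own timbre as described above would additionally require a separate voice-cloning model; the voice name and rate here are illustrative.

```python
# Sketch of the text-to-speech leg using the edge-tts package.
# edge-tts uses preset neural voices; cloning the target person's
# timbre is a separate problem not shown here.
import asyncio
import edge_tts

async def synthesize(reply_text: str, out_path: str = "reply.mp3") -> None:
    communicate = edge_tts.Communicate(
        reply_text,
        voice="zh-CN-YunxiNeural",  # preset voice approximating the target
        rate="+0%",                 # speaking-rate adjustment
    )
    await communicate.save(out_path)

asyncio.run(synthesize("你今天辛苦了，早点休息。"))
```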
and 6, generating a three-dimensional virtual image posture model with synchronous lip line pitch according to the voice reply dialogue audio by the action generating module.
In the embodiment of the invention, the action generating module generates the three-dimensional virtual image posture model with synchronous lip pitch, and the three-dimensional virtual image posture model has the following two methods.
The first method is to use the processing software of the opened source to perform three-dimensional virtual image modeling.
The action generating module in the embodiment of the invention uses the open source content Lipsync technology in the unit 3D software, and the Oculus Lipsync technology analyzes the voice reply dialogue audio in real time, then predicts a group of pronunciation mouth shapes for lip animation of the virtual image, and realizes the gesture and expression of the lips and the face of the virtual image, thereby realizing the lip synchronization effect.
The second method is to use the lip-line sound synchronization algorithm with open source to perform three-dimensional virtual image modeling in the cloud server.
According to the embodiment of the invention, the action generating module uses an Audio2Face deep learning algorithm which is already in an open source for a lip synchronization algorithm, and the algorithm extracts lip movement data, facial expression data and action behavior data by analyzing video and Audio data of the target person in a database, and draws mouth shapes and facial expressions of a virtualized object according to language reply dialogue Audio of the virtualized object, so that the lip synchronization effect is realized.
According to the embodiment of the invention, the action generating module imports the virtualized object with the lip synchronization effect into 3ds Max software with an opened source, synthesizes corresponding bip actions according to the voice reply dialogue audios of the turing robot and the user, such as actions conforming to dialogue contents, such as running, waving hands, squatting, walking and the like, and finally adds the bip actions into the three-dimensional virtualized object, finally renders and exports a video with an avi format, and obtains a three-dimensional virtualized image model with synchronous lip pitch and precision to realize the lip pitch synchronization effect.
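Neither Oculus Lipsync nor Audio2Face is driven from a few lines of Python here, so the following sketch only illustrates the shared underlying idea: mapping a timed phoneme sequence from the reply audio onto viseme (mouth-shape) weights that drive the avatar's lips. The phoneme timeline and viseme table are simplified stand-ins for what those tools compute internally.

```python
# Minimal illustration of lip-sync: map a timed phoneme sequence onto
# viseme blendshape weights that keyframe the avatar's mouth.
PHONEME_TO_VISEME = {
    "AA": "jaw_open", "IY": "lips_wide", "UW": "lips_round",
    "M": "lips_closed", "B": "lips_closed", "P": "lips_closed",
    "F": "lower_lip_bite", "V": "lower_lip_bite",
}

def viseme_track(phonemes):
    """phonemes: list of (symbol, start_sec, end_sec) tuples."""
    track = []
    for symbol, start, end in phonemes:
        shape = PHONEME_TO_VISEME.get(symbol, "neutral")
        track.append({"shape": shape, "start": start, "end": end,
                      "weight": 1.0})  # full activation; real systems blend
    return track

# Example: "ma" -> closed lips, then an open jaw, keyframed on the model.
print(viseme_track([("M", 0.00, 0.08), ("AA", 0.08, 0.30)]))
```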
Step 7, the presentation interaction module presents the lip-synchronized three-dimensional avatar posture through the terminal device and interacts with the user by voice.
In the embodiment of the invention, the terminal device of the presentation interaction module retrieves the three-dimensional avatar through its code and presents the lip-synchronized three-dimensional avatar posture with holographic projection technology; the posture comprises at least one of the avatar's facial expression, limb movement, mouth-lip movement, and head movement. The invention adopts two different holographic projection methods for projection interaction.
The first holographic projection method is a 3D holographic fan projection method.
The 3D holographic fan projection method uses a 3D holographic fan device: turn on the holographic fan, link the mobile phone to it over Wi-Fi, open the Holoscope software on the phone, and upload the front-view video of the three-dimensional virtualized-object animation to realize 3D holographic fan projection.
The second holographic projection method is the pyramid projection method.
In the pyramid projection method of the embodiment of the invention, pyramid glass equipment is used for projection. The front, back, left, and right videos of the three-dimensional virtualized-object animation are exported from the 3ds Max software of step 6 and edited with Premiere (PR) software into a single video in which the four views of the three-dimensional virtualized object sit back to back; the exported video is finally imported into the pyramid projection equipment for playback, realizing pyramid projection.
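As an illustration of this four-view composition (normally done by hand in PR), the sketch below arranges the front, back, left, and right renders on a black canvas in the cross layout that pyramid (Pepper's ghost) glass expects, using OpenCV; file names and tile size are placeholders.

```python
# Sketch: compose four avatar renders into the pyramid-projection
# cross layout. Input file names and tile size are placeholders.
import cv2
import numpy as np

caps = {name: cv2.VideoCapture(f"{name}.mp4")
        for name in ("front", "back", "left", "right")}
size = 480  # side length of each view tile, illustrative
out = cv2.VideoWriter("pyramid.mp4", cv2.VideoWriter_fourcc(*"mp4v"),
                      30, (size * 3, size * 3))

while True:
    frames = {}
    for name, cap in caps.items():
        ok, frame = cap.read()
        if not ok:
            break
        frames[name] = cv2.resize(frame, (size, size))
    if len(frames) < 4:  # any stream ended
        break
    canvas = np.zeros((size * 3, size * 3, 3), dtype=np.uint8)
    canvas[0:size, size:2*size] = cv2.rotate(frames["back"], cv2.ROTATE_180)
    canvas[2*size:3*size, size:2*size] = frames["front"]
    canvas[size:2*size, 0:size] = cv2.rotate(
        frames["left"], cv2.ROTATE_90_CLOCKWISE)
    canvas[size:2*size, 2*size:3*size] = cv2.rotate(
        frames["right"], cv2.ROTATE_90_COUNTERCLOCKWISE)
    out.write(canvas)

out.release()
```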
In the embodiment of the invention, the voice broadcast function of the presentation interaction module uses a loudspeaker to play the avatar's voice reply dialogue audio to the user in real time.
Step 8, collecting the user's evaluation information and optimizing the three-dimensional avatar's dialogue, actions, and the like.
In the embodiment of the invention, a user feedback module is designed at the Web client of the intelligent cloud wireless camera IoT platform. After use, the user can evaluate and score the session on the IoT platform; the server collects the evaluation information, classifies it by keyword, optimizes the three-dimensional avatar, dialogue, actions, interaction, and projection according to the different evaluations, and updates and iterates the three-dimensional avatar model, dialogue timbre, and dialogue feature data in the three-dimensional virtualized-object database backend.
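A minimal sketch of the keyword classification used to route each piece of feedback to the component it concerns is shown below; the keyword lists are illustrative assumptions.

```python
# Sketch: route user evaluations to system components by keyword.
FEEDBACK_ROUTES = {
    "avatar":     ["appearance", "face", "hair", "clothes", "model"],
    "dialogue":   ["reply", "answer", "conversation", "topic"],
    "voice":      ["tone", "timbre", "speech", "sound"],
    "action":     ["gesture", "movement", "lip", "expression"],
    "projection": ["display", "hologram", "blurry", "brightness"],
}

def route_feedback(comment: str) -> list[str]:
    text = comment.lower()
    hits = [component for component, words in FEEDBACK_ROUTES.items()
            if any(w in text for w in words)]
    return hits or ["general"]

print(route_feedback("The lip movement lags behind the sound"))
# -> ['voice', 'action']
```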
The three-dimensional avatar creation of the present invention will be described in further detail with reference to fig. 3.
In the first step, the user's request information about the target person is received.
In the embodiment of the invention, the request information is received using the open-source intelligent cloud wireless camera IoT platform and comprises the target person's photos, videos, and dialogue audio collected by the presentation interaction module. The platform comprises a wireless network camera module, a microphone module, an intelligent cloud gateway, and a Web client. The wireless network camera and microphone collect the target person's actual characteristic information data such as photos, videos, and dialogue audio: 100 photos of the target person are taken, collected at 15-degree intervals over a range of [0, 720] degrees; 40 video segments of 10-15 seconds each are recorded, with shooting angles covering a 360-degree view of the body and covering the target person's normal limb movements and the facial expressions corresponding to different emotions; and 400 dialogue audio clips of 10-15 seconds each are recorded, expressing four different emotions. The Web client of the platform receives the user's modeling requirements and voice interaction information for the target person.
In the second step, the model generation module performs personalized three-dimensional avatar modeling of the target person according to the modeling requirements and actual characteristic information data in the user request information.
In the embodiment of the invention, the model generation module obtains the target person's photos collected on the IoT platform and realizes personalized three-dimensional avatar modeling according to the modeling requirements in the user request information by the following two methods.
The first modeling method is to perform three-dimensional avatar modeling with character 3D modeling software.
In the embodiment of the invention, the model generation module selects the target person's photos from the intelligent cloud wireless camera IoT platform, including at least four types: front, left, right, and back views. The system performs preliminary three-dimensional modeling from the front photo with the HEADSHOT plug-in of Character Creator v3.4, adds the target person's individual characteristic data to the universal parameterized human body model SMPL, performs detail-depth complementary modeling of the three-dimensional virtualized object with the remaining non-frontal photos and facial expression data, adjusts the modeled physical characteristics of the person such as clothes, trousers, hair, limbs, and facial features, realizes personalized three-dimensional avatar modeling of the target person according to the user's modeling requirements and selected scene, and finally renders and exports the three-dimensional avatar. The avatar is then imported into 3ds Max, where the constructed three-dimensional avatar is rigged with a skeleton so it can be driven.
The second modeling method is to perform three-dimensional avatar modeling on the cloud server with an open-source three-dimensional character reconstruction algorithm.
In the embodiment of the invention, the model generation module performs deep learning analysis of the 100 target person photos taken by the wireless camera using the open-source NeRF (neural radiance field) algorithm on the server. The NeRF algorithm works from images of the person captured at different surrounding angles: it computes the camera's three-dimensional spatial coordinates at each acquisition angle, feeds the acquired image sequence and the corresponding coordinates into NeRF, and synthesizes multiple new viewing angles; the reconstructed model is richer in detail, currently the best result for synthesizing views of complex scenes. The NeRF algorithm therefore performs three-dimensional character modeling from the acquired image sequences and their corresponding three-dimensional spatial coordinates, finally rendering and exporting the three-dimensional virtualized object. The completed virtualized object is imported into 3ds Max, where it is rigged with a skeleton so it can be driven.
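For orientation, the sketch below shows the positional encoding at the core of NeRF, which lifts each 3D sample point on a camera ray into a high-frequency input for the radiance MLP; the MLP and volume rendering are only referenced in comments, and the dimensions are the standard published choices rather than anything specified in the patent.

```python
# Sketch of NeRF's positional encoding:
# gamma(p) = (sin(2^k * pi * p), cos(2^k * pi * p)) for k = 0..L-1.
# The encoded point feeds an MLP predicting density sigma and color.
import numpy as np

def positional_encoding(p: np.ndarray, num_freqs: int = 10) -> np.ndarray:
    out = []
    for k in range(num_freqs):
        out.append(np.sin((2.0 ** k) * np.pi * p))
        out.append(np.cos((2.0 ** k) * np.pi * p))
    return np.concatenate(out, axis=-1)

# A 3D sample point on a camera ray -> 60-dim encoded MLP input
# (3 coordinates x 2 functions x 10 frequencies).
point = np.array([0.1, -0.4, 0.7])
encoded = positional_encoding(point)
print(encoded.shape)  # (60,)
```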
In the third step, it is judged whether the user's requirements are met.
In the embodiment of the invention, if the generated model meets the user's requirements, the fourth step is executed; otherwise, the process returns to the first step.
In the fourth step, the avatar of the target person is generated.
The three-dimensional virtualized object database of the invention is described in further detail below with reference to fig. 4.
In the first step, the user's request information about the target person is received.
In the embodiment of the invention, the request information is received using the open-source intelligent cloud wireless camera IoT platform and comprises the target person's dialogue audio collected by the presentation interaction module. The platform comprises a wireless network camera module, a microphone module, an intelligent cloud gateway, and a Web client; the wireless network camera and microphone collect the dialogue audio, recording 400 clips of 10-15 seconds each, expressing four different emotions: calm, angry, sad, and happy.
In the second step, the dialogue audio is analyzed and the three-dimensional avatar is generated.
The embodiment of the invention analyzes the target person's dialogue audio and extracts the target person's dialogue timbre and dialogue feature data.
The embodiment of the invention generates the three-dimensional avatar from the target person's photos and videos.
In the third step, it is judged whether the three-dimensional avatar, voice dialogue timbre, and dialogue feature data meet the user's requirements for the target person; if so, the fourth step is executed, otherwise the process returns to the first step.
In the fourth step, a three-dimensional virtualized-object database is created, and the three-dimensional avatar data together with the voice dialogue timbre and dialogue feature data are stored in it under their codes.
In the embodiment of the invention, mySQL software is deployed on a server to construct a three-dimensional virtualized object database, the three-dimensional virtualized object database comprises a data file and a log file, the data file comprises an encoded three-dimensional virtual voice image file, a voice conversation tone and a conversation characteristic data file, and the log file comprises information required for recovering all transactions in the database.
The three-dimensional virtualized object database function in the embodiment of the invention comprises rapid data storage, rapid model retrieval and visual model checking. The three-dimensional virtualized object database carries out coding naming on three-dimensional virtualized image data, voice conversation tone and conversation characteristic data of the same target person by using the same data name to quickly put in storage, all user names can only use English letters, numbers and underlines, the underlines are used as separating characters among words, the user names of three-dimensional virtualized image files and voice conversation data files of different target persons are not repeatable, wherein the three-dimensional virtualized image data are stored in a file form, metadata and spatial indexes are put in storage, and the conversation tone and conversation characteristic data are stored in storage by using FTP. The three-dimensional virtual object database can be directly positioned and browsed on the three-dimensional virtual image model, and the visualization of enlargement, reduction and selection is truly realized.
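The following is a minimal sketch of such a database layout using mysql-connector-python; the table, columns, and sample paths are assumptions that merely follow the naming rules stated above.

```python
# Sketch: a possible schema for the virtualized-object database,
# keyed by the shared coded name described in the text.
import mysql.connector

conn = mysql.connector.connect(host="localhost", user="dbuser",
                               password="dbpass", database="avatar_db")
cur = conn.cursor()
cur.execute("""
    CREATE TABLE IF NOT EXISTS virtual_object (
        object_code   VARCHAR(64) PRIMARY KEY,  -- shared coded name
        model_path    VARCHAR(255) NOT NULL,    -- 3D avatar file on disk
        timbre_path   VARCHAR(255) NOT NULL,    -- dialogue timbre (FTP)
        feature_path  VARCHAR(255) NOT NULL,    -- dialogue features (FTP)
        created_at    TIMESTAMP DEFAULT CURRENT_TIMESTAMP
    )
""")
cur.execute(
    "INSERT INTO virtual_object "
    "(object_code, model_path, timbre_path, feature_path) "
    "VALUES (%s, %s, %s, %s)",
    ("target_person_01", "/models/target_person_01.fbx",
     "ftp://server/timbre/target_person_01.bin",
     "ftp://server/feature/target_person_01.json"),
)
conn.commit()
cur.close()
conn.close()
```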

Claims (10)

1. A personalized three-dimensional digital human holographic interaction forming system, comprising a voice recognition module, an action generation module, and a presentation interaction module, characterized by further comprising a model generation module, a dialogue generation module, and a voice synthesis module; wherein:
the model generation module is used for performing personalized three-dimensional avatar modeling of the target person according to the modeling requirements and actual characteristic information data in the user request information;
the voice recognition module is used for performing emotion recognition on the user according to the voice interaction information in the user request information, converting the voice interaction information into corresponding text, and sending the text to the dialogue generation module;
the dialogue generation module is used for simulating the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, generating the avatar's dialogue interaction text, and sending it to the voice synthesis module;
the voice synthesis module is used for synthesizing voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text, and sending the audio to the action generation module;
the action generation module is used for generating a lip-synchronized three-dimensional avatar posture model according to the voice reply dialogue audio and sending it to the presentation interaction module;
the presentation interaction module is used for receiving the user's request information about the target person, presenting the lip-synchronized three-dimensional avatar posture through the terminal device, and interacting with the user by voice.
2. A personalized three-dimensional digital human holographic interaction forming method using the system according to claim 1, characterized in that personalized three-dimensional avatar modeling is performed; the target person's dialogue audio in the database is used to extract the target person's dialogue timbre and dialogue feature data; and voice reply dialogue audio with the target person's unique timbre and speaking style is synthesized for projection interaction with the user; the interaction forming method comprises the following specific steps:
step 1, receiving the user's request information about the target person;
step 2, the model generation module performs personalized three-dimensional avatar modeling of the target person according to the modeling requirements and actual characteristic information data in the user request information;
step 3, the voice recognition module performs emotion recognition on the user according to the voice interaction information in the user request information and converts the voice interaction information into corresponding text;
step 4, the dialogue generation module simulates the dialogue scene between the avatar and the user according to the user's emotion and the meaning of the text content, and generates the avatar's dialogue interaction text;
step 5, the voice synthesis module synthesizes voice reply dialogue audio with the target person's unique timbre and speaking style according to the avatar's dialogue timbre and dialogue feature data and the dialogue interaction text;
step 6, the action generation module generates a lip-synchronized three-dimensional avatar posture model according to the voice reply dialogue audio;
step 7, the presentation interaction module presents the lip-synchronized three-dimensional avatar posture through the terminal device and interacts with the user by voice.
3. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that the request information in step 1 comprises the user's modeling requirements for the target person, the voice interaction information, and the collected photos, videos, and dialogue audio of the target person received by the receiving module; four kinds of individual characteristic data, namely appearance data, lip movement data, facial expression data, and action behavior data, are extracted from the collected photos and videos of the target person, while the voice dialogue timbre and dialogue feature data are extracted from the target person's dialogue audio.
4. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that the personalized modeling in step 2 means that the model generation module adds the target person's individual characteristic data to the universal parameterized human body model SMPL according to the appearance data, facial expression data, and action behavior data among the four kinds of individual characteristic data, and realizes personalized three-dimensional avatar modeling of the target person according to the user's modeling requirements and selected scene.
5. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that the emotion recognition in step 3 means that the voice recognition module determines the user's emotional state by extracting the linguistic and acoustic features of the voice interaction information, where the linguistic features are the verbal information the voice interaction conveys, and the acoustic features include the mood, intonation, and emotional color in the user's voice interaction information.
6. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that converting the voice interaction information into corresponding text in step 3 means that the voice recognition module maps the speech signal of the voice interaction information to a text sequence, where the text sequence is a set of textual representations composed of characters, words, or symbols arranged in linear left-to-right order, representing the information contained in the voice interaction information.
7. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that the avatar's dialogue interaction text in step 4 is generated by the following steps:
in the first step, a large-scale corpus, QuAC, is adopted for training the dialogue model, consisting of approximately 14K crowd-sourced question-answer dialogues with a total of 98K question-answer pairs;
in the second step, a dialogue model capable of conversational question answering, T5-base, is adopted, which can capture the statistical characteristics, grammar rules, semantic information, and embedded emotional information of the language;
in the third step, the T5-base dialogue model is trained in a supervised manner on the QuAC data set;
in the fourth step, the dialogue model is fine-tuned with labeled dialogue data so that it adapts to the corresponding dialogue task, and model performance can be further optimized by adjusting the amount of fine-tuning data, the number of fine-tuning passes, and so on;
in the fifth step, a dialogue manager is designed based on natural language processing (NLP), responsible for maintaining the context of the dialogue and ensuring that the dialogue model understands the user's intent and accordingly generates the avatar's dialogue interaction text;
in the sixth step, the trained dialogue model turns the user's input into the avatar's dialogue interaction text matched with the maximum probability;
in the seventh step, the generated dialogue interaction text of the avatar is post-processed to obtain the final dialogue interaction text.
8. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that synthesizing voice dialogue audio with the target person's unique timbre and speaking style in step 5 means that the voice synthesis module trains, from collected samples of the target person's spoken dialogue, a voiceprint synthesis model that can capture the target person's voiceprint characteristics; uses a text-to-speech synthesis model to convert input text into speech; and finally fine-tunes the text-to-speech model parameters with the voiceprint characteristics obtained from the voiceprint synthesis model, synthesizing voice reply dialogue audio with the target person's unique timbre and speaking style, where the voiceprint characteristics comprise at least five basic features of the dialogue audio signal: spectrum, cepstrum, formants, pitch, and reflection coefficients.
9. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that generating the lip-synchronized three-dimensional avatar model in step 6 means analyzing the action behavior data among the four kinds of individual characteristic data to generate actions corresponding to the voice reply dialogue audio signal, and analyzing the lip movement data and facial expression data with a lip synchronization model to realize high-precision lip synchronization, obtaining the lip-synchronized three-dimensional avatar model.
10. The personalized three-dimensional digital human holographic interaction forming method according to claim 2, characterized in that in step 7 the terminal device presents the lip-synchronized three-dimensional avatar posture with holographic projection technology and voice-broadcasts the avatar's reply dialogue audio to the user, where the three-dimensional avatar posture comprises at least one of the avatar's facial expression, limb movement, mouth-lip movement, and head movement.
CN202311455785.7A (priority date 2023-11-03, filing date 2023-11-03): Personalized three-dimensional digital human holographic interaction forming system and method. Status: Pending. Publication: CN117523088A (en).

Priority Applications (1)

Application Number: CN202311455785.7A
Priority Date / Filing Date: 2023-11-03
Title: Personalized three-dimensional digital human holographic interaction forming system and method

Applications Claiming Priority (1)

Application Number: CN202311455785.7A
Priority Date / Filing Date: 2023-11-03
Title: Personalized three-dimensional digital human holographic interaction forming system and method

Publications (1)

Publication Number: CN117523088A
Publication Date: 2024-02-06

Family

ID=89741075

Family Applications (1)

Application Number: CN202311455785.7A (pending)
Priority Date / Filing Date: 2023-11-03
Title: Personalized three-dimensional digital human holographic interaction forming system and method

Country Status (1)

Country: CN, publication CN117523088A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117727303A (en) * 2024-02-08 2024-03-19 翌东寰球(深圳)数字科技有限公司 Audio and video generation method, device, equipment and storage medium
CN117808945A (en) * 2024-03-01 2024-04-02 北京烽火万家科技有限公司 Digital person generation system based on large-scale pre-training language model



Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination