CN114760425A - Digital human generation method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN114760425A
Authority
CN
China
Prior art keywords
cache
video frame
content
digital
text content
Prior art date
Legal status
Pending
Application number
CN202210285154.4A
Other languages
Chinese (zh)
Inventor
左佳伟
朱海涛
王林芳
石凡
张琪
张炜
申童
Current Assignee
Jingdong Technology Information Technology Co Ltd
Original Assignee
Jingdong Technology Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Jingdong Technology Information Technology Co Ltd filed Critical Jingdong Technology Information Technology Co Ltd
Priority to CN202210285154.4A
Publication of CN114760425A

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • H04N5/265Mixing
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/02Methods for producing synthetic speech; Speech synthesisers
    • G10L13/04Details of speech synthesis systems, e.g. synthesiser structure or memory management
    • G10L13/047Architecture of speech synthesisers
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L13/00Speech synthesis; Text to speech systems
    • G10L13/08Text analysis or generation of parameters for speech synthesis out of text, e.g. grapheme to phoneme translation, prosody generation or stress or intonation determination
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00Speech recognition
    • G10L15/22Procedures used during a speech recognition process, e.g. man-machine dialogue
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N21/00Selective content distribution, e.g. interactive television or video on demand [VOD]
    • H04N21/80Generation or processing of content or additional data by content creator independently of the distribution process; Content per se
    • H04N21/85Assembly of content; Generation of multimedia applications
    • H04N21/854Content authoring
    • H04N21/8547Content authoring involving timestamps for synchronizing content
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04NPICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N5/00Details of television systems
    • H04N5/222Studio circuitry; Studio devices; Studio equipment
    • H04N5/262Studio circuits, e.g. for mixing, switching-over, change of character of image, other special effects ; Cameras specially adapted for the electronic generation of special effects
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/06Transformation of speech into a non-audible representation, e.g. speech visualisation or speech processing for tactile aids
    • G10L21/10Transforming into visible information
    • G10L2021/105Synthesis of the lips movements from speech, e.g. for talking heads

Abstract

The application discloses a digital person generation method and apparatus, a computer device, and a storage medium, relating to the field of human-computer interaction. The scheme is as follows: in response to a received interaction request, determining the user's interactive text content from the request; determining the digital person's interactive feedback text content from the interactive text content; in response to the interactive feedback text content existing in a cache information database, acquiring a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content; performing audio-video synthesis on the digital person's pre-recorded silent respiratory state video frames, the first cache audio packet, and the first cache video frame to obtain the digital person's synthesized audio-video data; and playing the synthesized audio-video data on the human-computer interaction interface to respond to the interaction request. The method and apparatus reduce the computing resources otherwise wasted on synthesizing audio and video in real time for fixed scripts, and improve the real-time performance of digital person interaction to a certain extent.

Description

Digital human generation method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technology, further to the field of human-computer interaction technology, and in particular to a digital person generation method and apparatus, a computer device, and a storage medium.
Background
A digital person is a character model in a realistic or cartoon style generated by means of computer vision or computer graphics. A user can interact with the digital person through voice, text, and other forms; an algorithm drives changes in the digital person's facial expressions, mouth shapes, and body movements, which are matched with sound so that the digital person interacts with the user and gives responses. Digital persons are now widely used in government affairs, finance, scenic spots, e-commerce, and other scenarios, for example providing guided explanations at scenic spots or customer consultation services on e-commerce websites.
Current digital persons fall into two types: the anthropomorphic cartoon style and the realistic (real-person) style. Cartoon-style digital persons are built with computer graphics techniques and, compared with realistic digital persons, offer fast rendering and a large adjustable parameter space. However, they require a great deal of design and modeling work up front, are costly, and are unsuited to serious delivery scenarios such as government affairs. Realistic digital persons are mainly implemented with computer vision techniques: based on deep learning, an algorithm model renders the digital person's face region in real time and fuses it with pre-recorded character footage, so that changes in the character's expression and mouth shape can be realized.
Real-time interaction is a basic function of a digital person system: after a user inputs text or voice, the digital person quickly gives a response (including playing sound and changing mouth shape, expression, and limbs). When a current deep-learning-based realistic digital person interacts in real time, GPU (graphics processing unit) resources are needed so that the algorithm model can render the digital person's face region in real time, and in addition a TTS (text-to-speech) model is called to generate the voice corresponding to the response text in real time. This interaction mode usually wastes computing resources.
Disclosure of Invention
The present application is directed to solving, at least to some extent, one of the technical problems in the related art.
Therefore, a first objective of the present application is to provide a digital person generation method: during digital person interaction, for interactive feedback text content that exists in a cache information database, the pre-cached audio packet and video frame corresponding to that content are acquired and synthesized with the corresponding silent respiratory state video frames for playback, which reduces the computing resources wasted on synthesizing audio and video in real time for fixed scripts and improves the real-time performance of digital person interaction to a certain extent.
A second object of the present application is to propose a digital person generating device.
A third object of the present application is to propose a computer device.
A fourth object of the present application is to propose a computer-readable storage medium.
In order to achieve the above object, an embodiment of a first aspect of the present application provides a digital person generation method, including:
responding to a received interaction request, and determining the interaction text content of a user according to the interaction request;
determining interactive feedback text content of the digital person according to the interactive text content;
Responding to the interactive feedback text content existing in a cache information database, and acquiring a first cache audio packet and a first cache video frame which are cached in advance and correspond to the interactive feedback text content;
carrying out audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain synthetic audio and video data of the digital person;
and playing the synthesized audio and video data of the digital person on a human-computer interaction interface so as to respond to the interaction request.
In some embodiments of the present application, the method for generating a digital person further comprises:
in response to the interactive feedback text content not existing in the cache information database, generating a target audio packet, phonemes, and timestamps of the interactive feedback text content;
generating corresponding expression parameters according to the phonemes and the time stamps, and generating a facial region structure according to the expression parameters and specific parameters of the digital human face;
generating a facial region image of the digital person from the facial region structure;
combining the face area image with a pre-generated face area mask, and fusing the face area image with character materials in the silent respiratory state video frame to obtain a synthesized digital human video frame;
and synthesizing the target audio packet and the digital human video frame to obtain the synthesized audio and video data of the digital human.
In some embodiments of the present application, the digital human generation method further comprises:
in response to the interactive feedback text content belonging to a high-frequency script, caching the interactive feedback text content;
for the interactive feedback text content, generating a face region cache image sequence corresponding to each cache insertion point according to a plurality of cache insertion points on the silent respiratory state video frames, where adjacent cache insertion points are separated by a preset number of frames;
caching the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point, where the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
In some embodiments of the present application, the first cache video frame is a face region cache image; the performing audio and video synthesis on the pre-recorded silent respiratory state video frames of the digital person, the first cache audio packet, and the first cache video frame to obtain the synthesized audio-video data of the digital person includes the following steps:
synthesizing the face region cache image with a corresponding frame in the silent respiratory state video frames to obtain a synthesized video frame;
and aligning the time stamps of the synthesized video frame and the first cache audio packet, and encoding an audio and video packet queue obtained after aligning the time stamps to obtain the synthesized audio and video data of the digital person.
In some embodiments of the present application, obtaining a cached video frame corresponding to the interactive feedback text content, which is cached in advance, includes:
determining a time at which the interactive request is received;
determining a target cache insertion point corresponding to the time from the plurality of cache insertion points on the silent respiratory state video frames;
obtaining a corresponding cache video frame of the interactive feedback text content according to the target cache insertion point;
the synthesizing the face region cache image and a corresponding frame in the silent respiratory state video frames to obtain a synthesized video frame includes:
determining a corresponding target video frame from the silent respiratory state video frames according to the target cache insertion point;
and merging the face region cache image and the target video frame to obtain a composite video frame.
In some embodiments of the present application, the acquiring, in response to the interactive feedback text content existing in a cache information database, a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content includes:
determining that the interactive feedback text content comprises fixed script content and random script content;
in response to the fixed script content in the interactive feedback text content existing in the cache information database, acquiring a pre-cached second cache audio packet and second cache video frame corresponding to the fixed script content;
generating a target audio packet, phonemes, and timestamps of the random script content in the interactive feedback text content;
according to the phonemes and the timestamps of the random script content, rendering and generating digital person random content frames of the random script content;
wherein the performing audio and video synthesis on the pre-recorded silent respiratory state video frames of the digital person, the first cache audio packet, and the first cache video frame to obtain synthesized audio-video data of the digital person includes:
synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person.
In some embodiments of the present application, the synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person includes:
generating a transition frame between the fixed script content and the random script content;
and synthesizing the transition frame, the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person.
In order to achieve the above object, a second aspect of the present application provides a digital human generating apparatus, including:
the first determining module is used for responding to the received interactive request and determining the interactive text content of the user according to the interactive request;
the second determining module is used for determining interactive feedback text content of the digital person according to the interactive text content;
the acquisition module is used for responding to the existence of the interactive feedback text content in a cache information database and acquiring a first cache audio packet and a first cache video frame which are cached in advance and correspond to the interactive feedback text content;
the first synthesis module is used for carrying out audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain synthesized audio and video data of the digital person;
and the response module is used for playing the synthetic audio and video data of the digital person on a human-computer interaction interface so as to respond to the interaction request.
In some embodiments of the present application, the digital person generating apparatus further comprises:
a first generating module, responsive to the interactive feedback text content not existing in the cache information database, for generating a target audio packet, a phoneme, and a timestamp of the interactive feedback text content;
the second generation module is used for generating corresponding expression parameters according to the phonemes and the time stamps and generating a facial region structure according to the expression parameters and specific parameters of the digital human face;
a third generating module for generating a facial region image of the digital person from the facial region structure;
the second synthesis module is used for combining the facial region image with a pre-generated facial region mask and fusing the facial region image with the figure materials in the silent respiratory state video frame to obtain a synthesized digital person video frame;
and the third synthesis module is used for synthesizing the target audio packet and the digital human video frame to obtain the synthesized audio and video data of the digital human.
In some embodiments of the present application, the digital person generating apparatus further comprises:
the first cache module is used for caching the interactive feedback text content in response to the interactive feedback text content belonging to a high-frequency script;
a fourth generating module, configured to generate, for the interactive feedback text content, a face region cache image sequence corresponding to each cache insertion point according to a plurality of cache insertion points on the silent respiratory state video frames, where adjacent cache insertion points are separated by a preset number of frames;
and the second cache module is used for caching the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point, where the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
In some embodiments of the present application, the first cache video frame is a face region cache image; the first synthesis module is specifically configured to:
synthesizing the face region cache image and a corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame;
and aligning the time stamps of the synthesized video frame and the first cache audio packet, and encoding an audio and video packet queue obtained after aligning the time stamps to obtain the synthesized audio and video data of the digital person.
In some embodiments of the present application, obtaining a cached video frame corresponding to the interactive feedback text content, which is cached in advance, includes:
determining a time at which the interactive request is received;
determining a target cache insertion point corresponding to the time from a plurality of cache insertion points on the silent respiratory state video frame;
obtaining a corresponding cache video frame of the interactive feedback text content according to the target cache insertion point;
the synthesizing the face region cache image and the corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame includes:
determining a corresponding target video frame from the silent respiratory state video frames according to the target cache insertion point;
and merging the face region cache image and the target video frame to obtain a composite video frame.
In some embodiments of the present application, the obtaining module is specifically configured to:
determining that the interactive feedback text content comprises fixed script content and random script content;
in response to the fixed script content in the interactive feedback text content existing in the cache information database, acquiring a pre-cached second cache audio packet and second cache video frame corresponding to the fixed script content;
generating a target audio packet, phonemes, and timestamps of the random script content in the interactive feedback text content;
according to the phonemes and the timestamps of the random script content, rendering and generating digital person random content frames of the random script content;
wherein the first synthesis module is specifically configured to:
and synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person.
In some embodiments of the present application, the synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person includes:
generating a transition frame between the fixed script content and the random script content;
and synthesizing the transition frame, the second cache audio packet, the second cache video frame, the target audio packet of the random script content, the digital person random content frames, and the silent respiratory state video frames of the digital person to obtain the synthesized audio-video data of the digital person.
To achieve the above object, a third aspect of the present application provides a computer device, including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the digital human generation method of the first aspect.
To achieve the above object, a fourth aspect of the present application provides a computer-readable storage medium storing computer instructions, where the computer instructions are configured to cause a computer to execute the digital person generation method of the first aspect.
According to the technical scheme of the application, for a fixed script whose interactive feedback text content exists in the cache information database, the pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content can be acquired directly, without real-time rendering. Audio-video synthesis is performed on the first cache audio packet, the first cache video frame, and the corresponding frames of the digital person's silent respiratory state video to obtain the digital person's synthesized audio-video data, which is played on the human-computer interaction interface to respond to the user's interaction request. This improves the real-time performance of digital person interaction and reduces the computing resources wasted on synthesizing audio and video in real time for fixed scripts.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present application, nor do they limit the scope of the present application. Other features of the present application will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not intended to limit the present application. Wherein:
fig. 1 is a schematic flow chart of a digital person generation method according to an embodiment of the present application;
fig. 2 is a schematic diagram of generating a cached video frame according to a second embodiment of the present application;
fig. 3 is a schematic flowchart of a digital person generating method according to a third embodiment of the present application;
fig. 4 is a schematic diagram of a digital person generation method provided in the fourth embodiment of the present application;
fig. 5 is a schematic diagram of combining a face region cache image with a target video frame to obtain a composite video frame according to a fifth embodiment of the present application;
fig. 6 is a schematic flowchart of a digital person generation method according to a sixth embodiment of the present application;
fig. 7 is a schematic flowchart of a process for generating a digital human video frame according to a seventh embodiment of the present application;
fig. 8 is an interaction diagram according to an eighth embodiment of the present application;
Fig. 9 is a schematic flowchart of a method for generating a digital person according to a ninth embodiment of the present application;
fig. 10 is a schematic diagram of a transition frame between fixed script content and random script content according to an embodiment of the present application;
fig. 11 is a block diagram illustrating a structure of a digital human generating apparatus according to an eleventh embodiment of the present application;
fig. 12 is a block diagram illustrating a digital human generating apparatus according to a twelfth embodiment of the present application;
fig. 13 is a block diagram illustrating a digital human generation apparatus according to a thirteenth embodiment of the present application;
fig. 14 is a block diagram of a computer device to implement the digital human generation method of the embodiment of the present application.
Detailed Description
Exemplary embodiments of the present application are described below with reference to the accompanying drawings; various details of the embodiments are included to aid understanding and should be considered exemplary only. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications can be made to the embodiments described herein without departing from the scope and spirit of the present application. Descriptions of well-known functions and constructions are omitted from the following description for clarity and conciseness.
In the technical scheme of the application, the acquisition, storage, use, processing, transmission, provision, disclosure, and other handling of the personal information of the users involved comply with relevant laws and regulations and do not violate public order and good customs.
The application provides a digital person generation method and apparatus, a computer device, and a storage medium, mainly applied to interaction scenarios of realistic digital persons. For a fixed script whose interactive feedback text content exists in the cache information database, the pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content can be acquired directly, without real-time rendering, which reduces the computing resources wasted on synthesizing audio and video in real time for fixed scripts and improves the real-time performance of digital person interaction to a certain extent. The digital person generation method and apparatus, computer device, and storage medium of embodiments of the present application are described below with reference to the accompanying drawings.
Fig. 1 is a schematic flowchart of a digital person generation method according to an embodiment of the present application. It should be noted that the digital person generation method of the embodiments of the present application can be applied to the digital person generation apparatus of the embodiments of the present application, and the apparatus can be configured on a computer device. As shown in fig. 1, the digital person generation method may include the following steps:
Step 101, in response to the received interactive request, determining the interactive text content of the user according to the interactive request.
It should be noted that the interactive text content may be text content input by the user, text content obtained by performing speech recognition on the user's voice input, the text content of an option selected by the user from multiple text options, and so on. The present application does not specifically limit this.
Step 102, determining the interactive feedback text content of the digital person according to the interactive text content.
As an example, the interactive feedback text content corresponding to the interactive text content may be determined through a dialog system. For example, if the user's interactive text content is the greeting "hello", the dialog system determines that the interactive feedback text content is "Hello, may I ask what I can help you with?".
Step 103, in response to the interactive feedback text content existing in the cache information database, acquiring a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content.
It should be noted that the interactive feedback text content cached in the cache information database may be preset fixed scripts, such as greetings, closing remarks, and certain guided replies.
Optionally, in some embodiments of the present application, a corresponding hash value may be generated for the interactive feedback text content, and whether the interactive feedback text content exists in the cache information database is determined by comparing the hash value of the text information in the cache information database with the hash value of the interactive feedback text content.
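As an illustration only, a minimal sketch of such a hash-keyed lookup follows; the embodiment specifies only that hash values are compared, so the SHA-256 digest, the in-memory dictionary, and the value layout (one audio packet plus a list of face region cache frames) are all assumptions.

    import hashlib
    from typing import Dict, List, Optional, Tuple

    # Assumed layout: digest of the feedback text -> (audio packet, cached frames).
    cache_db: Dict[str, Tuple[bytes, List[bytes]]] = {}

    def text_key(text: str) -> str:
        # A fixed-length digest keeps comparisons cheap and uniform.
        return hashlib.sha256(text.encode("utf-8")).hexdigest()

    def lookup(feedback_text: str) -> Optional[Tuple[bytes, List[bytes]]]:
        # None means the text is not a cached fixed script and must fall
        # back to the real-time rendering path.
        return cache_db.get(text_key(feedback_text))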
Optionally, in some embodiments of the present application, the first cache video frame may be a face region cache image, a head region cache image, or the like.
Step 104, performing audio-video synthesis on the pre-recorded silent respiratory state video frames of the digital person, the first cache audio packet, and the first cache video frame to obtain the synthesized audio-video data of the digital person.
As an example, the pre-recorded silent respiratory state video of the digital person may be a pre-recorded segment of video of a real-person model. The model in the video may perform small actions such as breathing, blinking, smiling, and tilting the head, which makes the video more vivid and natural and improves the user experience. The recorded video is processed into video frames that serve as the digital person's silent respiratory state video frames. When no user sends an interaction request, the silent respiratory state video frames and empty audio packets are played in a loop on the human-computer interaction interface. When a user sends an interaction request, audio-video synthesis is performed on the pre-recorded silent respiratory state video frames of the digital person, the first cache audio packet corresponding to the interactive feedback text content, and the first cache video frame to obtain the digital person's synthesized audio-video data.
Optionally, in some embodiments of the present application, to reduce cache consumption, the first cache video frame may be a face region cache image. The face region cache image is synthesized with a corresponding frame in the silent respiratory state video frames to obtain a synthesized video frame; the timestamps of the synthesized video frames and the first cache audio packet are then aligned, and the audio-video packet queue obtained after timestamp alignment is encoded to obtain the digital person's synthesized audio-video data, as the sketch below illustrates.
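A minimal sketch of the timestamp alignment described above, assuming 25 fps video and 20 ms audio packets (neither rate is specified in the embodiment); the sorted queue stands in for the audio-video packet queue handed to the encoder.

    from typing import List, Tuple

    def align_and_mux(video_frames: List[bytes], audio_packets: List[bytes],
                      fps: int = 25, audio_ms: int = 20) -> List[Tuple[str, int, bytes]]:
        """Interleave frames and packets by presentation timestamp (ms)."""
        queue: List[Tuple[str, int, bytes]] = []
        for i, frame in enumerate(video_frames):
            queue.append(("video", i * 1000 // fps, frame))
        for j, packet in enumerate(audio_packets):
            queue.append(("audio", j * audio_ms, packet))
        queue.sort(key=lambda item: item[1])  # ascending presentation time
        return queue  # this queue would then be encoded into the output stream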
Step 105, playing the synthesized audio-video data of the digital person on the human-computer interaction interface to respond to the interaction request.
It should be noted that the human-computer interaction interface may be that of an intelligent robot, of a mobile device, or of any other device with interaction capability, which is not specifically limited in this application.
It should also be noted that, in the embodiment of the present application, steps 101 to 105 may be performed by a client. For example, the client may be an intelligent robot, or may also be another computer device with a human-computer interaction interface. And is not particularly limited herein.
Furthermore, in other embodiments of the present application, steps 101-104 may be performed by the server and step 105 may be performed by the client. For example, the server sends the generated synthetic audio and video data of the digital person to a client (such as an intelligent robot) so that the intelligent robot plays the synthetic audio and video data of the digital person on a human-computer interaction interface.
According to the digital person generation method described above, for a fixed script whose interactive feedback text content exists in the cache information database, the pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content can be acquired directly, without real-time rendering. Audio-video synthesis is performed on the first cache audio packet, the first cache video frame, and the corresponding frames of the digital person's silent respiratory state video to obtain the digital person's synthesized audio-video data, which is played on the human-computer interaction interface to respond to the user's interaction request, improving the real-time performance of digital person interaction and reducing the computing resources wasted on synthesizing audio and video in real time for fixed scripts.
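For illustration, a minimal sketch of the cache-hit fast path of steps 101 to 105 follows; every callable is a hypothetical stand-in for a component (speech recognition, dialog system, cache lookup, synthesizer, player) that the embodiment names only abstractly.

    from typing import Callable, List, Optional, Tuple

    def handle_interaction(
        request: str,
        to_text: Callable[[str], str],       # step 101: request -> interactive text
        dialog: Callable[[str], str],        # step 102: text -> feedback text
        lookup: Callable[[str], Optional[Tuple[bytes, List[bytes]]]],    # step 103
        synthesize: Callable[[List[bytes], bytes, List[bytes]], bytes],  # step 104
        play: Callable[[bytes], None],       # step 105
        breathing_frames: List[bytes],
    ) -> bool:
        """Returns False on a cache miss so the caller can fall back to
        the real-time rendering path (steps 611-615)."""
        feedback_text = dialog(to_text(request))
        cached = lookup(feedback_text)
        if cached is None:
            return False
        audio_packet, face_frames = cached
        play(synthesize(breathing_frames, audio_packet, face_frames))
        return True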
It should be noted that when no user sends an interaction request, the silent respiratory state video frames play in a loop on the human-computer interaction interface, and the user may interact with the digital person at any time; that is, the digital person may switch from the non-interactive state to the interactive state at any moment. When the interactive feedback text content exists in the cache information database, the pre-cached first cache audio packet and first cache video frame corresponding to it are acquired according to the interactive feedback text content. For the digital person to switch naturally and smoothly from the non-interactive state to the interactive state (that is, for the synthesized video frame obtained by combining the first cache video frame with the corresponding silent respiratory state frame at the moment of interaction to connect seamlessly with the silent respiratory state video of the non-interactive state), a cached video frame sequence corresponding to the interactive feedback text content would need to be generated for every frame of the digital person's silent respiratory state video. If a cached video frame were generated for only one particular frame of the silent respiratory state video, then no matter at which frame the interaction occurred, the same cached sequence would have to be synthesized and played, making the switch from the non-interactive state to the interactive state inconsistent and jerky and hurting the user experience.
To illustrate this problem, suppose the first and second frame images of the silent respiratory state video show the digital person smiling, with the corners of the mouth raised. If the user sends an interaction request at the moment of the first frame, the digital person begins interacting at the second frame; that is, the second frame image is synthesized with the first cache video frame corresponding to the interactive feedback text content in the cache information database, and the resulting synthesized video frame is played on the human-computer interaction interface. Suppose, however, that the cached first cache video frame was generated (from the interactive feedback text content, the face region structure, and the face-specific parameters) based on a silent respiratory state frame in which the digital person is not smiling. That cached frame shows the digital person mouthing the interactive feedback text with a non-smiling expression. When the digital person switches from the non-interactive state to the interactive state, the effect presented to the user is as follows: the interface plays the first frame image, in which the digital person smiles; it then switches to the interactive state, and the next frame played is the cached frame in which the digital person, without a smile, performs the mouth shapes corresponding to the interactive feedback text content. The digital person thus goes from a smiling silent state to a suddenly unsmiling interactive state, so the switch from the silent state to the interactive state appears unnatural and jerky.
For the digital person to switch naturally and smoothly from the silent state to the interactive state at any moment, a cached video frame corresponding to the interactive feedback text content would need to be generated for every frame of the digital person's silent respiratory state video. Suppose the silent respiratory state video has 200 frames and the interactive feedback text content is "hello": a cached video frame corresponding to "hello" would need to be generated for all 200 frames, and this caching approach consumes memory heavily.
To solve this problem, in some embodiments of the present application, a plurality of cache insertion points may be set in the digital person's silent respiratory state video when generating the first cache video frames corresponding to the interactive feedback text content, with adjacent cache insertion points separated by a preset number of frames. A cached video frame is then generated for each cache insertion point for the interactive feedback text content.
As an example, fig. 2 is a schematic diagram of generating cached video frames according to a second embodiment of the present application. As shown in fig. 2, a plurality of cache insertion points 201, 211, 221, 231, and 241 are set on the silent respiratory state video frames. The insertion points may be spaced a preset number of frames apart, for example every ten frames. For the same interactive feedback text content, a cached video frame is generated for each cache insertion point; that is, taking each cache insertion point shown in fig. 2 as a starting frame, a cached video frame corresponding to the interactive feedback text content is generated and cached. In addition, the cache audio packet corresponding to the interactive feedback text content is cached, and the cached video frames corresponding to the cache insertion points share this one cache audio packet, as the sketch below illustrates.
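A minimal sketch of this pre-caching step; the ten-frame stride matches the figure, while the rendering and TTS callables are hypothetical stand-ins.

    from typing import Callable, Dict, List, Tuple

    STRIDE = 10  # preset number of frames between cache insertion points

    def precache_script(
        text: str,
        n_breathing_frames: int,
        tts: Callable[[str], bytes],
        render_faces: Callable[[str, int], List[bytes]],
    ) -> Tuple[bytes, Dict[int, List[bytes]]]:
        # One audio packet is shared by every insertion point's image sequence.
        audio_packet = tts(text)
        sequences: Dict[int, List[bytes]] = {}
        for start in range(0, n_breathing_frames, STRIDE):
            # Each sequence starts from the breathing-state pose at this
            # insertion point so the switch into interaction joins smoothly.
            sequences[start] = render_faces(text, start)
        return audio_packet, sequences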
Based on this way of caching the first cache video frame, an embodiment of the present application provides a digital person generation method in which, after the user sends an interaction request at some frame of the silent respiratory state video, the system waits a few frames until the cache insertion point nearest the request frame is reached, then acquires the first cache video frame of the interactive feedback text content corresponding to that insertion point and synthesizes and plays it with the corresponding frames of the silent respiratory state video and the cached audio packet. As an example, fig. 3 is a schematic flowchart of a digital person generation method according to a third embodiment of the present application, and fig. 4 is a schematic diagram of a digital person generation method according to a fourth embodiment of the present application; to save memory, the first cache video frame in this embodiment may be a face region cache image. As shown in fig. 3, the digital person generation method provided in the third embodiment may include the following steps:
Step 301, in response to the received interactive request, determining the interactive text content of the user according to the interactive request.
Step 302, determining the interactive feedback text content of the digital person according to the interactive text content.
Step 303, in response to the interactive feedback text content existing in the cache information database, acquiring a pre-cached face region cache image corresponding to the interactive feedback text content.
Step 304, determining the time at which the interaction request is received.
Taking the embodiment shown in fig. 4 as an example, the digital person receives the interaction request at the moment of video frame 447 in the silent respiratory state video.
Step 305, determining, from the plurality of cache insertion points on the silent respiratory state video frames, the target cache insertion point corresponding to the time at which the interaction request is received.
Taking the embodiment shown in fig. 4 as an example, among the cache insertion points 401, 411, 421, 431, and 441 on the silent respiratory state video frames, the insertion point closest to video frame 447 (at which the interaction request is received) is insertion point 401, reached after the silent respiratory state video loops back to its start; insertion point 401 is therefore determined as the target cache insertion point. The time corresponding to video frames 448, 449, and 450 is the waiting time, as the sketch below illustrates.
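A minimal sketch of selecting the target insertion point and the waiting frames, assuming 0-based frame indices and a looping breathing-state video; the stride of 10 again follows the figure.

    def next_insertion_point(request_frame: int, total_frames: int,
                             stride: int = 10) -> tuple:
        """Return (target insertion point, frames until it is reached),
        wrapping around because the breathing-state video plays in a loop."""
        offset = request_frame % stride
        wait = (stride - offset) % stride
        target = (request_frame + wait) % total_frames
        return target, wait

    # Frame 447 of fig. 4 is index 46 in the 50-frame loop 401-450:
    # next_insertion_point(46, 50) -> (0, 4), i.e. insertion point 401 is
    # reached four frames later, with frames 448-450 playing in between.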
Step 306, acquiring the face region cache image of the corresponding interactive feedback text content according to the target cache insertion point.
Taking the embodiment shown in fig. 4 as an example, the pre-cached face region cache image 460 of the interactive feedback text content corresponding to the target cache insertion point 401 is acquired according to the target cache insertion point 401.
Step 307, determining the corresponding target video frames from the silent respiratory state video frames according to the target cache insertion point.
Taking the embodiment shown in fig. 4 as an example, the corresponding target video frames 401-410 are determined from the silent respiratory state video frames according to the target cache insertion point 401.
Step 308, merging the face region cache image with the target video frames to obtain synthesized video frames.
Taking the embodiment shown in fig. 4 as an example, the face region cache image 460 is merged with target video frames 401-410 to obtain the synthesized video frames. Fig. 5 is a schematic diagram of merging a face region cache image with a target video frame to obtain a synthesized video frame according to a fifth embodiment of the present application; a minimal sketch of such merging follows.
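This sketch assumes the cached face image, a soft mask with weights in [0, 1], and the placement offsets are all available; the alpha-blend rule is an assumption, as the embodiment does not specify the fusion operator.

    import numpy as np

    def composite(frame: np.ndarray, face: np.ndarray, mask: np.ndarray,
                  top: int, left: int) -> np.ndarray:
        """Blend a cached face-region image into a full breathing-state frame."""
        out = frame.copy()
        h, w = face.shape[:2]
        region = out[top:top + h, left:left + w].astype(np.float32)
        alpha = mask[..., None].astype(np.float32)   # (h, w, 1) weights in [0, 1]
        blended = alpha * face.astype(np.float32) + (1.0 - alpha) * region
        out[top:top + h, left:left + w] = blended.astype(frame.dtype)
        return out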
Step 309, aligning the timestamps of the synthesized video frames and the first cache audio packet, and encoding the audio-video packet queue obtained after timestamp alignment to obtain the digital person's synthesized audio-video data.
Step 310, playing the synthesized audio-video data of the digital person on the human-computer interaction interface to respond to the interaction request.
It should be noted that after the audio-video data has been played on the human-computer interaction interface, playback returns to video frame 411 of the digital person's silent respiratory state video, the digital person switches back to the non-interactive state, and the silent respiratory state video continues to play.
With the digital person generation method described above, although the user waits briefly after sending the interaction request, the memory that would be needed to cache a video frame sequence for every frame of the digital person's silent respiratory state video is saved. In practice the waiting time is short (e.g., 400 ms) and is not noticeable to the user, so memory is saved without affecting the user's actual experience.
In some embodiments of the present application, for interactive feedback text content that does not exist in the cache information database, the digital person's face region image and audio packet need to be rendered in real time. As an example, fig. 6 is a schematic flowchart of a digital person generation method provided in a sixth embodiment of the present application. As shown in fig. 6, the method for generating a digital person according to the sixth embodiment of the present application may include the following steps:
step 601, responding to the received interactive request, and determining the interactive text content of the user according to the interactive request.
Step 602, determining the interactive feedback text content of the digital person according to the interactive text content.
Step 603, determining whether the interactive feedback text content exists in the cache information database. If the interactive feedback text content exists in the cache information database, executing step 604; if the interactive feedback text content does not exist in the cache information database, step 611 is executed.
Step 604, acquiring the pre-cached cache audio packet corresponding to the interactive feedback text content.
Step 605, determining the time at which the interaction request is received.
Step 606, determining, from the plurality of cache insertion points on the silent respiratory state video frames, the target cache insertion point corresponding to the time at which the interaction request is received.
Step 607, acquiring the face region cache image of the corresponding interactive feedback text content according to the target cache insertion point.
Step 608, determining the corresponding target video frames from the silent respiratory state video frames according to the target cache insertion point.
Step 609, merging the face region cache image with the target video frames to obtain synthesized video frames.
Step 610, aligning the timestamps of the synthesized video frames and the cached audio packets, and encoding the audio-video packet queue obtained after timestamp alignment to obtain the digital person's synthesized audio-video data.
Step 611, generating the target audio packet, phonemes, and timestamps of the interactive feedback text content.
Optionally, a TTS service may be invoked to generate target audio packets, phonemes, and timestamps for the interactive feedback text content based on the interactive feedback text content.
Step 612, generating corresponding expression parameters according to the phonemes and the time stamps, and generating a facial region structure according to the expression parameters and the specific parameters of the digital human face.
Step 613, generating a face region image of the digital person according to the face region structure.
Step 614, combining the face region image with a pre-generated face region mask and fusing it with the character material in the silent respiratory state video frames to obtain synthesized digital person video frames.
Step 615, synthesizing the target audio packet and the digital person video frames to obtain the digital person's synthesized audio-video data.
Step 616, playing the synthesized audio-video data of the digital person on the human-computer interaction interface to respond to the interaction request.
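For illustration, a minimal sketch of the cache-miss path of steps 611 to 615; every callable stands in for a model or service (TTS, expression prediction, face rendering, mask fusion, muxing) the embodiment names only abstractly, and rendering one frame per phoneme timestamp is a simplification of a real per-video-frame renderer.

    from typing import Callable, List, Tuple

    def render_realtime(
        feedback_text: str,
        breathing_frames: List[bytes],
        tts: Callable[[str], Tuple[bytes, List[str], List[int]]],   # step 611
        to_expression: Callable[[str, int], dict],                  # step 612
        render_face: Callable[[dict], bytes],                       # steps 612-613
        fuse: Callable[[bytes, bytes, bytes], bytes],               # step 614
        face_mask: bytes,
        mux: Callable[[bytes, List[bytes]], bytes],                 # step 615
    ) -> bytes:
        audio_packet, phonemes, timestamps = tts(feedback_text)
        video_frames: List[bytes] = []
        for phoneme, ts in zip(phonemes, timestamps):
            expression = to_expression(phoneme, ts)
            face_image = render_face(expression)
            base = breathing_frames[len(video_frames) % len(breathing_frames)]
            video_frames.append(fuse(face_image, face_mask, base))
        return mux(audio_packet, video_frames)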
It should be noted that for interactive feedback text content that does not exist in the cache information database, the digital person's face region image and audio packet must be rendered in real time. If certain interactive feedback text content is generated too many times, large amounts of identical interactive feedback text must be rendered repeatedly in real time, wasting GPU computing resources. To address this, interactive feedback text content that requires real-time rendering can be evaluated: if it is a high-frequency script, a face region cache image sequence corresponding to each cache insertion point on the silent respiratory state video frames is generated for it and cached.
Optionally, in some embodiments of the present application, in response to the interactive feedback text content belonging to a high-frequency script, the interactive feedback text content is cached; for the interactive feedback text content, a face region cache image sequence corresponding to each cache insertion point is generated according to the plurality of cache insertion points on the silent respiratory state video frames; and the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point are cached, where the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
It should be noted that the number of times the same interactive feedback text content absent from the cache information database is generated within a given period may be counted, and whether the content is a high-frequency script is determined by whether this count exceeds a preset threshold. If the count exceeds the threshold, the interactive feedback text content can be deemed a high-frequency script and needs to be cached in the cache information database.
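A minimal sketch of this promotion rule; the threshold value, the counter, and the precache callable are assumptions (the embodiment only requires counting generations within a period against a preset threshold).

    from collections import Counter
    from typing import Callable, Dict

    HIGH_FREQ_THRESHOLD = 5          # assumed value of the preset threshold
    render_counts: Counter = Counter()

    def maybe_promote(feedback_text: str,
                      precache: Callable[[str], tuple],
                      cache_db: Dict[str, tuple]) -> None:
        render_counts[feedback_text] += 1
        if (render_counts[feedback_text] > HIGH_FREQ_THRESHOLD
                and feedback_text not in cache_db):
            # Cache the shared audio packet plus the per-insertion-point
            # face region image sequences for this high-frequency script.
            cache_db[feedback_text] = precache(feedback_text)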
That is, for interactive feedback text content belonging to a high-frequency script, the face region cache image sequences corresponding to the plurality of cache insertion points on the silent respiratory state video frames, together with the target audio packet of the interactive feedback text content, may be generated. The next time the same interactive feedback text content needs to be fed back to the user, the pre-cached face region cache image sequence and target audio packet corresponding to it can be acquired directly from the cache information database, saving the computing resources that repeated rendering of high-frequency scripts would consume.
The process of steps 611 to 614 is illustrated in fig. 7, which is a schematic flowchart of generating a digital person video frame according to a seventh embodiment of the present application.
In the embodiments of the present application, steps 601 to 610 may be implemented in any of the manners described in the embodiments of the present application, which is not specifically limited or repeated here.
In order to better understand the digital human generation method proposed in the embodiment of the present application, fig. 8 is an interaction diagram provided in the eighth embodiment of the present application. As shown in fig. 8, a user issues an interaction request (S801), and determines the interactive text content of the user according to the interaction request; and determining the interactive feedback text content of the digital person according to the interactive text content, and judging whether the interactive feedback text content exists in the cache information database (S802). If the interactive feedback text content does not exist in the cache information database, generating a video frame and a voice packet corresponding to the interactive text content by adopting a real-time rendering generation mode (S803) (S804), carrying out audio-video synthesis on the silent respiratory state video frame of the digital person and the video frame and the voice packet corresponding to the interactive text content generated by real-time rendering to obtain synthetic audio-video data of the digital person (S805), and playing the audio-video data on a human-computer interaction interface. Judging whether the interactive feedback text content which does not exist in the cache information database belongs to a high-frequency telephone operation, if the interactive feedback text content which does not exist in the cache information database belongs to the high-frequency telephone operation, caching the interactive feedback text content into the cache information database, generating video frames and voice packets corresponding to each cache insertion point in the silent respiratory state video frames of the digital people, and caching the video frames and the voice packets into the cache information database (S806); if the interactive feedback text content exists in the cache information database, acquiring a cache video frame and a cache audio packet of the corresponding interactive feedback text content according to a cache insertion point corresponding to the interactive request time (S807), performing audio-video synthesis on the silent respiratory state video frame of the digital person, the acquired cache video frame and the cache voice packet corresponding to the interactive text content to obtain synthetic audio-video data of the digital person (S805), and playing the audio-video data on a human-computer interaction interface.
According to the digital human generation method, for fixed scripts whose interactive feedback text content exists in the cache information database, the pre-cached face region image and cached audio packet corresponding to the interactive feedback text content can be obtained directly without real-time rendering, which improves the real-time performance of digital human interaction and reduces the computing resources wasted by synthesizing audio and video for fixed scripts in a real-time generation mode. For interactive feedback text content that does not exist in the cache information database, the face region cache image and cache audio packet of the digital person are generated by real-time rendering. In addition, interactive feedback text content that does not exist in the cache information database is further judged to determine whether it belongs to a high-frequency script. If it does, an audio packet corresponding to the interactive feedback text content and a face region image corresponding to each cache insertion point are generated, and the interactive feedback text content, the corresponding audio packet and the face region images are cached, so that when the same interactive feedback text content needs to be fed back to the user next time, the pre-cached face region cache image sequence and target audio packet can be obtained directly from the cache information database, thereby saving the computing resources otherwise consumed by repeatedly rendering high-frequency scripts.
It should be noted that, in some embodiments of the present application, the text content that the digital person needs to broadcast is often a combination of fixed-script content and random content. For example, in a telecom customer service scene, when a user inquires about an account balance, all of the text content is a fixed script except for information such as the amount and the time. For instance, the digital person may need to broadcast "Dear user, your current phone balance is one thousand", where "Dear user, your current phone balance is" is the fixed script and "one thousand" is the random content. Rendering all of the broadcast text in real time would therefore waste computing resources.
Therefore, for interactive feedback text content containing both fixed-script content and random content, the embodiment of the present application further provides a digital human generation method, which uses pre-cached video frames for the fixed scripts existing in the cache information database and uses real-time rendering to generate the video frames corresponding to the random content. As an example, fig. 9 is a schematic flowchart of a digital human generation method according to a ninth embodiment of the present application. As shown in fig. 9, the digital human generation method provided in the ninth embodiment of the present application may further include the following steps:
Step 901, in response to the received interactive request, determining the interactive text content of the user according to the interactive request.
And step 902, determining the interactive feedback text content of the digital person according to the interactive text content.
Step 903, determining that the interactive feedback text content comprises fixed-script content and random content.
Step 904, in response to the fixed-script content in the interactive feedback text content existing in the cache information database, obtaining a pre-cached second cache audio packet and second cache video frame corresponding to the fixed-script content.
Step 905, generating a target audio packet, phonemes and timestamps of the random content in the interactive feedback text content.
Step 906, rendering and generating the digital human random content frame of the random content according to the phonemes and timestamps of the random content.
Step 907, synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
Optionally, in some embodiments of the present application, in order to make the mouth shape transition naturally and smoothly when switching between fixed-script content and random content, transition frames between the fixed-script content and the random content may be generated. The transition frames, the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person are then synthesized to obtain the synthesized audio and video data of the digital person.
As an example, fig. 10 is a schematic diagram of transition frames between fixed-script content and random content provided in the tenth embodiment of the present application. As shown in fig. 10, in the interactive feedback text "Dear customer, your current phone balance is one hundred and fifty yuan. Do you have any other questions?", the fixed-script content is "Dear customer, your current phone balance is" and "Do you have any other questions?", for which the pre-cached second cache video frames are obtained. The random content is "one hundred and fifty yuan", for which the digital human content frames need to be rendered in real time. In order to switch naturally and smoothly between fixed-script content and random content, the second cache video frames corresponding to the last one or two characters of the fixed-script content preceding the random content may be used as transition frames. In the embodiment of fig. 10, the transition frame of the word "is" may be input to the facial region rendering model as supervisory information, so that the second cache video frame of "is" appears visually continuous with the digital human content frame of the first word "one" of the random content. Likewise, switching directly to the second cache video frames of the fixed-script content immediately after the digital human content frames of the random content could also cause a visual discontinuity. Therefore, the first one or two characters of the fixed-script content following the random content are rendered in real time to generate corresponding transition frames. In the embodiment of fig. 10, the words "Do you" may be rendered in real time as digital human content frames before switching back to the second cache video frames of the fixed-script content "have any other questions", ensuring visual continuity.
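As an informal illustration of the boundary choice described above (the segmentation interface and the span of one or two characters are assumptions of this sketch, not mandated by the application):

```python
# Sketch: pick the transition spans around a random segment. The tail of the
# preceding fixed script supervises rendering for continuity; the head of the
# following fixed script is re-rendered before switching back to the cache.
def transition_plan(fixed_before: str, random_text: str, fixed_after: str,
                    span: int = 2):
    supervise = fixed_before[-span:]   # cached frames of these characters guide rendering
    rerender = fixed_after[:span]      # rendered in real time before reusing cached frames
    return supervise, random_text, rerender

# Example: plan = transition_plan("...balance is", "one hundred and fifty yuan",
#                                 "Do you have any other questions?")
```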
Step 908, playing the synthesized audio and video data of the digital person on the human-computer interaction interface to respond to the interaction request.
According to the digital human generation method, for interactive feedback text content comprising fixed-script content and random content, pre-cached video frames are used for the fixed scripts existing in the cache information database, and only the video frames corresponding to the random content are generated by real-time rendering. That is, such interactive feedback text content does not need to be rendered in its entirety in real time, which saves part of the computing resources.
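For illustration only, the assembly of mixed fixed/random feedback text can be sketched as below; the segment format and helper names are assumptions of this description:

```python
# Illustrative assembly of mixed feedback text: cached media for fixed-script
# segments, real-time rendering for random ones. cache_db maps fixed-script
# text to a pre-cached (audio, frames) pair.
def assemble(segments, cache_db, renderer):
    """segments: list of (kind, text) pairs, kind being "fixed" or "random"."""
    audio_parts, video_parts = [], []
    for kind, text in segments:
        cached = cache_db.get(text) if kind == "fixed" else None
        if cached is not None:                       # reuse pre-cached media
            audio, frames = cached
        else:                                        # random content: render live
            audio, frames = renderer.render(text)
        audio_parts.append(audio)
        video_parts.extend(frames)
    return b"".join(audio_parts), video_parts
```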
Fig. 11 is a block diagram of a digital human generation apparatus according to an eleventh embodiment of the present application. As shown in fig. 11, the digital human generating apparatus provided in this embodiment of the present application may include a first determining module 1101, a second determining module 1102, an obtaining module 1103, a first synthesizing module 1104, and a responding module 1105.
The first determining module 1101 is configured to, in response to the received interaction request, determine the interactive text content of the user according to the interaction request.
A second determining module 1102, configured to determine interactive feedback text content of the digital person according to the interactive text content.
The obtaining module 1103 is configured to, in response to the interactive feedback text content existing in the cache information database, obtain a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content.
The first synthesis module 1104 is configured to perform audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain the synthesized audio and video data of the digital person.
The response module 1105 is configured to play the synthesized audio and video data of the digital person on the human-computer interaction interface to respond to the interaction request.
Optionally, fig. 12 is a block diagram of a structure of a digital human generating apparatus according to a twelfth embodiment of the present application. As shown in fig. 12, the digital human generating apparatus provided in the embodiment of the present application may further include a first generating module 1206, a second generating module 1207, a third generating module 1208, a second synthesizing module 1209, and a third synthesizing module 1210.
The first generating module 1206 is configured to, in response to the interactive feedback text content not existing in the cache information database, generate a target audio packet, phonemes and timestamps of the interactive feedback text content.
The second generating module 1207 is configured to generate corresponding expression parameters according to the phonemes and the timestamps, and to generate a facial region structure according to the expression parameters and specific parameters of the digital human face.
A third generating module 1208 for generating a face region image of the digital person from the face region structure.
The second synthesizing module 1209 is configured to combine the face region image with a pre-generated face region mask and fuse it with the person material in the silent respiratory state video frame to obtain a synthesized digital human video frame (a minimal fusion sketch is given after this block).
And a third synthesizing module 1210, configured to synthesize the target audio packet and the digital person video frame to obtain synthesized audio and video data of the digital person.
Modules 1201 to 1205 in fig. 12 have the same functions and structures as modules 1101 to 1105 in fig. 11.
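The mask-based fusion performed by the second synthesizing module can be illustrated as below; this is a minimal sketch assuming the mask is a float alpha map in [0, 1] aligned with the face region, with numpy standing in for the unspecified image pipeline of the application:

```python
import numpy as np

def fuse_face(face: np.ndarray, mask: np.ndarray,
              background_frame: np.ndarray) -> np.ndarray:
    """Blend the rendered face region into the breathing-state frame."""
    # promote a 2-D mask to H x W x 1 so it broadcasts over the color channels
    alpha = mask[..., None] if mask.ndim == 2 else mask
    blended = alpha * face + (1.0 - alpha) * background_frame
    return blended.astype(background_frame.dtype)
```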
Optionally, fig. 13 is a block diagram of a digital human generating apparatus according to a thirteenth embodiment of the present application. As shown in fig. 13, the digital human generation apparatus provided in the embodiment of the present application may further include a first cache module 1311, a fourth generation module 1312, and a second cache module 1313.
The first caching module 1311 is configured to, in response to the interactive feedback text content belonging to a high-frequency script, cache the interactive feedback text content.
The fourth generating module 1312 is configured to generate, for the interactive feedback text content, a face region cache image sequence corresponding to each of a plurality of cache insertion points on the silent respiratory state video frame; every two adjacent cache insertion points are spaced apart by a preset number of frames.
The second caching module 1313 is configured to cache the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point; the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
Modules 1301 to 1310 in fig. 13 have the same functions and structures as modules 1201 to 1210 in fig. 12.
Optionally, in some embodiments of the present application, the first cache video frame is a face region cache image. The first synthesis module 1304 is specifically configured to: synthesize the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame; and align the timestamps of the synthesized video frame and the first cache audio packet, and encode the audio and video packet queue obtained after timestamp alignment to obtain the synthesized audio and video data of the digital person.
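A hedged sketch of the timestamp alignment step follows; the pairing rule (each video frame matched to the audio packet whose timestamp most recently started) and all names are illustrative assumptions, not the application's encoder interface:

```python
def align_av(video_frames, audio_packets):
    """video_frames: [(pts, frame)]; audio_packets: [(pts, packet)].
    Both lists are assumed non-empty and sorted by presentation timestamp."""
    queue, i = [], 0
    for pts, frame in video_frames:
        # advance to the audio packet covering this video timestamp
        while i + 1 < len(audio_packets) and audio_packets[i + 1][0] <= pts:
            i += 1
        queue.append((pts, frame, audio_packets[i][1]))
    return queue  # an encoder would consume this aligned audio-video queue
```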
Optionally, in some embodiments of the present application, obtaining the pre-cached video frame corresponding to the interactive feedback text content includes: determining the time at which the interaction request is received; determining a target cache insertion point corresponding to that time from the plurality of cache insertion points on the silent respiratory state video frame; and obtaining the corresponding cache video frame of the interactive feedback text content according to the target cache insertion point.
Synthesizing the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame includes: determining a corresponding target video frame from the silent respiratory state video frames according to the target cache insertion point; and merging the face region cache image with the target video frame to obtain the synthesized video frame.
Optionally, in some embodiments of the present application, the obtaining module 1303 is specifically configured to: determine that the interactive feedback text content comprises fixed-script content and random content; in response to the fixed-script content in the interactive feedback text content existing in the cache information database, obtain a pre-cached second cache audio packet and second cache video frame corresponding to the fixed-script content; generate a target audio packet, phonemes and timestamps of the random content in the interactive feedback text content; and render and generate the digital human random content frame of the random content according to the phonemes and timestamps of the random content.
The first synthesis module is specifically configured to synthesize the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
Optionally, in some embodiments of the present application, synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person includes: generating transition frames between the fixed-script content and the random content; and synthesizing the transition frames, the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
According to the digital human generation apparatus, for fixed scripts whose interactive feedback text content exists in the cache information database, the pre-cached face region image and cached audio packet corresponding to the interactive feedback text content can be obtained directly without real-time rendering, which improves the real-time performance of digital human interaction and reduces the computing resources wasted by synthesizing audio and video for fixed scripts in a real-time generation mode. For interactive feedback text content that does not exist in the cache information database, the face region cache image and cache audio packet of the digital person are generated by real-time rendering. In addition, interactive feedback text content that does not exist in the cache information database is further judged to determine whether it belongs to a high-frequency script. If it does, an audio packet corresponding to the interactive feedback text content and face region images corresponding to the cache insertion points are generated and cached together with the interactive feedback text content, so that when the same interactive feedback text content needs to be fed back to the user next time, the pre-cached face region cache image sequence and target audio packet can be obtained directly from the cache information database, thereby saving the computing resources otherwise consumed by repeatedly rendering high-frequency scripts.
Based on the embodiments of the present application, the present application further provides a computer device, comprising: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the digital human generation method of any one of the preceding embodiments.
It should be noted that, in the case that the digital human generation method provided by the present application is executed by a client, the computer device may be a device with an interaction function, such as an intelligent robot or a mobile device; in the case that the digital human generation method proposed in the present application is executed by a client and a server, the computer device may include a server and a device having an interactive function (e.g., an intelligent robot). As an example, when the computer device includes a server and a smart robot, the server transmits the generated synthetic audio and video data of the digital person to the smart robot so that the smart robot plays the synthetic audio and video data of the digital person on the human-computer interaction interface.
Based on the embodiments of the present application, the present application further provides a computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to execute the digital human generation method according to any one of the preceding embodiments.
Fig. 14 is a block diagram of a computer device to implement the digital human generation method of the embodiment of the present application. Computer devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the present application that are described and/or claimed herein.
As shown in fig. 14, the computer device 1400 includes a computing unit 1401, which can perform various appropriate actions and processes according to a computer program stored in a read-only memory (ROM) 1402 or a computer program loaded from a storage unit 1408 into a random access memory (RAM) 1403. Various programs and data required for the operation of the device 1400 can also be stored in the RAM 1403. The computing unit 1401, the ROM 1402 and the RAM 1403 are connected to each other via a bus 1404. An input/output (I/O) interface 1405 is also connected to the bus 1404.
A number of components in computer device 1400 are connected to I/O interface 1405, including: an input unit 1406 such as a keyboard, a mouse, or the like; an output unit 1407 such as various types of displays, speakers, and the like; a storage unit 1408 such as a magnetic disk, optical disk, or the like; and a communication unit 1409 such as a network card, a modem, a wireless communication transceiver, and the like. The communication unit 1409 allows the device 1400 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 1401 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 1401 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, and the like. The computing unit 1401 performs the respective methods and processes described above, such as the digital human generation method. For example, in some embodiments, the digital human generation method may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 1408. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 1400 via ROM 1402 and/or communication unit 1409. When the computer program is loaded into RAM 1403 and executed by the computing unit 1401, one or more steps of the digital person generation method described above may be performed. Alternatively, in other embodiments, the computing unit 1401 may be configured to perform the digital human generation method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present application may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), the internet, and blockchain networks.
The computer system may include clients and servers. A client and a server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the drawbacks of difficult management and weak service scalability in traditional physical hosts and VPS ("Virtual Private Server") services. The server may also be a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above-described embodiments are not intended to limit the scope of the present disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (16)

1. A method for digital human generation, comprising:
in response to a received interaction request, determining interactive text content of the user according to the interaction request;
determining interactive feedback text content of the digital person according to the interactive text content;
in response to the interactive feedback text content existing in a cache information database, acquiring a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content;
carrying out audio and video synthesis on a pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain synthesized audio and video data of the digital person;
and playing the synthesized audio and video data of the digital person on a human-computer interaction interface so as to respond to the interaction request.
2. The digital human generation method of claim 1, further comprising:
generating a target audio packet, phonemes and timestamps of the interactive feedback text content in response to the interactive feedback text content not existing in the cache information database;
generating corresponding expression parameters according to the phonemes and the time stamps, and generating a facial region structure according to the expression parameters and specific parameters of the digital human face;
generating a face region image of the digital person from the face region structure;
combining the face region image with a pre-generated face region mask, and fusing it with the person material in the silent respiratory state video frame to obtain a synthesized digital human video frame;
and synthesizing the target audio packet and the digital human video frame to obtain the synthesized audio and video data of the digital human.
3. The digital human generation method of claim 2, further comprising:
in response to the interactive feedback text content belonging to a high-frequency script, caching the interactive feedback text content;
generating, for the interactive feedback text content, a face region cache image sequence corresponding to each of a plurality of cache insertion points on the silent respiratory state video frame; wherein every two adjacent cache insertion points are spaced apart by a preset number of frames;
and caching the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point; wherein the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
4. The digital human generation method of claim 1, wherein the first cache video frame is a face region cache image; and performing audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain the synthesized audio and video data of the digital person comprises:
synthesizing the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame;
and aligning the timestamps of the synthesized video frame and the first cache audio packet, and encoding the audio and video packet queue obtained after timestamp alignment to obtain the synthesized audio and video data of the digital person.
5. The digital human generation method of claim 4, wherein obtaining the pre-cached video frame corresponding to the interactive feedback text content comprises:
determining the time at which the interaction request is received;
determining a target cache insertion point corresponding to the time from the plurality of cache insertion points on the silent respiratory state video frame;
and obtaining the corresponding cache video frame of the interactive feedback text content according to the target cache insertion point;
wherein synthesizing the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain the synthesized video frame comprises:
determining a corresponding target video frame from the silent respiratory state video frames according to the target cache insertion point;
and merging the face region cache image with the target video frame to obtain the synthesized video frame.
6. The digital human generation method of claim 1, wherein obtaining a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content in response to the interactive feedback text content existing in the cache information database comprises:
determining that the interactive feedback text content comprises fixed-script content and random content;
in response to the fixed-script content in the interactive feedback text content existing in the cache information database, obtaining a pre-cached second cache audio packet and second cache video frame corresponding to the fixed-script content;
generating a target audio packet, phonemes and timestamps of the random content in the interactive feedback text content;
and rendering and generating the digital human random content frame of the random content according to the phonemes and the timestamps of the random content;
wherein performing audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain the synthesized audio and video data of the digital person comprises:
synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
7. The digital human generation method of claim 6, wherein synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person comprises:
generating transition frames between the fixed-script content and the random content;
and synthesizing the transition frames, the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
8. A digital person generation apparatus, comprising:
the first determining module is used for, in response to a received interaction request, determining the interactive text content of the user according to the interaction request;
the second determining module is used for determining interactive feedback text content of the digital person according to the interactive text content;
the acquisition module is used for, in response to the interactive feedback text content existing in a cache information database, acquiring a pre-cached first cache audio packet and first cache video frame corresponding to the interactive feedback text content;
the first synthesis module is used for carrying out audio and video synthesis on the pre-recorded silent respiratory state video frame of the digital person, the first cache audio packet and the first cache video frame to obtain synthesized audio and video data of the digital person;
and the response module is used for playing the synthesized audio and video data of the digital person on a human-computer interaction interface to respond to the interaction request.
9. The digital human generation apparatus of claim 8, further comprising:
a first generating module, used for generating a target audio packet, phonemes and timestamps of the interactive feedback text content in response to the interactive feedback text content not existing in the cache information database;
a second generating module, used for generating corresponding expression parameters according to the phonemes and the timestamps, and generating a facial region structure according to the expression parameters and specific parameters of the digital human face;
a third generating module, used for generating a facial region image of the digital person from the facial region structure;
a second synthesis module, used for combining the facial region image with a pre-generated facial region mask and fusing it with the person material in the silent respiratory state video frame to obtain a synthesized digital human video frame;
and the third synthesis module is used for synthesizing the target audio packet and the digital human video frame to obtain the synthesized audio and video data of the digital human.
10. The digital human generation apparatus of claim 9, further comprising:
a first cache module, used for caching the interactive feedback text content in response to the interactive feedback text content belonging to a high-frequency script;
a fourth generating module, used for generating, for the interactive feedback text content, a face region cache image sequence corresponding to each of a plurality of cache insertion points on the silent respiratory state video frame; wherein every two adjacent cache insertion points are spaced apart by a preset number of frames;
and a second cache module, used for caching the generated target audio packet and the face region cache image sequence corresponding to each cache insertion point; wherein the face region cache image sequences corresponding to the cache insertion points share one target audio packet of the interactive feedback text content.
11. The digital human generation apparatus of claim 8, wherein the first cache video frame is a face region cache image; and the first synthesis module is specifically configured to:
synthesize the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain a synthesized video frame;
and align the timestamps of the synthesized video frame and the first cache audio packet, and encode the audio and video packet queue obtained after timestamp alignment to obtain the synthesized audio and video data of the digital person.
12. The digital human generation apparatus of claim 11, wherein obtaining the pre-cached video frame corresponding to the interactive feedback text content comprises:
determining the time at which the interaction request is received;
determining a target cache insertion point corresponding to the time from the plurality of cache insertion points on the silent respiratory state video frame;
and obtaining the corresponding cache video frame of the interactive feedback text content according to the target cache insertion point;
wherein synthesizing the face region cache image with a corresponding frame in the silent respiratory state video frame to obtain the synthesized video frame comprises:
determining a corresponding target video frame from the silent respiratory state video frames according to the target cache insertion point;
and merging the face region cache image with the target video frame to obtain the synthesized video frame.
13. The digital human generation apparatus of claim 8, wherein the acquisition module is specifically configured to:
determine that the interactive feedback text content comprises fixed-script content and random content;
in response to the fixed-script content in the interactive feedback text content existing in the cache information database, obtain a pre-cached second cache audio packet and second cache video frame corresponding to the fixed-script content;
generate a target audio packet, phonemes and timestamps of the random content in the interactive feedback text content;
and render and generate the digital human random content frame of the random content according to the phonemes and the timestamps of the random content;
wherein the first synthesis module is specifically configured to:
synthesize the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
14. The digital human generation apparatus according to claim 13, wherein synthesizing the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person comprises:
generating transition frames between the fixed-script content and the random content;
and synthesizing the transition frames, the second cache audio packet, the second cache video frame, the target audio packet of the random content, the digital human random content frame and the silent respiratory state video frame of the digital person to obtain the synthesized audio and video data of the digital person.
15. A computer device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the digital human generation method of any one of claims 1 to 7.
16. A computer-readable storage medium storing computer instructions, wherein the computer instructions are configured to cause a computer to perform the digital human generation method of any one of claims 1 to 7.
CN202210285154.4A 2022-03-21 2022-03-21 Digital human generation method, device, computer equipment and storage medium Pending CN114760425A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210285154.4A CN114760425A (en) 2022-03-21 2022-03-21 Digital human generation method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210285154.4A CN114760425A (en) 2022-03-21 2022-03-21 Digital human generation method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114760425A true CN114760425A (en) 2022-07-15

Family

ID=82327598

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210285154.4A Pending CN114760425A (en) 2022-03-21 2022-03-21 Digital human generation method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114760425A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116248812A (en) * 2023-05-11 2023-06-09 广州佰锐网络科技有限公司 Business handling method, storage medium and system based on digital human interaction video
CN116248812B (en) * 2023-05-11 2023-08-08 广州佰锐网络科技有限公司 Business handling method, storage medium and system based on digital human interaction video


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination