CN116168134A - Digital person control method, digital person control device, electronic equipment and storage medium - Google Patents

Digital person control method, digital person control device, electronic equipment and storage medium

Info

Publication number
CN116168134A
CN116168134A (application CN202211697797.6A)
Authority
CN
China
Prior art keywords
target
digital person
text
clause
broadcasted
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202211697797.6A
Other languages
Chinese (zh)
Other versions
CN116168134B (en)
Inventor
李鑫
刘朋
孙昊
付钰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202211697797.6A priority Critical patent/CN116168134B/en
Publication of CN116168134A publication Critical patent/CN116168134A/en
Application granted granted Critical
Publication of CN116168134B publication Critical patent/CN116168134B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/00 - 3D [Three Dimensional] image rendering
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06T - IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 - Animation
    • G06T13/20 - 3D [Three Dimensional] animation
    • G06T13/40 - 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Abstract

The disclosure provides a digital person control method and apparatus, an electronic device, and a storage medium, relating to the field of computer technology, in particular to artificial intelligence, and specifically to augmented reality, virtual reality, computer vision, deep learning, and the like, and applicable to scenarios such as the metaverse and virtual digital humans. The specific implementation scheme is as follows: acquiring a target clause from a text to be broadcasted; extracting sentence features of the target clause; determining target material matching the sentence features; generating control information containing the correspondence between the target clause and the target material; and sending the control information to the terminal device, so that the terminal device renders the target digital person and synchronously renders the target material while the target digital person is driven to broadcast the target clause. According to the scheme provided by the embodiments of the disclosure, suitable material can be matched automatically to drive the digital person, and even under poor network conditions the control information can still be received to render the digital person, reducing stuttering and choppy broadcasting.

Description

Digital person control method, digital person control device, electronic equipment and storage medium
Technical Field
The disclosure relates to the field of computer technology, in particular to artificial intelligence, and specifically to augmented reality, virtual reality, computer vision, deep learning, and the like, and can be applied to scenarios such as the metaverse and virtual digital humans.
Background
A digital person can be understood as a digitized character, created using digital technology, that approximates a human character. As digital person technology matures, digital persons have been applied in many fields, the most prominent applications being virtual anchors and virtual idols.
The content broadcast by a digital person needs to be configured in advance, and the digital person is driven to speak according to that content. However, as business models develop, the existing configuration approach can no longer meet application requirements.
Disclosure of Invention
The disclosure provides a digital person control method, a digital person control device, electronic equipment and a storage medium.
According to an aspect of the present disclosure, there is provided a control method of a digital person, including:
acquiring a target clause from a text to be broadcasted;
extracting sentence characteristics of a target clause;
determining target materials matched with sentence characteristics;
generating control information containing the corresponding relation between the target clause and the target material;
And sending the control information to the terminal equipment so as to enable the terminal equipment to render the target digital person and synchronously render the target material under the condition that the target digital person is driven to broadcast the target clause.
According to another aspect of the present disclosure, there is provided a control method of a digital person, including:
receiving control information for rendering a target digital person;
analyzing the corresponding relation between the target clause and the target material in the text to be broadcasted from the control information;
and rendering the target digital person based on the control information, and synchronously rendering the target material under the condition that the target digital person broadcasts the target clause based on the corresponding relation.
According to another aspect of the present disclosure, there is provided a control device for a digital person, including:
the acquisition module is used for acquiring a target clause from a text to be broadcasted;
the extraction module is used for extracting sentence characteristics of the target clause;
the matching module is used for determining target materials matched with the sentence characteristics;
the generation module is used for generating control information containing the corresponding relation between the target clause and the target material;
and the sending module is used for sending the control information to the terminal equipment so as to enable the terminal equipment to render the target digital person and synchronously render the target materials under the condition that the target digital person is driven to broadcast the target clause.
According to another aspect of the present disclosure, there is provided a control device for a digital person, including:
the receiving module is used for receiving control information for rendering the target digital person;
the analysis module is used for analyzing the corresponding relation between the target clause and the target material in the text to be broadcasted from the control information;
and the rendering module is used for rendering the target digital person based on the control information and synchronously rendering the target materials under the condition that the target digital person broadcasts the target clause based on the corresponding relation.
According to another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform a method according to any one of the embodiments of the present disclosure.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method according to any of the embodiments of the present disclosure.
In the embodiments of the disclosure, natural language processing is used to understand the features of the target clauses in the text to be broadcasted, so that corresponding target material can be matched automatically based on those features; suitable material can therefore be matched automatically, making it convenient to drive the digital person. In addition, by sending control information to the terminal device instead of a video stream, the volume of data exchanged between the terminal device and the server is reduced, and under poor network conditions the control information can still be received to render the digital person, reducing stuttering and choppy broadcasting.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 (a) is a schematic view of a scenario in which a digital human control method according to an embodiment of the present disclosure is applied;
FIG. 1 (b) is a flow chart of a digital human control method according to an embodiment of the present disclosure;
FIG. 2 (a) is a schematic diagram of a scenario for obtaining a target clause according to an embodiment of the present disclosure;
FIG. 2 (b) is a schematic view of a scenario for obtaining a target clause according to another embodiment of the present disclosure;
FIG. 3 is a schematic view of a scenario of acquiring target material according to an embodiment of the present disclosure;
FIG. 4 is a flow diagram of switching control information of a target digital person according to an embodiment of the present disclosure;
FIG. 5 is a schematic diagram of a scenario in which control information of a target digital person is switched according to an embodiment of the present disclosure;
FIG. 6 is a flow chart of a digital human control method according to another embodiment of the present disclosure;
FIG. 7 is a schematic view of a scene showing target material detail information based on floating windows according to an embodiment of the present disclosure;
FIG. 8 is an overall flow diagram of driving a target digital person according to an embodiment of the present disclosure;
FIG. 9 is a schematic structural view of a digital human control device according to an embodiment of the present disclosure;
fig. 10 is a schematic structural view of a digital human control device according to another embodiment of the present disclosure;
fig. 11 is a block diagram of an electronic device for implementing a method of controlling a digital person according to an embodiment of the present disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
The material that drives a digital person generally falls into two types: one is text or audio, so that the digital person can "speak" like a person; the other is visual content, including the broadcast background in which the digital person is placed and the pictures, videos, and so on related to the broadcast content.
These materials are usually configured manually, so the preparatory work required to drive a digital person is cumbersome and inefficient. In addition, in the related art, rendering is completed at the server based on the configured materials and the resulting video stream is sent to the terminal device for playback; under poor network conditions the digital person is prone to stuttering and the playback is not smooth.
In view of this, the embodiment of the disclosure provides a method for controlling a digital person, and fig. 1a is a schematic view of a scenario in which the method is applied. Fig. 1a includes a server 11 and a terminal device 12.
The terminal device 12 and the server 11 are connected through a wireless or wired network, and the terminal device 12 includes, but is not limited to, electronic devices such as desktop computers, mobile phones, laptop computers, tablet computers, media players, smart wearable devices, and smart televisions. The server 11 may be a single server, a server cluster formed by a plurality of servers, or a cloud computing center. It may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs (Content Delivery Networks), big data, and artificial intelligence platforms.
In the embodiments of the disclosure, the server 11 is enabled to complete the configuration of materials automatically, reducing manual operation steps, and transmits control information for the digital person to the terminal device 12 over the network in order to drive the digital person.
For the server, the control method for the digital person provided by the embodiment of the disclosure, as shown in fig. 1b, includes:
s101, acquiring a target clause from a text to be broadcasted.
In some embodiments, the content of the text to be broadcasted may be determined in a variety of ways. For example, a target voice may be collected and input into a speech recognition system, and the text corresponding to the target voice is obtained as the text to be broadcasted; as another example, text information may be acquired directly as the text to be broadcasted.
S102, extracting sentence characteristics of the target clause.
In some embodiments, sentence feature extraction may be performed on the target clause based on a pre-trained Transformer-based language model such as BERT (Bidirectional Encoder Representations from Transformers). Of course, any network model capable of extracting sentence features based on natural language processing techniques is applicable to the embodiments of the present disclosure.
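As a non-authoritative sketch of this step (the Hugging Face transformers package and the bert-base-chinese checkpoint are assumptions, not part of this disclosure), sentence features could be extracted as follows:

```python
# Hypothetical sketch: extracting a sentence feature vector for a target clause
# with a pre-trained BERT model (library and checkpoint are assumptions).
import torch
from transformers import BertTokenizer, BertModel

tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
model = BertModel.from_pretrained("bert-base-chinese")
model.eval()

def extract_sentence_feature(clause: str) -> torch.Tensor:
    """Return a fixed-size embedding representing the target clause."""
    inputs = tokenizer(clause, return_tensors="pt", truncation=True, max_length=128)
    with torch.no_grad():
        outputs = model(**inputs)
    # Mean-pool the token embeddings of the last hidden layer into one vector.
    return outputs.last_hidden_state.mean(dim=1).squeeze(0)

feature = extract_sentence_feature("A light green forest is singing")  # example clause
print(feature.shape)  # e.g. torch.Size([768])
```

The pooled vector can then be compared with the features of materials in the material library in the matching step described below.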
S103, determining target materials matched with the sentence characteristics.
S104, generating control information containing the corresponding relation between the target clause and the target material.
S105, the control information is sent to the terminal equipment, so that the terminal equipment renders the target digital person, and the target material is synchronously rendered under the condition that the target digital person is driven to broadcast the target clause.
The transmission of the control information to the terminal device may be implemented based on WebSocket technology, or an SDK (Software Development Kit) may be installed on the terminal device so that the control information is received through the SDK. The present disclosure is not limited in this regard.
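A minimal server-side sketch of the WebSocket option is shown below; the websockets package, the message shape, and the endpoint URI are illustrative assumptions rather than the disclosure's actual interface.

```python
# Hypothetical sketch: pushing control information to the terminal over WebSocket.
import asyncio
import json
import websockets

async def send_control_info(control_info: dict, uri: str = "ws://terminal.example:8765"):
    """Serialize the control information and push it to the terminal device."""
    async with websockets.connect(uri) as ws:
        await ws.send(json.dumps(control_info, ensure_ascii=False))
        ack = await ws.recv()  # optional acknowledgement from the terminal
        return ack

control_info = {
    "text": "text to be broadcasted ...",
    "correspondences": [
        {"clause": "target clause ...", "material_url": "https://cdn.example/material/123.png"}
    ],
}
asyncio.run(send_control_info(control_info))
```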
In the embodiments of the disclosure, the features of the target clause in the text to be broadcasted can be understood through natural language processing, so that the corresponding target material can be matched automatically based on those features; suitable material can therefore be matched automatically, making it convenient to drive the digital person. The step of manually configuring materials for the text can be reduced or even omitted, improving the efficiency of digital person control. In addition, by sending control information to the terminal device instead of a video stream, the volume of data exchanged between the terminal device and the server is reduced, and under poor network conditions the control information can still be received to render the digital person, reducing stuttering and choppy broadcasting.
In some embodiments, the following ways may be provided to obtain the target clause from the text to be broadcasted, including:
the way 1) in which the target clause is obtained, in which way a target mark, for example a start position and/or an end position of the target clause, can be added to the target clause of the text to be broadcasted. Therefore, the target mark can be inquired from the text to be broadcasted; the clause with the target mark is determined as the target clause.
In the case where only the start position is marked, one clause may be read back from the start position as the target clause. In the case where only the end position is marked, one clause may be read forward from the end position as the target clause. In the case of marking the start position and the end position, a sentence between the start position and the end position may be read as a target clause.
The target mark can be marked manually by a user on the text to be broadcasted, and different clauses can be cut through a natural language processing technology.
In addition to the implementation of marking the start position and the end position provided in the above embodiments, the target clause may be marked by means of natural language analysis. For example, the text to be broadcasted is an article, and the target mark is double underlined, that is, the sentence with double underlines is the target clause. Entity relation extraction can be performed on the text to be broadcasted. If the word of the physical class is detected, it can be marked. Wherein the words of the real object class can be forest, river water, delphinidia, mountain and delphinidia, etc. As shown in fig. 2 (a), two places, "light green forest". Singing … … "and" mountain of natural color ". Clapping … …" are clauses with target marks.
Thus, in the embodiments of the disclosure, the target clause is obtained automatically and accurately from the text to be broadcasted by means of marks, providing a data basis for the subsequent automatic matching of materials.
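Purely as an illustration of the mark query (the "<<" and ">>" delimiters below are an assumed mark syntax; the disclosure only requires that a start and/or end position be marked), such a lookup might be sketched as:

```python
# Hypothetical sketch: finding target clauses by querying target marks.
import re

def find_marked_clauses(text: str) -> list[str]:
    """Return clauses enclosed by assumed start/end marks '<<' and '>>'."""
    return [m.group(1) for m in re.finditer(r"<<(.+?)>>", text, flags=re.S)]

text = "Ordinary sentence. <<A light green forest is singing.>> Another sentence."
print(find_marked_clauses(text))  # ['A light green forest is singing.']
```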
Way 2) of obtaining the target clause: in this way, the text content of the text to be broadcasted is displayed; in response to a selection operation on the text content, the selected text content is split into clauses; and the clauses obtained by splitting are determined as target clauses.
For example, in a background configuration interface, an operator may view the text content to be broadcasted and freely select part of it; as shown in FIG. 2(b), the text in the dashed box is the content selected by the operator. If the selected text content contains several clauses, it is split and each resulting clause is determined as a target clause. If the selected text contains only one clause, that clause is determined directly as the target clause without splitting. The splitting can be done by using a neural network to classify each character, predicting the probability that the character is the start of a clause, the end of a clause, or an intermediate character of a clause. The start and end positions of each clause can thus be located, and the clauses are divided automatically.
In the embodiments of the disclosure, combining the user's operation with the automatic clause-splitting function improves the efficiency of material configuration. Moreover, since the text content is selected by the user autonomously, this configuration mode adapts well to different needs.
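The character-level neural splitter is not detailed in the disclosure; the sketch below substitutes a simple punctuation-based splitter purely as a stand-in to show where such a component would plug in.

```python
# Hypothetical stand-in for the clause splitter: in the disclosure a neural network
# predicts, per character, whether it starts, ends, or lies inside a clause;
# punctuation is used here purely for illustration.
import re

def split_into_clauses(selected_text: str) -> list[str]:
    parts = re.split(r"(?<=[。！？.!?])", selected_text)
    return [p.strip() for p in parts if p.strip()]

selected = "The forest is singing. The river is flowing! The mountain claps."
for clause in split_into_clauses(selected):
    print(clause)  # each clause becomes a target clause
```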
In some embodiments, the target clause may also be determined by combining the two approaches described above. For example, initial clauses are obtained from the text to be broadcasted by querying the target marks, and in the background configuration interface the operator can make a further selection among the obtained initial clauses, which are then segmented to obtain the target clauses.
After the target clause is acquired, suitable target material needs to be matched in order to drive the target digital person. In the embodiments of the present disclosure, the target material matching the sentence features may be determined in the following ways:
1) Candidate materials matching the sentence features are screened from a material library, and if a candidate material is matched, it is determined as the target material matching the sentence features.
That is, a material library can be constructed in advance; in practice, different texts to be broadcasted can correspond to different material libraries, which improves the accuracy of matching the target material.
2) In order to enrich the content presented by the digital person as much as possible, if no candidate material is matched, the sentence features are input into a material generation network to obtain target material matching the sentence features.
For example, a pre-trained generative adversarial network (GAN) may be used as the material generation network, and the GAN is used to generate material matching the sentence features.
In the embodiments of the disclosure, target material with a high degree of matching can be obtained based on the sentence features, improving the accuracy of material matching. Meanwhile, when no candidate material is matched in the material library, target material can be generated by the material generation network, enriching the content and improving the rendering effect of the digital person.
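A non-authoritative sketch of this two-branch logic is given below, under stated assumptions: cosine similarity over pre-computed material embeddings, the 0.8 matching threshold mentioned later in this description, and a placeholder callable standing in for the material generation network.

```python
# Hypothetical sketch: match a clause feature against a material library,
# falling back to a material generation network when nothing matches.
import torch
import torch.nn.functional as F

def match_or_generate(sentence_feature: torch.Tensor,
                      library_features: torch.Tensor,   # [num_materials, dim]
                      library_ids: list[str],
                      generator,                        # stand-in for the GAN
                      threshold: float = 0.8):
    sims = F.cosine_similarity(sentence_feature.unsqueeze(0), library_features, dim=1)
    best = int(sims.argmax())
    if sims[best] >= threshold:
        return {"source": "library", "material_id": library_ids[best],
                "score": float(sims[best])}
    # No candidate material matched: generate one from the sentence feature.
    generated = generator(sentence_feature)
    return {"source": "generated", "material": generated}
```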
The target material may include at least one of: animation material, text material, picture material, sound effect material, three-dimensional models, background material, and the like. Rich material content makes the forms of expression supported by the digital person more flexible and improves its business capability, so as to meet the business requirements of different scenarios.
In some embodiments, after the sentence features of the target clause are obtained, screening is performed in the material library based on those features. The screening interface is shown in FIG. 3: after the screening result is obtained, the candidate material with the highest matching degree can be enlarged and placed in a preview frame reserved for the target material, and an operator can decide whether to confirm the candidate material as the target material. The accuracy of matching the target material can thus be improved through this review step.
In practice, when several candidate materials match the clause features well, the candidates whose matching degree exceeds a matching threshold can be screened out and displayed for the user to choose from. The matching threshold may be, for example, 80%. When displayed, the candidate materials can be ordered by matching degree.
In addition to setting a matching threshold, the number of candidate materials to be displayed may also be limited. For example, with a count threshold of 5, if the number N of matched candidate materials is greater than 5, the top 5 candidates are taken for the user to choose from, where N is a positive integer.
In other embodiments, in order to match the target material accurately when N is large, the user is allowed to perform a secondary search over the N candidate materials to find a suitable target material. As shown in FIG. 3, the corresponding target clause may be presented on the left to help the user understand which material can be configured for it. When there are multiple candidate materials, the N candidates may be screened again based on keywords; for example, the keywords may supplement the color, shape, and other details associated with the target clause. Candidates can also be filtered by material type, such as animation material, text material, picture material, audio material, or three-dimensional model. The filter conditions can be entered by voice or by text, as shown in FIG. 3. The further screened candidates are then displayed for the user to select one, and the selected candidate can be previewed by clicking the preview control. Once the user selects a candidate material, it can be confirmed as the target material of the target clause through the "confirm" control shown in FIG. 3. In the case of a wrong selection, the correspondence between the target clause and the target material can be revoked through the "cancel" control shown in FIG. 3, and re-establishing the correspondence is supported.
In some embodiments, the matched target material may also be a sound effect. For example, by performing semantic understanding on the target clause, the scene it relates to can be obtained; when the target clause describes natural scenery such as falling leaves, rain, or rivers, sound effects may be added to set off the atmosphere.
Taking a target clause describing falling leaves as an example, not only can image materials related to falling leaves be matched, including videos, pictures, three-dimensional models, and the like, but a background sound effect suitable for falling leaves can also be matched.
In practice, where the target clause corresponds to a segment of the audio, the selected image material is aligned with that audio segment and the selected sound effect is added before and after it, so that the sound effect starts playing slightly before the target clause is broadcast and continues playing after the clause has been broadcast.
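As a rough illustration of this alignment (the timestamps, one-second padding, and timeline field names are assumptions made for the example), a timeline entry might be assembled as follows:

```python
# Hypothetical sketch: align image material with the audio segment of the target
# clause and pad the selected sound effect before and after that segment.
def build_timeline_entry(clause_start: float, clause_end: float,
                         image_material_url: str, sound_effect_url: str,
                         lead_in: float = 1.0, lead_out: float = 1.0) -> dict:
    return {
        "image": {"url": image_material_url,
                  "start": clause_start, "end": clause_end},
        "sound_effect": {"url": sound_effect_url,
                         "start": max(0.0, clause_start - lead_in),
                         "end": clause_end + lead_out},
    }

entry = build_timeline_entry(12.4, 17.9,
                             "https://cdn.example/falling_leaves.png",
                             "https://cdn.example/falling_leaves.wav")
```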
In some embodiments, where no candidate material is matched, sentence features may be input into the material generation network to obtain target material matching the sentence features, as set forth above. For example, suppose the type of target material required for the target clause is picture material. The material generation network may then be a GAN. A GAN is a deep learning model whose framework contains at least two modules, a generative model and a discriminative model, and high-quality output is produced through adversarial learning between the two. Based on this, the sentence features can be input into the generative model of the GAN to obtain generated candidate pictures, and a candidate picture matching the target clause is used as the target material of the target clause.
If no candidate material matching the target clause is obtained, a SimBERT model can be used to generate text similar to the target clause, the similar text is input into the GAN to obtain generated candidate pictures, and candidate materials matching the target clause are then screened from these pictures and used as the target material of the target clause. In this way, the pool of candidate materials can be expanded through clauses similar to the target clause, so that suitable target material can be selected from it.
The operation of matching the target clause with the target material can be completed before the target digital person goes live, or can be performed during the live broadcast of the target digital person, so that the content broadcast by the digital person can be configured flexibly.
In some embodiments, the target material may also include clothing and accessories worn by the target digital person. During the live broadcast, the clothing and accessories of the target digital person can be switched based on the content of the broadcast text, ensuring a good viewing experience for the user.
In some embodiments, the target material also includes the background material in which the target digital person is placed, which may be a two-dimensional picture of a real scene or a three-dimensional model. During the live broadcast of the digital person, if the background needs to be switched based on the broadcast text content, candidate backgrounds can be obtained in the same way as the target material is obtained, and the target background is determined based on the matching degree.
In some embodiments, in order to improve the smoothness of the digital person's broadcast, the correspondence between the target clause and the target material includes the download address of the target material, and the terminal device can download the target material based on that address and render it. This approach avoids occupying additional server GPU memory and reduces server cost.
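For illustration only, the correspondence carried in the control information might take a shape like the following; the field names are assumptions, and the disclosure only requires that the download address of the target material be included:

```python
# Hypothetical shape of the clause-to-material correspondence in the control information.
correspondence = [
    {
        "clause_id": 3,
        "clause_text": "target clause ...",
        "audio_frame_range": [310, 402],          # frames of the broadcast audio
        "material_download_url": "https://cdn.example/materials/forest.glb",
        "material_type": "three_dimensional_model",
    },
]
# The terminal device downloads each material from material_download_url and
# renders it while the target digital person broadcasts the matching clause.
```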
In some embodiments, in order to support the terminal device to render the target digital person, the control information further includes facial expression coefficients and limb motion coefficients of the target digital person. In practice, facial expression coefficients and limb motion coefficients may be automatically generated. For example, the following steps may be included:
and A1, inputting the audio of the text to be broadcasted into a voice animation synthesis network to obtain facial expression coefficients corresponding to each frame of audio respectively.
Step A2, carrying out semantic analysis on each frame of audio of the text to be broadcasted to obtain a semantic analysis result; and determining limb action coefficients corresponding to semantic analysis results of each frame of audio.
It should be noted that the execution sequence of the step A1 and the step A2 is not limited.
The audio of the text to be broadcasted is generated through a Text-To-Speech (TTS) network; the audio format may be a wav file. When the audio or the text to be broadcasted is input into a Voice-to-Animation (VTA) synthesis network, the facial expression coefficients of the target digital person corresponding to the audio, also called blendshape coefficients, are generated automatically, and accurate driving of the target digital person's mouth shape and facial expression can be achieved based on these coefficients.
Semantic analysis is performed on each frame of audio of the text to be broadcasted to obtain a semantic analysis result, the limb actions corresponding to the audio are then determined based on that result, and the alignment of the text to be broadcasted with the corresponding audio, facial expression coefficients, and limb action coefficients is carried out as an automated process. The text to be broadcasted, the corresponding audio, the facial expression coefficients, and the limb action coefficients are carried in the control information, which is sent to the terminal device for driving the digital person.
In the embodiments of the disclosure, accurate driving of the target digital person's mouth shape and facial expression can be achieved automatically based on the VTA algorithm. Meanwhile, the limb action coefficients of the target digital person are obtained based on a semantic understanding algorithm; combining the two allows the target digital person to be accurately controlled to make actions matching the text to be broadcasted, so that the digital person appears more natural.
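The coefficient-generation pipeline of steps A1 and A2 could be sketched as below; tts, voice_to_animation, and semantic_to_gesture are hypothetical placeholders for the TTS, VTA, and semantic analysis components, which this disclosure does not specify in code.

```python
# Hypothetical sketch of steps A1/A2: derive per-frame facial expression
# (blendshape) coefficients from audio and limb action coefficients from semantics.
def build_driving_coefficients(text: str, tts, voice_to_animation, semantic_to_gesture):
    audio_frames = tts.synthesize(text)                              # TTS audio frames
    expression = [voice_to_animation(f) for f in audio_frames]       # step A1
    semantics = [semantic_to_gesture.analyze(f) for f in audio_frames]
    limbs = [semantic_to_gesture.to_coefficients(s) for s in semantics]  # step A2
    # Per-frame alignment of audio, facial expression, and limb action coefficients.
    return [{"audio": a, "face": e, "limbs": l}
            for a, e, l in zip(audio_frames, expression, limbs)]
```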
Besides automatically generating facial expression coefficients and limb action coefficients based on the text to be broadcasted, the driving mode of the target digital person can be switched according to requirements. For example, the switching of the control information of the target digital person may also be implemented based on the following manner, as shown in fig. 4:
S401, responding to the switching instruction, stopping acquiring the facial expression coefficients from the voice animation synthesis network, and stopping acquiring the limb action coefficients based on the semantic analysis result.
That is, in the case of receiving the switching instruction, the operations of the foregoing steps A1 and A2 may be stopped, and the switching may be made to the driving of the target digital person in the manner of the target object.
S402, capturing facial motion of a target object for driving the target digital person to obtain a facial expression coefficient of the target object.
S403, capturing the limb motion of the target object to obtain the limb motion coefficient of the target object.
S404, carrying the facial expression coefficient and the limb action coefficient of the target object in the control information and sending the control information to the terminal equipment.
The target object can be a real person, who can be understood as the person operating the virtual anchor to conduct the live broadcast.
It follows that the user can send a switching request as needed to change the driving mode of the virtual digital person.
In the embodiments of the disclosure, the target digital person can be driven by the target object, i.e., the real person behind it, based on the switching instruction. This switching mode can be applied flexibly to various scenarios and has strong adaptability.
In some embodiments, interactive scenarios can occur during a live broadcast. Based on the facial expression coefficients and limb action coefficients generated automatically from the text to be broadcasted, the target digital person can broadcast 24 hours a day without interruption. Interactive sessions, such as question-and-answer sessions, can be scheduled regularly during the live broadcast. Because question-and-answer sessions are highly unpredictable, situations may arise that the automatically driven target digital person cannot handle. To avoid this, a switching instruction can be issued at a designated time point so that, from that point on, the driving mode is switched to driving by the target object: the facial expression coefficients of the target object are acquired through facial motion capture, the limb action coefficients of the target object are acquired through limb motion capture, control information is generated based on these coefficients, and the control information is sent to the terminal device so that the terminal device completes the rendering of the target digital person.
In this embodiment, the digital person can be used flexibly, the working time of the target object can be arranged as needed, and the driving cost incurred by having the target object drive the target digital person is reduced.
In some embodiments, when the target object finishes driving the target digital person, a switching instruction can be used to switch back to automatic driving by the text to be broadcasted. This switching instruction stops acquiring the facial expression coefficients of the target object through facial motion capture and stops acquiring the limb action coefficients of the target object through limb motion capture; instead, the facial expression coefficients are acquired from the voice animation synthesis network and the limb action coefficients are acquired from the semantic analysis results, and the subsequently acquired coefficients are carried in the control information sent to the terminal device so that the terminal device completes the rendering of the target digital person.
In some embodiments, in order to switch the driving mode naturally and avoid, as far as possible, stuttering or unsmooth movements of the target digital person, the switching of the driving mode may be implemented as follows: the target object is prompted to start driving the target digital person from a preset action, and once the target digital person has performed the preset action, the step of capturing the facial motion of the target object to obtain its facial expression coefficients is started.
In the embodiment of the disclosure, the driving of the target object to the digital person is completed based on the same action, so that the action of the digital person is more natural, the user watching experience is smoother, and the rendering effect of the digital person can be improved.
For example, the action of letting both arms hang down may be defined as the preset action. The target object is prompted, within a preset time period before the target digital person performs this action, to get ready; at the moment the target digital person is detected to perform the arms-down action, steps S401 to S404 are started, the driving data of the target object (including the facial expression coefficients and limb action coefficients) are obtained, and driving is switched to driving the target digital person by the target object.
Besides achieving a seamless hand-over of the driving action through a preset action, other possible embodiments may: determine a switching position in the text to be broadcasted in response to the switching instruction; prompt the target object with the progress of the current broadcast toward the switching position; and, once the target digital person has broadcast up to the switching position, start performing the step of stopping the acquisition of the facial expression coefficients from the voice animation synthesis network.
For example, as shown in FIG. 5, the background configuration interface for this mode shows the specific content of the text to be broadcasted: the gray underlined part is the portion in which the target digital person is driven by VTA and semantic understanding, the black part without gray underlining is the portion driven by the target object, and the switching position in the text is marked. For each character before the switching position, a prompt is given once the target digital person has broadcast it, so that as the broadcast proceeds, a gray progress bar dynamically shows the characters already broadcast and hence the remaining distance to the switching position. The target object can thus clearly follow the broadcast progress and perform the actions to drive the target digital person at the right time. For example, in FIG. 5, each character is shown as broadcast as soon as it has been broadcast, character by character, which helps the target object keep pace with the broadcast.
The switching position can also be freely specified by the target object, or the position at which a particular item has finished being broadcast can be determined as the switching position.
In the embodiment of the disclosure, the setting of the switching position and the broadcasting progress can remind the target object to prepare in advance, and under the condition that the broadcasting progress reaches the switching position, the target object has sufficient time to prepare so as to achieve the state of seamless connection for switching, so that the watching experience of a user is ensured.
In other embodiments, the switch may also be accomplished via exit and entrance animations. For example, at the point in time when the automatic driving of the digital person based on the text to be broadcasted ends, a transition animation is inserted and rendered, and after it has played, the driving data of the target object are obtained to drive the target digital person.
In the foregoing, a method for controlling a digital person by a server is described, and based on the same technical concept, an embodiment of the present disclosure provides a method for controlling a digital person applicable to a terminal device, which may be implemented as shown in fig. 6:
S601, receiving control information for rendering a target digital person.
S602, analyzing the corresponding relation between the target clause and the target material in the text to be broadcasted from the control information.
And S603, rendering the target digital person based on the control information, and synchronously rendering the target material under the condition that the target digital person broadcasts the target clause based on the corresponding relation.
In the embodiments of the disclosure, rendering the target digital person and the target material on the terminal device effectively saves network bandwidth, and based on the scheme provided by the disclosure the smoothness of the broadcast picture can be improved even under weak network conditions.
In some embodiments, in order to control the target digital person accurately, the control information further includes the facial expression coefficients and limb action coefficients of the target digital person, as well as the audio of the text to be broadcasted, so as to drive the target digital person to broadcast that text. Accordingly, rendering the target digital person based on the control information may be implemented as: controlling the facial expression of the target digital person based on the facial expression coefficients; controlling the limb movements of the target digital person based on the limb action coefficients; and controlling the target digital person to broadcast the text to be broadcasted based on the audio, while displaying the text to be broadcasted.
In the embodiment of the disclosure, the target digital person is driven based on the facial expression coefficient and the limb action coefficient in the control information, the audio of the text to be broadcasted and the text to be broadcasted, so that the digital person can be driven more accurately, and the viewing experience of the user in the live broadcasting process of the digital person can be improved.
According to the embodiments of the disclosure, the rendering of the target digital person and the target material can be implemented based on the graphics rendering interface and rendering engine on the terminal device. For example, rendering may be performed on the terminal device using 3D graphics interfaces such as WebGL (Web Graphics Library), DirectX, or OpenGL (Open Graphics Library) together with an engine such as three.js. Rendering of the target digital person mainly depends on graphics rendering techniques.
In some embodiments, taking three.js as an example for rendering a 3D model, one target digital person corresponds to one 3D model, and each target digital person is associated with a canvas tag in the graphics renderer; when a target digital person is used, its rendering can be implemented in the graphics renderer based on the corresponding tag. The target digital person used to render the text to be broadcasted can be specified in the control information by the operator. After the target digital person is selected, the control information carrying the facial expression coefficients and limb action coefficients, the text to be broadcasted, and the audio of the text to be broadcasted is transmitted to the terminal device, which renders the target digital person based on three.js.
In some embodiments, where the target material comprises a three-dimensional model, a rendering perspective of the three-dimensional model may be determined in response to a user operation on the three-dimensional model; rendering the three-dimensional model based on the rendering perspective.
In the embodiment of the disclosure, the method for determining the rendering view angle for the three-dimensional model based on the user operation greatly improves the viewing experience of the user and enables the user to conveniently view the three-dimensional model.
Take a live commerce scenario as an example: the three-dimensional model in this scenario can be the goods being sold. When the target digital person is selling product A, product A can be rendered into the user's live interface as the target material, and the user can perform custom operations on it, such as zooming in, zooming out, or rotating, so that the goods can be examined in detail. If the terminal device is a computer, the user can perform these operations with a mouse; if the terminal device is a mobile phone, the user can perform them via the touch screen.
In a live scenario where accessories are sold, the accessory can be configured on the target digital person. For example, at the point in time when product B (an earring) is being introduced, the earring may be configured on the earlobe of the target digital person, and the user can view the details of the earring while the target digital person explains it.
In addition, the user can be allowed to upload their own photo, from which a three-dimensional model of the user is built; the accessory is then fused into that model so that the user can preview how it looks when worn.
The user is also allowed to choose, according to personal preference, the live broadcast scene of the target digital person, its clothing and apparel, and its voice, so that the same live content can be presented in different ways to meet the personalized needs of different users.
In some embodiments, when the target material is rendered, in order to make it convenient for the user to learn the details of the target material as needed, detailed information about the target material can be acquired in response to a selection operation on the rendered target material, and the detailed information is displayed in a floating window.
In some embodiments, again taking a live commerce scenario as an example and as shown in FIG. 7, while the target digital person is introducing product C, the user may click on product C and then view its detailed information in the form of a floating window. The position of the floating window can be adjusted by the user, and its detailed information can include several sub-options, such as a three-dimensional model of product C, a purchase link, and its main uses. After the three-dimensional model option is clicked, the three-dimensional model of product C appears in the floating window, and the user can zoom in, zoom out, rotate it, or enter the model for a closer look. Which content is actually presented as detail information can be determined based on the user operations supported in the floating window.
In some embodiments, the method can also be applied to live scenarios introducing scenery. When the target digital person introduces a scenic spot, the target material can take the form of pictures. During the live broadcast, if the user wants to see more about scenic spot A while the target digital person is introducing it, the user can click on the material related to scenic spot A and obtain its detailed information in a floating window; the sub-options of the detailed information may be, for example, the address of scenic spot A, scenic spots similar to it, and video material about it.
In the embodiment of the disclosure, based on the mode of displaying the detail information of the target material, the user can quickly browse the detail information of the target material, and the viewing experience of the user is improved.
In some embodiments, when the task of the target digital person is selling real estate, the background can be the layout of a house. The target digital person can navigate through a three-dimensional model of the house and introduce the details of each room, and the user can steer the target digital person to walk around the house, so as to fully understand its layout. The detailed information of each room can be viewed, as well as the location of the house and the facilities around it.
In some embodiments, when the task of the target digital person is to introduce a school, the background may be a partial area of the school, such as the first teaching building, a dormitory building, the first dining hall, or an examination room. While the target digital person is introducing the first teaching building, the user can check, via the floating window, the year it was built, which majors it serves, whether it can be used for self-study, and so on. While the target digital person is introducing a dormitory building, the user can view the dormitory layout, the size of the beds, and so on, and learn which college's students live in that building. In this way, the user can gain a more detailed understanding of each area of the school and its purpose.
In some embodiments, if a holiday falls during the live broadcast, animation material associated with the holiday may be retrieved and rendered on the terminal device. For example, during a live broadcast over the Spring Festival, red lanterns, spring couplets, and other materials with a festive atmosphere may be selected. For the sake of the viewing experience, however, the user may adjust the position of the holiday material or turn it off.
The overall flow of driving a target digital person in a VTA and semantic understanding manner according to the embodiments of the present disclosure is shown in fig. 8, and may be implemented as follows:
s801, obtaining a text to be broadcasted.
S802, inputting the text to be broadcasted into the TTS to obtain the audio corresponding to the text to be broadcasted.
S803, inputting the audio into the VTA to acquire the facial expression coefficients.
S804, semantic understanding is carried out based on the text to be broadcasted, and limb action coefficients are obtained.
S805, obtaining a target clause from the text to be broadcasted.
S806, extracting sentence characteristics of the target clause.
S807, determining target material matching the sentence characteristics.
S808, generating a corresponding relation containing the target clause and the target material.
S809, the audio, the facial expression coefficients, the limb action coefficients and the corresponding relations are aligned, and control information is generated.
The alignment can be achieved by constructing arrays in which the audio, facial expression coefficients, and limb action coefficients of the same frame are stored in the same array. Where target material has been matched, the correspondence between the target clause corresponding to that audio and the target material can also be stored in the same array, and the correspondence may take the form of the download address of the target material.
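One possible realization of this per-frame alignment is sketched below (the array layout and field names are assumptions); each record bundles the audio frame, both coefficient sets, and, where a clause has matched material, the material's download address:

```python
# Hypothetical per-frame alignment array used to assemble the control information.
def align_frames(audio_frames, face_coeffs, limb_coeffs, clause_materials):
    """clause_materials maps a frame index to the download URL of the matched material."""
    aligned = []
    for i, (audio, face, limbs) in enumerate(zip(audio_frames, face_coeffs, limb_coeffs)):
        record = {"frame": i, "audio": audio, "face": face, "limbs": limbs}
        if i in clause_materials:
            record["material_download_url"] = clause_materials[i]
        aligned.append(record)
    return aligned
```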
And S810, transmitting control information comprising the text to be broadcasted, the audio, the facial expression coefficient, the limb action coefficient and the corresponding relation to the terminal equipment.
S811, the terminal equipment analyzes the corresponding relation from the control information, and the facial expression coefficient, the limb action coefficient, the audio and the text to be broadcasted are analyzed.
And S812, rendering the target digital person based on the control information, and synchronously rendering the target material under the condition that the target digital person broadcasts the target clause based on the corresponding relation.
Specifically, the actions of the target digital person are driven based on the facial expression coefficients and the limb action coefficients; the target material is obtained based on the correspondence; and while the target digital person broadcasts a target clause of the text to be broadcasted based on the audio, the text content of the target clause is displayed and the target material corresponding to the target clause is rendered synchronously.
Based on the same technical concept, the present disclosure also provides a control device for a digital person, including, as shown in fig. 9:
the acquisition module 901 is used for acquiring a target clause from a text to be broadcasted;
an extracting module 902, configured to extract sentence characteristics of a target clause;
a matching module 903, configured to determine a target material that matches the sentence feature;
A generating module 904, configured to generate control information including a correspondence between a target clause and a target material;
and a sending module 905, configured to send control information to the terminal device, so that the terminal device renders the target digital person, and synchronously renders the target material when the target digital person is driven to broadcast the target clause.
In some embodiments, the obtaining module is specifically configured to:
inquiring a target mark from a text to be broadcasted;
the clause with the target mark is determined as the target clause.
In some embodiments, the obtaining module is specifically configured to:
displaying the text content of the text to be broadcasted;
responding to the selection operation of the text content, and splitting the selected text content into clauses;
and determining the clause obtained by splitting as a target clause.
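Both ways of obtaining target clauses (querying a target mark, and splitting a user-selected text content) could be sketched as below; the concrete mark symbol and the clause delimiters are assumptions for illustration:

```python
import re

TARGET_MARK = "##"  # hypothetical mark; the disclosure does not fix a concrete symbol


def clauses_with_mark(text: str) -> list[str]:
    """Treat any clause carrying the target mark as a target clause."""
    clauses = re.split(r"[。！？!?.]", text)
    return [c.replace(TARGET_MARK, "").strip() for c in clauses if TARGET_MARK in c]


def clauses_from_selection(selected_text: str) -> list[str]:
    """Split the text content selected by the user into target clauses."""
    return [c.strip() for c in re.split(r"[，,。！？!?.]", selected_text) if c.strip()]
```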
In some embodiments, the target material includes at least one of: animation material, text material, picture material, sound effect material, three-dimensional model and background material.
In some embodiments, the matching module is specifically configured to:
screening candidate materials matched with sentence characteristics from a material library;
under the condition of matching to the candidate materials, determining the candidate materials as target materials matched with sentence characteristics;
Under the condition that the candidate materials are not matched, inputting sentence characteristics into a material generation network to obtain target materials matched with the sentence characteristics.
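A minimal sketch of this screen-then-generate fallback, assuming sentence features and material features are vectors compared by cosine similarity, and assuming a hypothetical `generation_network` interface:

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def match_target_material(sentence_feature, material_library, generation_network, threshold=0.8):
    """Screen the material library first; fall back to the material generation
    network when no candidate matches. Library entries are assumed to be dicts
    with a "feature" vector; the threshold is an illustrative value."""
    best, best_score = None, 0.0
    for material in material_library:
        score = cosine_similarity(sentence_feature, material["feature"])
        if score > best_score:
            best, best_score = material, score
    if best is not None and best_score >= threshold:
        return best                                        # candidate matched in the library
    return generation_network.generate(sentence_feature)   # no candidate: generate the material
```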
In some embodiments, the control information further includes facial expression coefficients and limb motion coefficients of the target digital person, and the apparatus further includes:
a coefficient determination module for generating facial expression coefficients and limb motion coefficients based on the following method:
inputting the audio of the text to be broadcasted into a voice animation synthesis network to obtain facial expression coefficients corresponding to each frame of audio respectively;
carrying out semantic analysis on each frame of audio of the text to be broadcasted to obtain a semantic analysis result;
and determining limb action coefficients corresponding to semantic analysis results of each frame of audio.
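The coefficient pipeline just described might look like the following sketch, where `tts`, `voice_animation_net`, `semantic_analyzer` and `action_lookup` are assumed interfaces standing in for the TTS, the voice animation synthesis network, the semantic analysis step and the mapping from analysis results to limb action coefficients:

```python
def generate_coefficients(text_to_broadcast, tts, voice_animation_net, semantic_analyzer, action_lookup):
    """Per-frame facial expression coefficients from the voice animation synthesis
    network, and per-frame limb action coefficients from semantic analysis."""
    audio_frames = tts.synthesize(text_to_broadcast)            # one entry per audio frame
    facial_coefficients = [voice_animation_net.infer(f) for f in audio_frames]
    limb_coefficients = []
    for frame in audio_frames:
        semantics = semantic_analyzer.analyze(frame)            # semantic analysis result
        limb_coefficients.append(action_lookup.get(semantics, [0.0]))  # matching coefficients
    return audio_frames, facial_coefficients, limb_coefficients
```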
In some embodiments, the coefficient determination module is further to:
responding to the switching instruction, stopping acquiring the facial expression coefficients from the voice animation synthesis network, and stopping acquiring the limb action coefficients based on the semantic analysis result;
capturing facial motion of a target object for driving the target digital person to obtain a facial expression coefficient of the target object; and
capturing limb actions of the target object to obtain a limb action coefficient of the target object;
And carrying the facial expression coefficient and the limb action coefficient of the target object in the control information and sending the control information to the terminal equipment.
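A minimal sketch of the switch from network-generated coefficients to captured ones, assuming hypothetical `capture_device` and `control_channel` interfaces:

```python
def switch_to_capture(capture_device, control_channel):
    """After the switching instruction, stop taking coefficients from the voice
    animation synthesis network and the semantic-analysis path, and instead send
    coefficients captured from the target object driving the digital person."""
    while capture_device.is_active():
        facial = capture_device.capture_facial_expression()   # facial expression coefficients
        limb = capture_device.capture_limb_action()           # limb action coefficients
        control_channel.send({
            "facial_coefficients": facial,
            "limb_coefficients": limb,
        })
```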
In some embodiments, the coefficient determination module is further to:
prompting the target object to start driving the target digital person from the preset action, and under the condition that the target digital person executes the preset action, starting to execute the step of capturing the facial action of the target object for driving the target digital person to obtain the facial expression coefficient of the target object.
In some embodiments, the coefficient determination module is further to:
responding to a switching instruction, and determining a switching position in a text to be broadcasted;
prompting the broadcasting progress of the current broadcasting content of the target digital person to the switching position to the target object;
in the case where the target digital person has broadcasted to the switching position, the step of stopping acquiring the facial expression coefficients from the voice animation synthesis network is started to be performed.
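One possible way to defer the switch until the broadcast reaches the switching position is sketched below; representing the broadcasting progress as a character index and choosing the end of the current clause as the switching position are assumptions for illustration:

```python
def handle_switch_instruction(text_to_broadcast, current_char_index, clause_boundaries, notify):
    """Pick the switching position, prompt the operator with the remaining
    progress, and report whether the position has been reached."""
    switch_position = next((b for b in clause_boundaries if b >= current_char_index),
                           len(text_to_broadcast))
    remaining = switch_position - current_char_index
    notify(f"{remaining} characters until the switching position")  # progress prompt to the operator
    return current_char_index >= switch_position                    # True once the switch may start
```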
In some embodiments, the corresponding relationship further includes a download address of the target material.
Based on the same technical concept, the present disclosure also provides a control device for a digital person, including, as shown in fig. 10:
a receiving module 1001 for receiving control information for rendering a target digital person;
The parsing module 1002 is configured to parse a corresponding relationship between a target clause and a target material in a text to be broadcasted from the control information;
the rendering module 1003 is configured to render the target digital person based on the control information, and synchronously render the target material based on the correspondence when the target digital person broadcasts the target clause.
In some embodiments, the control information further includes a facial expression coefficient and a limb action coefficient of the target digital person, a text to be broadcasted, and audio of the text to be broadcasted, and a rendering module specifically configured to:
controlling a facial expression of the target digital person based on the facial expression coefficient;
controlling limb movements of the target digital person based on the limb movement coefficients;
and controlling the target digital person to broadcast the text to be broadcasted based on the audio, and displaying the text to be broadcasted.
In some embodiments, the rendering module is further to:
in the case that the target material comprises a three-dimensional model, determining a rendering perspective of the three-dimensional model in response to a user operation on the three-dimensional model;
rendering the three-dimensional model based on the rendering perspective.
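A minimal sketch of deriving the rendering perspective from a user drag on the three-dimensional model; the yaw/pitch convention and the sensitivity value are assumptions, not part of the disclosure:

```python
def update_render_perspective(drag_dx, drag_dy, perspective, sensitivity=0.3):
    """Map a drag (in pixels) to a new (yaw, pitch) rendering perspective."""
    yaw, pitch = perspective
    yaw = (yaw + drag_dx * sensitivity) % 360.0
    pitch = max(-89.0, min(89.0, pitch + drag_dy * sensitivity))  # clamp to avoid flipping
    return yaw, pitch


# Example: a rightward drag of 40 px and a slight upward drag of 10 px.
print(update_render_perspective(40, -10, (0.0, 0.0)))
```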
In some embodiments, the rendering module is further to:
under the condition of rendering the target material, responding to the selection operation of the rendered target material, and acquiring detailed information of the target material;
And displaying the detailed information of the target material in the floating window.
For descriptions of specific functions and examples of each module and sub-module of the apparatus in the embodiments of the present disclosure, reference may be made to the related descriptions of corresponding steps in the foregoing method embodiments, which are not repeated herein.
In the technical solution of the present disclosure, the acquisition, storage, application and the like of the user personal information involved all comply with the provisions of relevant laws and regulations and do not violate public order and good morals.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 11 illustrates a schematic block diagram of an example electronic device 1100 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile apparatuses, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing apparatuses. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 11, the apparatus 1100 includes a computing unit 1101 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1102 or a computer program loaded from a storage unit 1108 into a Random Access Memory (RAM) 1103. In the RAM 1103, various programs and data required for the operation of the device 1100 can also be stored. The computing unit 1101, ROM 1102, and RAM 1103 are connected to each other by a bus 1104. An input/output (I/O) interface 1105 is also connected to bus 1104.
Various components in device 1100 are connected to I/O interface 1105, including: an input unit 1106 such as a keyboard, a mouse, etc.; an output unit 1107 such as various types of displays, speakers, and the like; a storage unit 1108, such as a magnetic disk, optical disk, etc.; and a communication unit 1109 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 1109 allows the device 1100 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 1101 may be a variety of general purpose and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1101 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The calculation unit 1101 performs the respective methods and processes described above, for example, a digital person control method. For example, in some embodiments, the digital person control method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1108. In some embodiments, some or all of the computer programs may be loaded and/or installed onto device 1100 via ROM 1102 and/or communication unit 1109. When a computer program is loaded into the RAM 1103 and executed by the computing unit 1101, one or more steps of the digital person control method described above may be performed. Alternatively, in other embodiments, the computing unit 1101 may be configured to perform the digital person's control method by any other suitable means (e.g. by means of firmware).
Various implementations of the systems and techniques described above can be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a terminal device and a server. The terminal device and the server are typically remote from each other and typically interact through a communication network. The relationship between the terminal device and the server arises by virtue of computer programs running on the respective computers and having a terminal device-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that the various forms of flows shown above may be used with steps reordered, added, or deleted. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved; no limitation is imposed herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions, improvements, etc. that are within the principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (31)

1. A method of controlling a digital person, comprising:
acquiring a target clause from a text to be broadcasted;
extracting sentence characteristics of the target clause;
determining target materials matched with the sentence characteristics;
generating control information containing the corresponding relation between the target clause and the target material;
and sending the control information to terminal equipment so that the terminal equipment renders a target digital person and synchronously renders the target material under the condition that the target digital person is driven to broadcast the target clause.
2. The method of claim 1, wherein the obtaining the target clause from the text to be broadcasted comprises:
inquiring a target mark from the text to be broadcasted;
and determining the clause with the target mark as the target clause.
3. The method of claim 1, wherein the obtaining the target clause from the text to be broadcasted comprises:
displaying the text content of the text to be broadcasted;
responding to the selection operation of the text content, and splitting the selected text content into clauses;
and determining the clause obtained by splitting as the target clause.
4. A method according to any one of claims 1-3, wherein the target material comprises at least one of:
animation material, text material, picture material, sound effect material, three-dimensional model and background material.
5. The method of any of claims 1-4, wherein the determining target material that matches the sentence feature comprises:
screening candidate materials matched with the sentence characteristics from a material library;
under the condition of matching to candidate materials, determining the candidate materials as target materials matched with the sentence characteristics;
And under the condition that the candidate materials are not matched, inputting the sentence characteristics into a material generation network to obtain target materials matched with the sentence characteristics.
6. The method of any of claims 1-5, wherein the control information further includes facial expression coefficients and limb motion coefficients of the target digital person, and the method further comprises generating the facial expression coefficients and the limb motion coefficients based on:
inputting the audio of the text to be broadcasted into a voice animation synthesis network to obtain facial expression coefficients corresponding to each frame of audio respectively;
carrying out semantic analysis on each frame of audio of the text to be broadcasted to obtain a semantic analysis result;
and determining limb action coefficients corresponding to semantic analysis results of each frame of audio.
7. The method of claim 6, the method further comprising:
responding to a switching instruction, stopping acquiring the facial expression coefficients from the voice animation synthesis network, and stopping acquiring the limb action coefficients based on a semantic analysis result;
capturing facial motion of a target object for driving the target digital person to obtain a facial expression coefficient of the target object; and
capturing limb actions of the target object to obtain a limb action coefficient of the target object;
and carrying the facial expression coefficient and the limb action coefficient of the target object in the control information and sending the control information to the terminal equipment.
8. The method of claim 7, the method comprising:
prompting the target object to start driving the target digital person from a preset action, and under the condition that the target digital person executes the preset action, starting to execute the step of capturing the facial action of the target object for driving the target digital person to obtain the facial expression coefficient of the target object.
9. The method of claim 7, the method further comprising:
responding to the switching instruction, and determining a switching position in the text to be broadcasted;
prompting the broadcasting progress of the current broadcasting content of the target digital person to the switching position to the target object;
and starting to execute the step of stopping acquiring the facial expression coefficients from the voice animation synthesis network under the condition that the target digital person has broadcasted to the switching position.
10. The method according to any one of claims 1-9, wherein the correspondence further includes a download address of the target material.
11. A method of controlling a digital person, comprising:
receiving control information for rendering a target digital person;
analyzing the corresponding relation between the target clause and the target material in the text to be broadcasted from the control information;
and rendering the target digital person based on the control information, and synchronously rendering the target material based on the corresponding relation under the condition that the target digital person broadcasts the target clause.
12. The method of claim 11, wherein the control information further includes facial expression coefficients and limb motion coefficients of the target digital person, the text to be broadcasted, and audio of the text to be broadcasted, and the rendering the target digital person based on the control information comprises:
controlling a facial expression of the target digital person based on the facial expression coefficient;
controlling limb movements of the target digital person based on the limb movement coefficients;
and controlling the target digital person to broadcast the text to be broadcasted based on the audio, and displaying the text to be broadcasted.
13. The method of claim 11 or 12, the method further comprising:
in a case where the target material includes a three-dimensional model, determining a rendering perspective of the three-dimensional model in response to a user operation on the three-dimensional model;
Rendering the three-dimensional model based on the rendering perspective.
14. The method of any one of claims 11-13, the method further comprising:
under the condition of rendering the target material, responding to the selection operation of the rendered target material, and acquiring detailed information of the target material;
and displaying the detailed information of the target material in the floating window.
15. A digital human control device comprising:
the acquisition module is used for acquiring a target clause from a text to be broadcasted;
the extraction module is used for extracting sentence characteristics of the target clause;
the matching module is used for determining target materials matched with the sentence characteristics;
the generation module is used for generating control information containing the corresponding relation between the target clause and the target material;
and the sending module is used for sending the control information to the terminal equipment so as to enable the terminal equipment to render the target digital person and synchronously render the target material under the condition of driving the target digital person to broadcast the target clause.
16. The apparatus of claim 15, wherein the obtaining module is specifically configured to:
inquiring a target mark from the text to be broadcasted;
And determining the clause with the target mark as the target clause.
17. The apparatus of claim 15, wherein the obtaining module is specifically configured to:
displaying the text content of the text to be broadcasted;
responding to the selection operation of the text content, and splitting the selected text content into clauses;
and determining the clause obtained by splitting as the target clause.
18. The apparatus of any of claims 15-17, wherein the target material comprises at least one of:
animation material, text material, picture material, sound effect material, three-dimensional model and background material.
19. The apparatus according to any of claims 15-18, wherein the matching module is specifically configured to:
screening candidate materials matched with the sentence characteristics from a material library;
under the condition of matching to candidate materials, determining the candidate materials as target materials matched with the sentence characteristics;
and under the condition that the candidate materials are not matched, inputting the sentence characteristics into a material generation network to obtain target materials matched with the sentence characteristics.
20. The apparatus of any of claims 15-19, wherein the control information further includes facial expression coefficients and limb motion coefficients of the target digital person, and the apparatus further comprises:
A coefficient determination module for generating the facial expression coefficients and the limb-motion coefficients based on the following method:
inputting the audio of the text to be broadcasted into a voice animation synthesis network to obtain facial expression coefficients corresponding to each frame of audio respectively;
carrying out semantic analysis on each frame of audio of the text to be broadcasted to obtain a semantic analysis result;
and determining limb action coefficients corresponding to semantic analysis results of each frame of audio.
21. The apparatus of claim 20, the coefficient determination module further to:
responding to a switching instruction, stopping acquiring the facial expression coefficients from the voice animation synthesis network, and stopping acquiring the limb action coefficients based on a semantic analysis result;
capturing facial motion of a target object for driving the target digital person to obtain a facial expression coefficient of the target object; and
capturing limb actions of the target object to obtain a limb action coefficient of the target object;
and carrying the facial expression coefficient and the limb action coefficient of the target object in the control information and sending the control information to the terminal equipment.
22. The apparatus of claim 21, the coefficient determination module further to:
Prompting the target object to start driving the target digital person from a preset action, and under the condition that the target digital person executes the preset action, starting to execute the step of capturing the facial action of the target object for driving the target digital person to obtain the facial expression coefficient of the target object.
23. The apparatus of claim 21, the coefficient determination module further to:
responding to the switching instruction, and determining a switching position in the text to be broadcasted;
prompting the broadcasting progress of the current broadcasting content of the target digital person to the switching position to the target object;
and starting to execute the step of stopping acquiring the facial expression coefficients from the voice animation synthesis network under the condition that the target digital person has broadcasted to the switching position.
24. The apparatus according to any one of claims 15-23, wherein the correspondence further includes a download address of the target material.
25. A digital human control device comprising:
the receiving module is used for receiving control information for rendering the target digital person;
the analysis module is used for analyzing the corresponding relation between the target clause and the target material in the text to be broadcasted from the control information;
And the rendering module is used for rendering the target digital person based on the control information and synchronously rendering the target material under the condition that the target digital person broadcasts the target clause based on the corresponding relation.
26. The apparatus of claim 25, wherein the control information further includes facial expression coefficients and limb motion coefficients of the target digital person, the text to be broadcasted, and audio of the text to be broadcasted, and the rendering module is specifically configured to:
controlling a facial expression of the target digital person based on the facial expression coefficient;
controlling limb movements of the target digital person based on the limb movement coefficients;
and controlling the target digital person to broadcast the text to be broadcasted based on the audio, and displaying the text to be broadcasted.
27. The apparatus of claim 25 or 26, the rendering module further to:
in a case where the target material includes a three-dimensional model, determining a rendering perspective of the three-dimensional model in response to a user operation on the three-dimensional model;
rendering the three-dimensional model based on the rendering perspective.
28. The apparatus of any of claims 25-27, the rendering module further to:
Under the condition of rendering the target material, responding to the selection operation of the rendered target material, and acquiring detailed information of the target material;
and displaying the detailed information of the target material in the floating window.
29. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-14.
30. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-14.
31. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-14.
CN202211697797.6A 2022-12-28 2022-12-28 Digital person control method, digital person control device, electronic equipment and storage medium Active CN116168134B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211697797.6A CN116168134B (en) 2022-12-28 2022-12-28 Digital person control method, digital person control device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211697797.6A CN116168134B (en) 2022-12-28 2022-12-28 Digital person control method, digital person control device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN116168134A true CN116168134A (en) 2023-05-26
CN116168134B CN116168134B (en) 2024-01-02

Family

ID=86419260

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211697797.6A Active CN116168134B (en) 2022-12-28 2022-12-28 Digital person control method, digital person control device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116168134B (en)

Patent Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2001080180A2 (en) * 2000-04-14 2001-10-25 Smyleventures, Inc. Method and apparatus for displaying assets, goods and services
CN110598671A (en) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 Text-based avatar behavior control method, apparatus, and medium
CN110941954A (en) * 2019-12-04 2020-03-31 深圳追一科技有限公司 Text broadcasting method and device, electronic equipment and storage medium
CN111541908A (en) * 2020-02-27 2020-08-14 北京市商汤科技开发有限公司 Interaction method, device, equipment and storage medium
CN113473159A (en) * 2020-03-11 2021-10-01 广州虎牙科技有限公司 Digital human live broadcast method and device, live broadcast management equipment and readable storage medium
CN113538641A (en) * 2021-07-14 2021-10-22 北京沃东天骏信息技术有限公司 Animation generation method and device, storage medium and electronic equipment
CN113901190A (en) * 2021-10-18 2022-01-07 深圳追一科技有限公司 Man-machine interaction method and device based on digital human, electronic equipment and storage medium
CN114201043A (en) * 2021-12-09 2022-03-18 北京百度网讯科技有限公司 Content interaction method, device, equipment and medium
CN114422647A (en) * 2021-12-24 2022-04-29 上海浦东发展银行股份有限公司 Digital person-based agent service method, apparatus, device, medium, and product
CN114596391A (en) * 2022-01-19 2022-06-07 阿里巴巴(中国)有限公司 Virtual character control method, device, equipment and storage medium
CN115082602A (en) * 2022-06-15 2022-09-20 北京百度网讯科技有限公司 Method for generating digital human, training method, device, equipment and medium of model
CN115376487A (en) * 2022-08-19 2022-11-22 北京百度网讯科技有限公司 Control method of digital human, model training method and device
CN115423905A (en) * 2022-08-30 2022-12-02 阿里巴巴(中国)有限公司 Digital human driving method, system, device and storage medium

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
LI HU, ET AL: "A Virtual Character Generation and Animation System for E-Commerce Live Streaming", VIRTUAL EVENT, pages 1202-1211 *
武海玲 (Wu Hailing) et al.: "A brief analysis of the application and innovation of 3D hyper-realistic digital human technology in live-streaming scenarios", 《中国传媒科技》 (China Media Technology), pages 14-17 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117376596A (en) * 2023-12-08 2024-01-09 江西拓世智能科技股份有限公司 Live broadcast method, device and storage medium based on intelligent digital human model
CN117376596B (en) * 2023-12-08 2024-04-26 江西拓世智能科技股份有限公司 Live broadcast method, device and storage medium based on intelligent digital human model

Also Published As

Publication number Publication date
CN116168134B (en) 2024-01-02

Similar Documents

Publication Publication Date Title
KR102503413B1 (en) Animation interaction method, device, equipment and storage medium
CN110968736B (en) Video generation method and device, electronic equipment and storage medium
US20200125920A1 (en) Interaction method and apparatus of virtual robot, storage medium and electronic device
KR101992424B1 (en) Apparatus for making artificial intelligence character for augmented reality and service system using the same
WO2021109376A1 (en) Method and device for producing multiple camera-angle effect, and related product
CN110110104B (en) Method and device for automatically generating house explanation in virtual three-dimensional space
CN110868635B (en) Video processing method and device, electronic equipment and storage medium
CN111147877B (en) Virtual gift presenting method, device, equipment and storage medium
JP2021168139A (en) Method, device, apparatus and medium for man-machine interactions
KR102186607B1 (en) System and method for ballet performance via augumented reality
CN113835522A (en) Sign language video generation, translation and customer service method, device and readable medium
KR101743764B1 (en) Method for providing ultra light-weight data animation type based on sensitivity avatar emoticon
CN111601145A (en) Content display method, device and equipment based on live broadcast and storage medium
CN112399258A (en) Live playback video generation playing method and device, storage medium and electronic equipment
CN106653050A (en) Method for matching animation mouth shapes with voice in real time
CN113923462A (en) Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium
JP2009049905A (en) Stream processing server apparatus, stream filter type graph setting device, stream filter type graph setting system, stream processing method, stream filter type graph setting method, and computer program
CN113132741A (en) Virtual live broadcast system and method
CN114245099A (en) Video generation method and device, electronic equipment and storage medium
CN114463470A (en) Virtual space browsing method and device, electronic equipment and readable storage medium
KR20230098068A (en) Moving picture processing method, apparatus, electronic device and computer storage medium
CN117055724A (en) Generating type teaching resource system in virtual teaching scene and working method thereof
CN116168134B (en) Digital person control method, digital person control device, electronic equipment and storage medium
Putra et al. Designing translation tool: Between sign language to spoken text on kinect time series data using dynamic time warping
CN112637692B (en) Interaction method, device and equipment

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant