WO2023010873A1 - Method and apparatus for audio driving of avatar, and electronic device - Google Patents

Method and apparatus for audio driving of avatar, and electronic device

Info

Publication number
WO2023010873A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
behavior
audio
model
time
Prior art date
Application number
PCT/CN2022/084697
Other languages
French (fr)
Chinese (zh)
Inventor
祝丰年
张保
Original Assignee
达闼机器人股份有限公司
Priority date
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2023010873A1 publication Critical patent/WO2023010873A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser

Definitions

  • The present disclosure relates to the field of avatars, and in particular to a method, an apparatus, and an electronic device for audio-driven avatars.
  • The purpose of the embodiments of the present invention is to provide a method for audio-driven avatar behavior, which drives the avatar to perform mouth shapes, facial expressions, and related body movements according to the semantics and context of the current audio information.
  • Behavior model data for mouth shapes, facial expressions, and body movements are generated by preprocessing on a model generation server or module; the audio information, the corresponding text information, and the behavior model are then associated at key time points to form audio model associated content.
  • The behavior of the avatar is driven according to the model associated content, and the audio information and the avatar behavior can be synchronized.
  • an embodiment of the present invention provides a method for audio-driven avatar behavior, including:
  • receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content; driving the behavior of the avatar according to the model associated content; and performing synchronization between the audio information and the behavior of the avatar.
  • the generating a behavior model according to the audio information and text information combined with scene information includes:
  • the corresponding behavior models of mouth shapes, expressions and actions are generated according to time.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content includes:
  • according to the audio information, the text information corresponding to the audio information, and the behavior model, the key time nodes corresponding to the start time and the duration are used for association, to form audio model associated content associated with time nodes;
  • the associated content of the audio model includes the audio information, text information, behavior model content and association relationship.
  • the driving of the behavior of the avatar according to the associated content of the model includes:
  • the behavior of the avatar is driven by the behavior model information in the model-associated content.
  • the synchronization between the audio information and the behavior of the avatar includes:
  • the time nodes include time nodes corresponding to the start time and the duration.
  • the synchronization between the audio information and the behavior of the avatar further includes:
  • each segment having a respective start time and duration
  • the synchronization between the audio information and the behavior of the avatar further includes:
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration;
  • the N is a natural number greater than 1.
  • the method also includes:
  • the behavior information unrelated to the audio information in the behavior model information is driven according to the non-associated time node array.
  • non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information.
  • An apparatus for audio-driven avatar behavior, including:
  • a receiving module configured to receive audio information
  • a text information generating module configured to generate text information according to the audio information
  • a behavior model generation module used to generate a behavior model according to the audio information and text information in combination with scene information
  • the audio model association module is used to associate the audio information, the text information and the behavior model in conjunction with time nodes to form audio model associated content;
  • a driving module configured to drive the behavior of the avatar according to the associated content of the model
  • the synchronization module is used for synchronizing the audio information and the behavior of the avatar.
  • an electronic device, including: a memory configured to store computer-readable instructions; and
  • a processor configured to run the computer-readable instructions, so that the electronic device implements the method described in any one of the above first aspects.
  • an embodiment of the present disclosure provides a non-transitory computer-readable storage medium for storing computer-readable instructions.
  • When the computer-readable instructions are executed by a computer, the computer implements the method described in any one of the above first aspects.
  • The embodiment of the present disclosure discloses a method, an apparatus, an electronic device, and a computer-readable storage medium for audio-driven avatar behavior, wherein the method includes: receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content; driving the behavior of the avatar according to the model associated content; and performing synchronization between the audio information and the behavior of the avatar.
  • With the audio-driven method of avatar behavior disclosed in the present disclosure, the behavior of the avatar can be driven in an associated manner, the audio information, the text information, and the avatar behavior can be synchronized at time nodes, the audio information can be accurately synchronized with the mouth movements of the avatar, and facial expressions and body movements can be combined and synchronized with the current audio content.
  • FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of the audio model associated content structure provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of time synchronization of elements of audio model-related content provided by an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of an audio-driven avatar behavior device provided by another embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of an electronic device provided by another embodiment of the present disclosure.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a schematic flow chart of the method for audio-driven avatar behavior provided by an embodiment of the present disclosure.
  • the audio-driven avatar behavior method provided in this embodiment can be executed by an audio-driven avatar behavior apparatus, which can be implemented as software or as a combination of software and hardware; the apparatus may be integrated in a device of the audio-driven avatar behavior system, such as a terminal device.
  • the method includes the following steps:
  • Step S101 Receive audio information.
  • the smart device receives audio information.
  • the smart device can be a smart robot, a smart terminal, or another smart device with a display screen; it can play the avatar animation by itself and can also interact with the user.
  • the smart device in this embodiment takes a smart robot as an example. It has an anthropomorphic form, and the display screen of its head shows the facial features of the virtual portrait.
  • after the smart device receives the audio information, it coordinates the mouth shape of the virtual portrait.
  • the corresponding voice is played synchronously, and at the same time the robot's virtual portrait can show anthropomorphic expressions, such as sad, laughing, smiling, crying, helpless, embarrassed, and other expressions.
  • the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information.
  • the audio information received in this embodiment may be the user's voice collected in real time by the intelligent robot while the user interacts with it, which is used as the source of the audio information; audio information stored in an external or internal storage device may also be called, and the source of the audio information is not limited to this.
  • Step S102 Generate text information according to the audio information.
  • in step S102, the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information. The dialogue between the user and the robot includes dialogue information related to audio, and the dialogue information includes characteristic data of the dialogue content. The process of acquiring the dialogue information is therefore the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; and using the text feature data as the feature data of the dialogue content.
  • the process of dialogue and interaction between the robot and the interactive object is usually: the robot speaks a paragraph and the interactive object replies; or the interactive object speaks a paragraph and the robot replies; the interactive object and the robot may also speak at the same time. Therefore, the original text information may be generated by the robot, by the interactive object, or by both the robot and the interactive object.
  • the process of obtaining the original text information is described for each scenario below:
  • Scenario 1: the original text information is the text information corresponding to the dialogue content generated by the robot.
  • Obtaining the original text information specifically includes: obtaining the text information to be played by the robot, and using the text information to be played as the original text information.
  • Scenario 2: the original text information is the text information corresponding to the dialogue content generated by the interactive object.
  • Obtaining the original text information specifically includes: collecting the audio data emitted by the interactive object when speaking, performing speech recognition on the audio data, and using the speech recognition result as the original text information.
  • Scenario 3: the original text information includes both the text information corresponding to the dialogue content generated by the robot and the text information of the dialogue content generated by the interactive object.
  • the text information to be played by the robot can be obtained according to the method in Scenario 1, the text information of the dialogue content generated by the interactive object can be obtained according to the method in Scenario 2, and the two are taken together as the original text information (see the sketch after these scenarios); the specific acquisition processes of Scenario 1 and Scenario 2 are not repeated here.
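  • As a combined illustration of the three scenarios above (a non-authoritative sketch; the names robot_text, user_audio, and asr_transcribe are assumptions for illustration, not part of the disclosure), the original text information can be assembled as follows:

```python
from typing import Callable, Optional

def get_original_text(robot_text: Optional[str],
                      user_audio: Optional[bytes],
                      asr_transcribe: Callable[[bytes], str]) -> str:
    """Assemble the original text information for the three scenarios."""
    parts = []
    if robot_text:                        # Scenario 1: text the robot is about to play
        parts.append(robot_text)
    if user_audio is not None:            # Scenario 2: speech recognition on the user's audio
        parts.append(asr_transcribe(user_audio))
    return " ".join(parts)                # Scenario 3: both sides taken together
```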
  • the text feature data is extracted from the original text information, which specifically includes: inputting the original text information into a preset text extraction model to obtain the text feature data.
  • the text extraction model is obtained through training based on the original text information stored in the training library and the text feature data corresponding to each piece of original text information.
  • the original text information stored in the training library is used as the input data of the text extraction model, the text feature data corresponding to each piece of original text information is used as the output data, and a Recurrent Neural Network (RNN) model structure is used to train on the input data and output data to determine the text extraction model; a typical recurrent neural network is, for example, the Long Short-Term Memory (LSTM) model architecture.
  • the original text information is input into the text extraction model to obtain the text feature data.
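  • A minimal PyTorch sketch of such a text extraction model is given below, assuming an LSTM that maps tokenized original text to a fixed-size text feature vector; the vocabulary size and dimensions are illustrative assumptions, and the training loop is omitted.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """LSTM-based text extraction model: original text tokens -> text feature data."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)                   # final hidden state of the LSTM
        return self.proj(h_n[-1])                    # (batch, feat_dim) text feature data

# Example usage with random token ids:
features = TextFeatureExtractor()(torch.randint(0, 10000, (1, 12)))
```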
  • Step S103 Generate a behavior model according to the audio information and text information combined with scene information.
  • in step S103, the current semantics, context, and context scene of the avatar dialogue are analyzed according to the incoming audio content and the corresponding text information, and the corresponding mouth shape, expression, and action behavior models are generated over time according to the audio content.
  • the behavior model is generated according to the audio information and the text information in combination with scene information; specifically, based on the received audio information and the corresponding text information, combined with the scene information, behavior models of the corresponding mouth shapes, facial expressions, and actions are generated over time for the avatar behavior.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • the behavior model is obtained in advance according to the audio information in the sample training library and the behavior actions corresponding to the audio information before acquiring the dialogue information.
  • the dialogue content produced by one party of the interaction will affect the behavior of the other party while the other party listens. Strongly related actions, such as mouth shapes, need to be synchronized with the audio information, and secondarily related actions also need to be synchronized with the audio information, so the behavior model has a corresponding relationship with the audio information.
  • mouth-shape training is particularly important for robot behavior. Usually a large amount of audio and video information is collected, which contains a large amount of audio information and the corresponding mouth movements.
  • big-data training can be carried out based on the relationship between the audio and the mouth shape, and the corresponding mouth movement can be obtained.
  • the mouth movement of the robot can usually be combined with facial expressions and body movements at the same time.
  • the dialogue content that the robot sends to the interactive object will also have an impact on the robot's own behavior.
  • each dialogue information in the sample training library and the behavior actions corresponding to each dialogue information can be obtained in the following ways:
  • A large amount of audio information and audio and video information is obtained from audio and video file data, for example by collecting 4000 audio and video files.
  • Audio and video containing dialogue scenes can be collected, for example audio and video files of talk shows.
  • In a talk show there are usually only two people talking, which is similar to the robot interaction situation, so the audio and video files of talk shows can be used as training data to train the behavior model accurately.
  • Each audio and video file contains two interactive objects, and each audio and video file contains a complete dialogue scene.
  • The processing process for each audio and video file is: the audio data belonging to interactive object A and the audio data belonging to interactive object B are collected separately through speech recognition, the audio data of interactive object A is converted into text data, and the audio data of interactive object B is converted into text data.
  • The behavior of interactive object A and the behavior of interactive object B are collected through image analysis. It can be understood that interactive object A and interactive object B are only used to distinguish the two interactive objects in one audio and video file, and interactive object A (or interactive object B) in different audio and video files may be different individuals.
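  • The per-file processing just described can be sketched as follows; the helpers diarize, transcribe, and analyze_behavior are hypothetical placeholders standing in for the speech recognition and image analysis steps mentioned above.

```python
def process_av_file(av_path, diarize, transcribe, analyze_behavior):
    """Turn one collected audio/video file into (speaker, audio, text, behavior) samples."""
    samples = []
    # Split the file into clips per interactive object (A or B).
    for speaker, audio_clip, video_clip in diarize(av_path):
        text = transcribe(audio_clip)             # convert the speaker's audio data into text data
        behavior = analyze_behavior(video_clip)   # mouth, expression, body collected via image analysis
        samples.append({"speaker": speaker, "audio": audio_clip,
                        "text": text, "behavior": behavior})
    return samples
```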
  • Step S104 Associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
  • associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content specifically includes: according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through the key time nodes corresponding to the start time and the duration, to form audio model associated content associated with time nodes; the audio model associated content includes the audio information, the text information, the behavior model content, and the association relationship.
  • the behaviors in the behavior model are sorted according to their correlation with the audio information, and the most relevant behavior constitutes the first behavior information.
  • the mouth movement corresponding to the audio information is the first behavior, which has the strongest correlation with the audio information.
  • the behavior secondarily related to the audio information constitutes the second behavior information.
  • the facial expression of the avatar can be used as the second behavior, or the body movement can be used as the second behavior.
  • other behaviors are sorted according to their degree of correlation with the audio information, which is not strictly limited in the present disclosure.
  • the above time nodes include time nodes corresponding to the start time and the duration.
  • the most relevant mouth movements are strictly associated according to the time points corresponding to the start time and the duration of the audio information, and the sequentially related behaviors can be associated with the audio information according to an approximate time node.
  • the approximate time node here can be set with a certain time interval, for example [-5s, +5s], [-3s, +3s], [-2s, +2s], [-1s, +1s], [-0.5s, +0.5s], and so on.
  • Unrelated actions can be related by an array of unrelated time points on the time axis.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
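  • The three kinds of time association described above (strict, approximate, and non-associated) can be sketched as follows; the tolerance window and interval values are illustrative assumptions only.

```python
def strictly_associated(behavior_start, behavior_duration, audio_start, audio_duration):
    """Most relevant behavior (e.g. mouth shape): matches the audio time node exactly."""
    return behavior_start == audio_start and behavior_duration == audio_duration

def approximately_associated(behavior_start, audio_start, tolerance=1.0):
    """Sequentially related behavior: allowed within an approximate window, e.g. [-1s, +1s]."""
    return abs(behavior_start - audio_start) <= tolerance

def non_associated_nodes(total_duration, interval=2.0):
    """Unrelated behavior: an array of equally spaced time nodes on the time axis."""
    nodes, t = [], 0.0
    while t < total_duration:
        nodes.append(t)
        t += interval
    return nodes
```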
  • Step S105 Drive the behavior of the avatar according to the associated content of the model.
  • in step S105, the behavior of the avatar is driven by the behavior model information in the model associated content. This specifically includes: first parsing the associated model content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship between the above information; driving the behavior of the avatar through the behavior model information in the model associated content; and performing the driving between the audio information and the avatar behavior through the audio information and the association relationship. According to the order of correlation, the first behavior information most related to the audio information in the behavior model information is driven together with the audio information through the time node corresponding to the start time and the duration.
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is driven through the time node corresponding to the start time and the duration of the audio information; N is a natural number greater than 1.
  • the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • Step S106 Perform synchronization between the audio information and the behavior of the avatar.
  • in step S106, through the time axis with time node distribution established in the previous step, the audio information, the text information, and the behavior model information are associated through the time nodes according to the time nodes on the time axis, to form an association relationship associated with the time nodes; through the audio information and the association relationship, synchronization between the audio information and the avatar behavior is performed; the time nodes include the time nodes corresponding to the start time and the duration.
  • the synchronization between the audio information and the avatar behavior further includes: dividing the audio information into a plurality of segments, each segment having its own start time and duration; synchronizing the corresponding text information with the audio information through the time node corresponding to the start time and the duration; and synchronizing the first behavior information in the behavior model information most related to the audio information with the audio information through the time node corresponding to the start time and the duration.
  • the Nth behavior information sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and the duration; the N is a natural number greater than 1.
  • a non-associated time node array is set on the time axis; behavior information in the behavior model information that is not related to the audio information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
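  • A minimal sketch of the driving and synchronization flow of steps S105 and S106 is given below; the segment fields and the drive_* callbacks are assumptions for illustration, not the literal interface of the disclosure.

```python
def drive_avatar(segments, drive_lip, drive_expression, drive_action, expr_tolerance=1.0):
    """Drive avatar behaviors from parsed audio model associated content, segment by segment."""
    for seg in segments:
        start = seg["start"]                               # start time of this audio segment
        # First behavior (lip): strictly synchronized with the audio time nodes.
        for lip in seg["model"]["LipSync"]:
            drive_lip(lip["Data"], at=lip["Start"], until=lip["End"])
        # Second behavior (expression): synchronized within an approximate window.
        for expr in seg["model"]["Expression"]:
            if abs(expr["Start"] - start) <= expr_tolerance:
                drive_expression(expr["Data"], at=expr["Start"])
        # Nth behavior (body action): driven at its own associated time node.
        for act in seg["model"]["Action"]:
            drive_action(act["Data"], at=act["Start"], until=act["End"])
```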
  • Fig. 2 is a schematic diagram of the audio model related content structure provided by an embodiment of the present disclosure.
  • the audio model associated content is the file synthesized by the audio model file generation module in Fig. 1.
  • Fragmentation is performed according to the audio content, the text, and the models.
  • Each slice contains the audio, the text, the behavior models, and the association of each data element at time nodes.
  • the audio model file synthesized correspondingly in FIG. 1 generally consists of multiple slices as shown in FIG. 2, and each slice has the same data structure with different data content.
  • Take slice 1 as an example for illustration, including:
  • Audio segment 1: the first piece of audio data in the audio file.
  • The audio segments can be of a fixed size or divided according to the audio content; there is no restriction on the division method.
  • Text: the text content corresponding to the audio segment.
  • Behavior model: the behavior model corresponding to the audio segment and the scene, including expression, body, and mouth model data, but not limited to these types of behavior models.
  • Start time: the start time corresponding to each of the above elements; each of the above elements has its own start time, described here collectively.
  • End time: the end time corresponding to each of the above elements; each of the above elements has its own end time, described here collectively.
  • Duration: the duration corresponding to each of the above elements; each of the above elements has its own duration, described here collectively.
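  • The slice just described can be pictured as the following data structure; the field names and types are an illustrative assumption rather than the literal file format (for simplicity the times are stored once per slice, although each element can carry its own).

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Slice:
    index: int                      # slice index within the audio model file
    audio: bytes                    # audio segment data (e.g. audio segment 1)
    text: str                       # text content corresponding to the audio segment
    model: Dict[str, List[dict]]    # behavior model: mouth, expression, body model data
    start_time: float               # start time (seconds)
    end_time: float                 # end time
    duration: float                 # duration

audio_model_file: List[Slice] = [] # the file is a sequence of such slices
```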
  • FIG. 3 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure.
  • the data format of the audio information includes: segment index, segment size, start time, end time, text information, audio size, audio information and model content.
  • the model content includes behavior information and association relationships, and is the focus of the description here. Since the composition of the audio model content has been explained above, the definition and fragmentation of the model content are illustrated and described in detail in FIG. 3.
  • MoudleData contains LipSync (mouth model), Expression (expression model), and Action (body movement model). Each model further specifies its detailed content: Name (the name of the subdivided action), Start (the start time), End (the end time), and Data (the model data of the subdivided action).
  • LipSync is the name of the mouth model, which contains a number of different mouth model elements X, Y, etc.; each model element contains:
  • a model name, to distinguish different models.
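  • As an example of how the model content could be laid out, MoudleData with its three sub-models might look like the following; all names and values are invented for illustration.

```python
moudle_data = {
    "LipSync": [                                   # mouth model elements X, Y, ...
        {"Name": "X", "Start": 0.00, "End": 0.20, "Data": [0.1, 0.7, 0.3]},
        {"Name": "Y", "Start": 0.20, "End": 0.45, "Data": [0.6, 0.2, 0.0]},
    ],
    "Expression": [
        {"Name": "smile", "Start": 0.0, "End": 1.5, "Data": {"blendshape": "smile", "weight": 0.8}},
    ],
    "Action": [
        {"Name": "wave", "Start": 0.3, "End": 1.2, "Data": {"joint_track": "right_arm"}},
    ],
}
```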
  • Fig. 4 shows a schematic diagram of time synchronization of elements of audio model-associated content provided by an embodiment of the present disclosure.
  • This figure is a schematic diagram of the synchronization of each element of the audio model file, which includes time (the working time of the audio model file), audio (each segment of the audio), mouth shape, text, body movement, and expression, and the distribution of each element on the time axis.
  • the corresponding audio is divided into two segments by words, and each segment has its own time description (start and duration); the corresponding text has the same time description associated with the audio; the mouth shape model is related to the corresponding audio content, for example, the first audio segment in Figure 4 corresponds to mouth shape models 1, 2, and 3; the waving action is performed during the playback of the first audio segment, together with a blinking action at the same time. All elements work according to the same time axis.
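  • The timeline in the figure can be illustrated with a small, invented example in which the first audio segment carries three mouth shape models together with a wave and a blink, all referenced to the same time axis; every value below is made up for illustration.

```python
timeline = [
    {"audio": "segment 1", "start": 0.0, "duration": 1.2,
     "text": "hello there",
     "mouth": ["shape 1", "shape 2", "shape 3"],           # mouth shape models 1, 2, 3
     "actions": [{"name": "wave",  "start": 0.2, "duration": 0.8},
                 {"name": "blink", "start": 0.2, "duration": 0.3}]},
    {"audio": "segment 2", "start": 1.2, "duration": 0.9,
     "text": "nice to meet you",
     "mouth": ["shape 4", "shape 5"],
     "actions": []},
]
```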
  • FIG. 5 is a schematic diagram of an audio-driven device for avatar behavior provided by another embodiment of the present disclosure.
  • the apparatus for audio-driven avatar behavior includes: a receiving module 501, a text information generating module 502, a behavior model generating module 503, an audio model association module 504, a driving module 505, and a synchronization module 506, wherein:
  • the receiving module 501 is configured to receive audio information.
  • the smart device receives audio information.
  • the smart device in this embodiment takes an intelligent robot as an example. It has an anthropomorphic form, and the facial features of a virtual person are displayed on the display screen of its head. After the smart device receives the audio information, it coordinates the mouth shape of the virtual portrait, plays the corresponding voice synchronously, and at the same time can match the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassed, and other expressions.
  • the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information.
  • the audio information received in this embodiment may be the user's voice collected in real time by the intelligent robot while the user interacts with it, which is used as the source of the audio information; audio information stored in an external or internal storage device may also be called, and the source of the audio information is not limited to this.
  • the text information generation module 502 is configured to generate text information according to the audio information.
  • the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information.
  • the dialogue between the user and the robot includes dialogue information involving audio, and the dialogue information includes the characteristic data of the dialogue content.
  • the process of obtaining the dialogue information is the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; and using the text feature data as the feature data of the dialogue content.
  • the behavior model generation module 503 is configured to generate a behavior model according to the audio information and text information combined with scene information.
  • the current semantics, context, and context scene of the avatar dialogue are analyzed according to the incoming audio content and the corresponding text information, and the corresponding mouth shape, expression, and action behavior models are generated over time according to the audio content.
  • the behavior model is generated according to the audio information and the text information in combination with scene information; specifically, based on the received audio information and the corresponding text information, combined with the scene information, behavior models of the corresponding mouth shapes, facial expressions, and actions are generated over time for the avatar behavior.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • the audio model association module 504 is configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
  • Associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content specifically includes: according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through the key time nodes corresponding to the start time and the duration to form audio model associated content associated with time nodes; the audio model associated content includes the audio information, the text information, the behavior model content, and the association relationship.
  • the behaviors in the behavior model are sorted according to their correlation with the audio information, and the most relevant behavior action constitutes the first behavior information.
  • the mouth movement corresponding to the audio information is the first behavior action, which has the strongest correlation with the audio information.
  • the behaviors secondarily related to the audio information constitute the second behavior information.
  • the facial expressions of the avatar can be used as the second behavior, or body movements can be used as the second behavior.
  • other behaviors are sorted according to their degree of correlation with the audio information, which is not strictly limited in the present disclosure. A time axis with a time node distribution is established, and the audio information, the text information, and the behavior model information are associated through the time nodes according to the time nodes on the time axis, to form an association relationship associated with the time nodes.
  • the time nodes include time nodes corresponding to the start time and the duration.
  • the driving module 505 is configured to drive the behavior of the avatar according to the associated content of the model.
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is driven through the time node corresponding to the start time and the duration of the audio information; N is a natural number greater than 1.
  • the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • the synchronization module 506 is used for synchronizing the audio information and the behavior of the avatar.
  • audio information, text information and behavior model information are associated through the time nodes to form an association relationship associated with the time nodes; through the audio information and the association relationship , performing synchronization between the audio information and the behavior of the avatar; the time node includes a time node corresponding to a start time and a duration.
  • the synchronization module 506 is further configured to: divide the audio information into a plurality of segments, each segment has its own start time and duration, and pass the corresponding text information and the audio information through the start time The time node corresponding to the duration is synchronized, and the first behavior information most related to the audio information in the behavior model information is synchronized with the audio information through the time node corresponding to the start time and the duration.
  • the synchronization module 506 is further configured to: synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and duration; According to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration; the N is a natural number greater than 1 .
  • the synchronization module 506 is further configured to: set a non-associated time node array on the time axis; drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array .
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • the device shown in FIG. 5 can execute the method of the embodiment shown in FIG. 1 .
  • FIG. 6 shows a system schematic diagram of an audio-driven avatar behavior provided by an embodiment of the present disclosure.
  • This is a schematic diagram of the system composition, combining the system diagram with the apparatus diagram of audio-driven avatar behavior in accompanying drawing 5. According to the logical relationship between the modules, the system diagram shows the logical relationship between the behavior model generation module, the audio model file generation module, the audio model analysis module, and the avatar behavior driving module.
  • the behavior model generation module analyzes the current semantics, context and avatar dialogue context scene according to the incoming audio content and corresponding text information, and generates corresponding mouth shapes, expressions and action behavior models according to time according to the video content.
  • the audio model file generation module associates the audio, the text, and the model according to the audio content, the text corresponding to the audio content, and the model data generated by the model generation module, based on the time nodes of the audio data playback, with each element associated at its corresponding node in the audio playback through the start time, the duration, and other key time nodes, to form an audio model file associated with time nodes; the file includes the time nodes (start time, duration), the audio file, the text, and the model information.
  • the audio model file generation module corresponds to some functions of the text information generation module and the audio model association module in FIG. 6 .
  • the audio model file parsing module parses the audio model file to obtain audio content, model content, text content, and associations among the above contents, including but not limited to time associations.
  • the audio model file parsing module corresponds to the audio model association module in FIG. 6 .
  • the avatar behavior driving module drives the avatar behavior according to the model content analyzed in the audio analysis module.
  • the synchronization between the audio playback and the action model is performed through the association between the audio content and the model.
  • the avatar behavior driving module corresponds to the driving module in FIG. 6 .
  • a device for audio-driven avatar behavior further comprising:
  • the audio playing module plays the audio content parsed in the above audio model file parsing module.
  • the text display module displays the text content parsed in the above-mentioned audio model file parsing module.
  • FIG. 7 shows a schematic structural diagram of an electronic device 700 suitable for implementing another embodiment of the present disclosure.
  • the terminal equipment in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 7 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.
  • an electronic device 700 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 701, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored.
  • the processing device 701, ROM 702, and RAM 703 are connected to each other through a communication line 704.
  • An input/output (I/O) interface 705 is also connected to the communication line 704 .
  • the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 708 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 709.
  • the communication means 709 may allow the electronic device 700 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 7 shows electronic device 700 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 709, or from storage means 708, or from ROM 702.
  • the processing device 701 When the computer program is executed by the processing device 701, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network of.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: executes the interaction method in the above-mentioned embodiment.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of a unit does not constitute a limitation of the unit itself under certain circumstances.
  • For example and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • An electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute any one of the methods in the foregoing first aspect.
  • A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute any one of the methods in the foregoing first aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method and an apparatus for audio driving of an avatar, and an electronic device, the method comprising: receiving audio information (S101); on the basis of the audio information, generating text information (S102); on the basis of the audio information and the text information in combination with scene information, generating a behaviour model (S103); associating the audio information, the text information, and the behaviour model with a time node to form audio model associated content (S104); on the basis of the model associated content, driving the behaviour of an avatar (S105); and implementing synchronisation between the audio information and the behaviour of the avatar (S106). The present method can associate and drive the behaviour of the avatar, and implement time node synchronisation of the audio information, text information and behaviour of the avatar, accurately synchronising audio information and avatar mouth movements, and simultaneously combining and synchronising facial expressions and body movements with current audio content.

Description

Method, apparatus, and electronic device for audio-driven avatars
Cross Reference
This application claims the priority of the Chinese patent application filed on August 3, 2021, with the application number 202110888459.X and the invention title "Method, apparatus, and electronic device for audio-driven avatar behavior", the entire contents of which are incorporated by reference into this application.
Technical Field
The present disclosure relates to the field of avatars, and in particular to a method, an apparatus, and an electronic device for audio-driven avatars.
Background
Traditional interactive smart devices, when interacting with users through an avatar, often only have the avatar output voice in a simple way, without coordinating the avatar's mouth shape, and the avatar's facial features show a single expression, without rich expressions of joy, anger, or sorrow. In traditional schemes for driving avatar behavior with audio, even if the avatar's mouth shape changes during voice interaction with the user, it is only a repetitive, simple opening and closing motion; the mouth shape, facial expression, and body behavior of the avatar in the smart device cannot have lip deformation coefficients generated synchronously from the real-time audio stream, so the driven avatar cannot perform precise mouth movements or realistic facial expressions.
Therefore, the prior art usually has the following problems: generating lip deformation coefficients is often time-consuming, so the audio information and the mouth movements of the avatar cannot be accurately synchronized, and facial expressions and body movements cannot be combined and synchronized with the current audio content.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a method for audio-driven avatar behavior, which drives the avatar to perform mouth shapes, facial expressions, and related body movements according to the semantics and context of the current audio information. Behavior model data for mouth shapes, facial expressions, and body movements are generated by preprocessing on a model generation server or module; the audio information, the corresponding text information, and the behavior model are then associated at key time points to form audio model associated content. The behavior of the avatar is driven according to the model associated content, and the audio information and the avatar behavior can be synchronized.
In order to achieve the above purpose, in the first aspect, an embodiment of the present invention provides a method for audio-driven avatar behavior, including:
receiving audio information;
generating text information according to the audio information;
generating a behavior model according to the audio information and the text information in combination with scene information;
associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
driving the behavior of the avatar according to the model associated content;
performing synchronization between the audio information and the behavior of the avatar.
进一步的,所述根据所述音频信息和文本信息结合场景信息生成行为模型,包括:Further, the generating a behavior model according to the audio information and text information combined with scene information includes:
根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及动作的行为模型。According to the received audio information and the corresponding text information, combined with the scene information, according to the behavior of the avatar, the corresponding behavior models of mouth shapes, expressions and actions are generated according to time.
进一步的,所述场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。Further, the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
进一步的,所述将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,包括:Further, associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content includes:
根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容;According to the audio information, the text information corresponding to the audio information, and the behavior model, the key time nodes corresponding to the start time and the duration are associated to form audio model associated content associated with time nodes;
所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。The associated content of the audio model includes the audio information, text information, behavior model content and association relationship.
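For illustration only, the audio model associated content described above can be pictured as a record of the following shape; this is a non-limiting sketch written in Python, and the type and field names are illustrative rather than part of the claimed format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class BehaviorEntry:
    name: str          # one lip-sync, expression or action element
    start: float       # time node: start time, in seconds
    duration: float    # time node: duration, in seconds
    data: bytes = b""  # model data used to drive the avatar

@dataclass
class AudioModelSlice:
    audio: bytes       # one audio segment
    text: str          # text corresponding to the segment
    start: float       # start time node of the segment
    duration: float    # duration of the segment
    behaviors: List[BehaviorEntry] = field(default_factory=list)

The association relation is carried by the shared start-time and duration nodes: every element of a slice refers to the same timeline, which is what later allows the audio playback and the avatar behavior to be synchronized.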
进一步的,所述根据所述模型关联内容对虚拟人像行为进行驱动,包括:Further, the driving of the behavior of the avatar according to the associated content of the model includes:
将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;Analyzing the associated content of the associated models to obtain the audio information, text information, and behavior model information, as well as the association relationship between the above information;
通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。The behavior of the avatar is driven by the behavior model information in the model-associated content.
进一步的,所述进行音频信息和所述虚拟人像行为之间的同步,包括:Further, the synchronization between the audio information and the behavior of the avatar includes:
建立具有时间节点分布的时间轴;Establish a time axis with time node distribution;
按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;Associating audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes;
通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;Synchronize the audio information and the behavior of the avatar through the audio information and the association relationship;
所述时间节点包括起始时间和持续时间对应的时间节点。The time nodes include time nodes corresponding to the start time and the duration.
进一步的,所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:Further, the synchronization between the audio information and the behavior of the avatar further includes:
将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间;dividing the audio information into a plurality of segments, each segment having a respective start time and duration;
将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步;Synchronizing the corresponding text information and the audio information through the time nodes corresponding to the start time and duration;
将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。Synchronize the first behavior information most related to the audio information in the behavior model information with the audio information through a time node corresponding to a start time and a duration.
所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:The synchronization between the audio information and the behavior of the avatar further includes:
将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;Synchronizing second behavior information in the behavior model information that is secondarily related to the audio information with the audio information at a time node corresponding to a start time and a duration;
以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;In this way, according to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration;
所述N为大于1的自然数。The N is a natural number greater than 1.
进一步的,所述方法还包括:Further, the method also includes:
在所述时间轴上设置非关联时间节点阵列;setting an array of non-associated time nodes on the time axis;
将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。The behavior information unrelated to the audio information in the behavior model information is driven according to the non-associated time node array.
进一步的,所述非关联时间节点设置成等时间间隔或离散时间间隔。Further, the non-associated time nodes are set as equal time intervals or discrete time intervals.
进一步的,所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。Further, the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information.
第二方面,本公开实施例提供一种音频驱动虚拟人像行为的装置,包括:In the second aspect, an embodiment of the present disclosure provides an audio-driven device for avatar behavior, including:
接收模块,用于接收音频信息;A receiving module, configured to receive audio information;
文本信息生成模块,用于根据所述音频信息生成文本信息;A text information generating module, configured to generate text information according to the audio information;
行为模型生成模块,用于根据所述音频信息和文本信息结合场景信息生成行为模型;A behavior model generation module, used to generate a behavior model according to the audio information and text information in combination with scene information;
音频模型关联模块,用于将所述音频信息、所述文本信息以及所述行 为模型结合时间节点进行关联,形成音频模型关联内容;The audio model association module is used to associate the audio information, the text information and the behavior model in conjunction with time nodes to form audio model associated content;
驱动模块,用于根据所述模型关联内容对虚拟人像行为进行驱动;A driving module, configured to drive the behavior of the avatar according to the associated content of the model;
同步模块,用于进行音频信息和所述虚拟人像行为之间的同步。The synchronization module is used for synchronizing the audio information and the behavior of the avatar.
第三方面,本公开实施例提供一种电子设备,包括:In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
存储器,用于存储计算机可读指令;以及memory for storing computer readable instructions; and
处理器,用于运行所述计算机可读指令,使得所述电子设备实现上述第一方面中任意一项所述的方法。A processor, configured to run the computer-readable instructions, so that the electronic device implements the method described in any one of the above first aspects.
第四方面,本公开实施例提供一种非暂态计算机可读存储介质,用于存储计算机可读指令,当所述计算机可读指令由计算机执行时,使得所述计算机实现上述第一方面中任意一项所述的方法。In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium for storing computer-readable instructions. When the computer-readable instructions are executed by a computer, the computer implements the above-mentioned first aspect. any one of the methods described.
The embodiments of the present disclosure disclose a method, an apparatus, an electronic device and a computer-readable storage medium for audio-driven avatar behavior. The method includes: receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information and the behavior model with time nodes to form audio model associated content; driving the avatar behavior according to the model associated content; and synchronizing the audio information and the avatar behavior. With the audio-driven avatar behavior method of the present disclosure, the avatar behavior can be driven in an associated manner, the audio information, the text information and the avatar behavior can be synchronized at time nodes, the audio information can be accurately synchronized with the avatar's mouth movements, and facial expressions and body movements can be combined and synchronized with the current audio content.
上述说明仅是本公开技术方案的概述,为了能更清楚了解本公开的技术手段,而可依照说明书的内容予以实施,并且为让本公开的上述和其他目的、特征和优点能够更明显易懂,以下特举较佳实施例,并配合附图,详细说明如下。The above description is only an overview of the technical solution of the present disclosure. In order to better understand the technical means of the present disclosure, it can be implemented according to the contents of the specification, and in order to make the above and other purposes, features and advantages of the present disclosure more obvious and understandable , the following preferred embodiments are specifically cited below, and are described in detail as follows in conjunction with the accompanying drawings.
Description of the Drawings
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本公开的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiment. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present disclosure. Also throughout the drawings, the same reference numerals are used to designate the same components. In the attached picture:
图1为本公开一实施例提供的音频驱动虚拟人像行为的方法的流程示意图;FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure;
图2为本公开一实施例提供的音频驱动虚拟人像行为的系统示意图;FIG. 2 is a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure;
图3为本公开一实施例提供的音频模型关联内容结构示意图;Fig. 3 is a schematic diagram of the audio model associated content structure provided by an embodiment of the present disclosure;
图4为本公开一实施例提供的音频信息的数据格式示意图;FIG. 4 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure;
图5为本公开一实施例提供的音频模型关联内容各元素进行时间同步示意图;FIG. 5 is a schematic diagram of time synchronization of elements of audio model-related content provided by an embodiment of the present disclosure;
图6为本公开另一实施例提供的音频驱动虚拟人像行为的装置示意图;FIG. 6 is a schematic diagram of an audio-driven avatar behavior device provided by another embodiment of the present disclosure;
图7为本公开另一实施例提供的电子设备的结构示意图。Fig. 7 is a schematic structural diagram of an electronic device provided by another embodiment of the present disclosure.
Detailed Description
为了能够更清楚地描述本公开的技术内容,下面结合具体实施例来进行进一步的描述。In order to describe the technical content of the present disclosure more clearly, further description will be given below in conjunction with specific embodiments.
以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present application as recited in the appended claims.
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而 且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; A more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders, and/or executed in parallel. Additionally, method embodiments may include additional steps and/or omit performing illustrated steps. The scope of the present disclosure is not limited in this respect.
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。As used herein, the term "comprise" and its variations are open-ended, ie "including but not limited to". The term "based on" is "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments." Relevant definitions of other terms will be given in the description below.
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the sequence of functions performed by these devices, modules or units or interdependence.
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。It should be noted that the modifications of "one" and "multiple" mentioned in the present disclosure are illustrative and not restrictive, and those skilled in the art should understand that unless the context clearly indicates otherwise, it should be understood as "one or more" multiple".
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。下面参考附图详细描述公开的各实施方式。The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this disclosure and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. The disclosed embodiments are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure. The method provided by this embodiment may be executed by an apparatus for audio-driven avatar behavior; the apparatus may be implemented as software, or as a combination of software and hardware, and may be integrated into a device of a system for audio-driven avatar behavior, such as a terminal device. As shown in FIG. 1, the method includes the following steps:
步骤S101:接收音频信息。Step S101: Receive audio information.
在步骤S101中,智能设备接收音频信息,此处智能设备可以为智能机器人、智能终端以及其它具有屏幕显示的智能设备等,该智能设备可显示虚拟人像,例如可呈现虚拟人像的动画效果,可自行播放虚拟人像的动画,也可与用户进行交互行为。本实施例中的智能设备以智能机器人为例,其具有拟人的形态,头部的显示屏上具有虚拟人像的五官显示,在智能设备接收到音频信息后,和配合虚拟人像中的口型,同步地将对应的语音播放出来,同时可配合机器人的虚拟人像的拟人表情,比如伤心、大笑、微笑、大哭、无奈、尴尬等表情。另外,该机器人还可实现其它行为,例如摆手、摊手、摇头、点头等,也可根据音频信息,配合虚拟人像口型和表情同步表现出来。本实施例中的接收音频信息,可以采用用户与智能机器人交互时,智能机器人实时采集用户的语音信息,将其作为音频信息的来源,也可调取外部或内部存储设备中的音频信息,音频信息的来源不限于此。In step S101, the smart device receives audio information. Here, the smart device can be a smart robot, a smart terminal, or other smart devices with screen display. Play the animation of the avatar by itself, and also interact with the user. The smart device in this embodiment takes the smart robot as an example. It has an anthropomorphic form, and the display screen of the head has the facial features of the virtual portrait. After the smart device receives the audio information, it cooperates with the mouth shape of the virtual portrait. The corresponding voice is played out synchronously, and at the same time, it can cooperate with the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassing and other expressions. In addition, the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information. The receiving audio information in this embodiment can adopt when the user interacts with the intelligent robot, the intelligent robot collects the voice information of the user in real time, and uses it as the source of the audio information, and also can call the audio information in the external or internal storage device, the audio The source of information is not limited to this.
步骤S102:根据所述音频信息生成文本信息。Step S102: Generate text information according to the audio information.
在步骤S102中,用户与智能机器人对话交互的过程中涉及音频信息的输入和接收,用户与机器人的对话中包括涉及音频的对话信息,对话信息包括对话内容的特征数据,则获取对话信息的过程即为确定对话内容的特征数据的过程:获取原始文本信息,原始文本信息为对话内容所对应的文本信息;从该原始文本信息中提取文本特征数据;将文本特征数据,作为对话内容的特征数据。In step S102, the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information, the dialogue between the user and the robot includes dialogue information related to audio, and the dialogue information includes characteristic data of the dialogue content, then the process of acquiring dialogue information It is the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting the text feature data from the original text information; using the text feature data as the feature data of the dialogue content .
具体的说,通常机器人与交互对象进行对话交互的过程为:机器人说一段话,交互对象回复该段话;或者,交互对象说一段话,而机器人对交互对 象的话进行回复;还可以是交互对象和机器人可能同时说出第一段话。因而该原始文本信息可以为该机器人产生,也可以是交互对象产生,还可以同时包括机器人和交互对象同时产生。本实施方式中,按照上述三种情况,分别介绍确定获取原始文本信息的过程:Specifically, the process of dialogue and interaction between the robot and the interactive object is usually: the robot speaks a paragraph, and the interactive object replies to the paragraph; or, the interactive object speaks a paragraph, and the robot replies to the interactive object; it can also be an interactive object and the robot may speak the first paragraph at the same time. Therefore, the original text information may be generated by the robot, or by the interactive object, or may be generated by both the robot and the interactive object. In this embodiment, according to the above three situations, the process of determining to obtain the original text information is introduced respectively:
情境一:当原始文本信息为机器人产生的对话内容所对应的文本信息。Scenario 1: When the original text information is the text information corresponding to the dialogue content generated by the robot.
获取原始文本信息,具体包括:获取机器人待播放的文本信息;并将待播放的文本信息作为原始文本信息。Obtaining the original text information specifically includes: obtaining the text information to be played by the robot; and using the text information to be played as the original text information.
情境二:当原始文本信息为交互对象产生的对话内容所对应的文本信息。Scenario 2: When the original text information is the text information corresponding to the dialogue content generated by the interactive object.
获取原始文本信息,具体包括:采集交互对象说话时发出的音频数据;对该音频数据进行语音识别,并将该语音识别结果作为原始文本信息。Obtaining the original text information specifically includes: collecting audio data emitted by the interactive object when speaking; performing speech recognition on the audio data, and using the speech recognition result as the original text information.
情境三:当原始文本信息包括机器人产生的对话内容所对应的文本信息,以及交互对象产生的对话内容的文本信息。Scenario 3: When the original text information includes the text information corresponding to the dialog content generated by the robot, and the text information of the dialog content generated by the interactive object.
The text information to be played by the robot may be obtained in the manner of Scenario 1, and the text information of the dialogue content generated by the interaction object may be obtained in the manner of Scenario 2; the text information to be played and the text information of the dialogue content generated by the interaction object are together used as the original text information. The specific acquisition processes of Scenario 1 and Scenario 2 are not repeated here.
一个具体的实现中,从原始文本信息中提取文本特征数据,具体包括:将原始文本信息输入预设的文本提取模型,获得文本特征数据,文本提取模型是根据训练库中存储的各原始文本信息,以及与各原始文本信息对应的文本特征数据训练获得。In a specific implementation, the text feature data is extracted from the original text information, which specifically includes: inputting the original text information into a preset text extraction model to obtain the text feature data. The text extraction model is based on the original text information stored in the training library. , and the text feature data corresponding to each original text information are obtained through training.
Specifically, each piece of original text information stored in the training library is used as input data of the text extraction model, and the text feature data corresponding to each piece of original text information is used as output data. A Recurrent Neural Network ("RNN") model structure can be used to train on the input data and output data to determine the text extraction model; a typical recurrent neural network is, for example, the Long Short-Term Memory ("LSTM") model architecture.
构建完该文本提取模型后,将原始文本信息输入该文本提取模型,即可得到该文本特征数据。After the text extraction model is constructed, the original text information is input into the text extraction model to obtain the text feature data.
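As a rough, non-authoritative sketch of such a text extraction model, assuming a PyTorch environment and an already-built vocabulary (the class name, dimensions and vocabulary size below are illustrative assumptions):

import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """LSTM-based extractor: original text information in, text feature data out."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # final hidden state of the LSTM
        return hidden[-1]                     # (batch, hidden_dim) text feature data

# After training on (original text, text feature) pairs from the training library,
# a tokenized sentence is fed in to obtain its feature vector:
model = TextFeatureExtractor(vocab_size=10000)
features = model(torch.randint(0, 10000, (1, 12)))  # one sentence of 12 token ids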
步骤S103:根据所述音频信息和文本信息结合场景信息生成行为模型。Step S103: Generate a behavior model according to the audio information and text information combined with scene information.
在步骤S103中,根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据音视频内容按照时间生成对应的口型,表情以及动作的行为模型。所述根据所述音频信息和文本信息结合场景信息生成行为模型,具体的,根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及动作的行为模型。其中场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。In step S103, according to the incoming audio content and corresponding text information, analyze the current semantics, context, and context scene of the avatar dialogue, and generate corresponding mouth shapes, expressions, and action behavior models according to time according to the audio and video content . The behavior model is generated according to the audio information and text information combined with scene information, specifically, based on the received audio information and the corresponding text information, combined with the scene information, and according to the behavior of the avatar according to time. Behavioral models of corresponding mouth shapes, facial expressions, and actions. The scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
The behavior model is trained in advance, before the dialogue information is acquired, from the audio information in a sample training library and the behaviors corresponding to that audio information. During dialogue interaction between the user and the robot, the dialogue content produced by one party usually affects the behavior of the other party while listening; strongly related actions, such as mouth shapes, need to be synchronized with the audio information, and secondarily related actions likewise need to be synchronized with the audio information, so the behavior model has a correspondence with the audio information. Mouth-shape training is particularly important for robot behavior: a large amount of audio-video material is usually collected, containing a large amount of audio information and the corresponding mouth-shape movements, and big-data training is performed on the correspondence between the audio and the mouth shapes of the figures in the audio-video material to obtain the corresponding mouth-shape movements. In addition, the robot's mouth-shape movements can usually be combined with expressions and body movements at the same time. Alternatively, the dialogue content produced by the robot itself also affects its own behavior.
其中,样本训练库中的各对话信息,以及与各对话信息对应的行为动作 可以采用如下方式获取:Among them, each dialogue information in the sample training library, and the behavior actions corresponding to each dialogue information can be obtained in the following ways:
A large amount of audio-video file data is collected to obtain a large amount of audio information and audio-video information, for example, 4000 audio-video files. To ensure the accuracy of the audio information in the sample training library, audio-video containing dialogue scenes may be collected, for example, audio-video files of talk shows. A talk show usually involves only two people in conversation, which is similar to the dialogue situation between the robot and the interaction object; therefore, using the talk-show audio-video files as training data allows the behavior model to be trained accurately.
Since each audio-video file contains two interaction objects and a complete dialogue scene, each file is processed as follows: the audio data belonging to interaction object A and the audio data belonging to interaction object B are collected separately by means of speech recognition, and the audio data of interaction object A and of interaction object B are each converted into text data. At the same time, the behaviors of interaction object A and the behaviors of interaction object B are collected through image analysis. It should be understood that interaction object A and interaction object B are only used to distinguish the two interaction objects within one audio-video file, and the interaction object A (or interaction object B) in different audio-video files may be different individuals.
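A minimal sketch of this data-collection step, where speech_to_text and extract_behaviors stand in for an unspecified speech-recognition engine and image-analysis pipeline (both names are assumptions of this sketch, not components named by the application):

def build_training_pairs(files, speech_to_text, extract_behaviors):
    """Assemble (audio, text, behaviors) samples from dialogue audio-video files."""
    pairs = []
    for f in files:  # e.g. the ~4000 collected talk-show files
        for speaker in ("A", "B"):  # the two interaction objects in one file
            audio = f["audio"][speaker]           # audio data of this interaction object
            text = speech_to_text(audio)          # speech recognition -> text data
            behaviors = extract_behaviors(f["video"], speaker)  # mouth/expression/action labels
            pairs.append({"audio": audio, "text": text, "behaviors": behaviors})
    return pairs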
步骤S104:将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容。Step S104: Associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
在步骤S104中,本公开实施例中,将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,具体包括:根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容,所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。其中行为模型中的行为按照与音频信息 的相关性进行类型排序,最相关的行为动作包含第一行为信息,本实施例中,与音频信息对应的口型动作为第一行为动作,与音频信息的关联性最强。与音频信息次相关的行为动作包含第二行为信息,本实施例中可将虚拟人像的表情动作作为第二行为动作,也可以采用肢体动作作为第二行为动作,其它行为以此按照与音频信息的相关度进行排序,本公开对此不做严格的限定。In step S104, in the embodiment of the present disclosure, the audio information, the text information, and the behavior model are associated with time nodes to form audio model associated content, which specifically includes: according to the audio information, the audio The text information corresponding to the information and the behavior model are associated through key time nodes corresponding to the start time and duration to form audio model associated content associated with time nodes, and the audio model associated content includes the audio information , text information, behavior model content and association relationship. Wherein, the behaviors in the behavior model are sorted according to the correlation with the audio information, and the most relevant behavior includes the first behavior information. In this embodiment, the mouth movement corresponding to the audio information is the first behavior, which is related to the audio information. the strongest correlation. The behavior related to the audio information contains the second behavior information. In this embodiment, the facial expression of the avatar can be used as the second behavior, or the body movement can be used as the second behavior. Other behaviors are based on the audio information. The correlation degree is sorted, which is not strictly limited in the present disclosure.
建立具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系,所述时间节点包括起始时间和持续时间对应的时间节点。其中,最相关的口型动作根据音频信息对应的起始时间和持续时间对应的时间点进行严格关联,依次相关的行为动作可以按照大致的时间节点与音频信息进行关联。该处的大致的时间节点可以设定一定的时间区间,例如在[-5s,+5s]、[-3s,+3s]、[-2s,+2s]、[-1s,+1s]、[-0.5s,+0.5s]等等。而不相关的行为动作可按照时间轴上的非关联时间点阵列进行关联。其中,所述非关联时间节点设置成等时间间隔或离散时间间隔。Establish a time axis with a time node distribution, and associate audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes, so The above time nodes include time nodes corresponding to the start time and the duration. Among them, the most relevant mouth movements are strictly related according to the time points corresponding to the start time and duration of the audio information, and the sequentially related behaviors can be related to the audio information according to the approximate time node. The approximate time node here can set a certain time interval, for example, in [-5s, +5s], [-3s, +3s], [-2s, +2s], [-1s, +1s], [ -0.5s,+0.5s] and so on. Unrelated actions can be related by an array of unrelated time points on the time axis. Wherein, the non-associated time nodes are set as equal time intervals or discrete time intervals.
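A hedged sketch of this time-node association, where audio segments and behavior entries are assumed to be simple dictionaries carrying "start" and "duration" values in seconds; the 0.5 s tolerance and the idle-node spacing are example values rather than prescribed ones.

def associate(audio_segments, lip_entries, related_entries, tolerance=0.5):
    """Attach behavior entries to audio segments by start-time/duration nodes."""
    timeline = []
    for seg in audio_segments:
        seg_end = seg["start"] + seg["duration"]
        linked = {"audio": seg}
        # most related (mouth shapes): strictly tied to the segment's own time span
        linked["lip"] = [e for e in lip_entries
                         if seg["start"] <= e["start"] < seg_end]
        # secondarily related (e.g. expressions): may start within a tolerance
        # window such as [-0.5 s, +0.5 s] around the segment start
        linked["related"] = [e for e in related_entries
                             if abs(e["start"] - seg["start"]) <= tolerance]
        timeline.append(linked)
    # behaviors unrelated to the audio are driven from a separate node array,
    # here at equal 2 s intervals; irregular (discrete) instants would also work
    idle_nodes = [i * 2.0 for i in range(10)]
    return timeline, idle_nodes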
步骤S105:根据所述模型关联内容对虚拟人像行为进行驱动。Step S105: Drive the behavior of the avatar according to the associated content of the model.
在步骤S105中,通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。具体包括:首先将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的驱动。按照相关度顺序,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动。然后将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息 通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;所述N为大于1的自然数。最后将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。In step S105, the behavior of the avatar is driven by the behavior model information in the model-associated content. It specifically includes: first analyzing the associated content of the model after association, obtaining the audio information, text information and behavior model information, and the association relationship between the above information; through the behavior model in the model association content The information drives the behavior of the avatar. Through the audio information and the association relationship, the driving between the audio information and the behavior of the avatar is performed. According to the order of correlation, the first behavior information most related to the audio information in the behavior model information is driven with the audio information through the time node corresponding to the start time and the duration. Then, the second behavior information related to the audio information in the behavior model information is driven through the time node corresponding to the start time and duration and the audio information; thus, the behavior model is The Nth line of information in the information that is sequentially related to the audio information is driven by the time node corresponding to the start time and the duration of the audio information; the N is a natural number greater than 1. Finally, the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
步骤S106:进行音频信息和所述虚拟人像行为之间的同步。Step S106: Perform synchronization between the audio information and the behavior of the avatar.
在步骤S106中,通过上一步骤中建立的具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;所述时间节点包括起始时间和持续时间对应的时间节点。所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间,将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;所述N为大于1的自然数。另外,在所述时间轴上设置非关联时间节点阵列;将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息; 所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。In step S106, through the time axis with time node distribution established in the previous step, the audio information, text information and behavior model information are associated according to the time nodes on the time axis through the time nodes to form the following The association relationship associated with the time node; through the audio information and the association relationship, the synchronization between the audio information and the behavior of the avatar is performed; the time node includes the time node corresponding to the start time and duration . The synchronization between the audio information and the behavior of the avatar further includes: dividing the audio information into a plurality of segments, each segment has its own start time and duration, and the corresponding text information Synchronize with the audio information through the time node corresponding to the start time and duration, and combine the first behavior information in the behavior model information most related to the audio information through the time node corresponding to the start time and duration with The audio information is synchronized. Synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and the duration; thus, according to the relevant order, the behavior model information The Nth behavior information sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and the duration; the N is a natural number greater than 1. In addition, a non-associated time node array is set on the time axis; behavior information in the behavior model information that is not related to the audio information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
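Purely as an illustration of the synchronization idea, reusing the dictionary-based timeline from the sketch above; play_audio and apply_behavior are placeholders for the playback and avatar back ends, and nothing here is the claimed implementation.

import time

def drive(timeline, play_audio, apply_behavior):
    """Walk the shared timeline and trigger each element at its own time node."""
    events = []
    for item in timeline:
        seg = item["audio"]
        events.append((seg["start"], "audio", seg))
        for entry in item.get("lip", []) + item.get("related", []):
            events.append((entry["start"], "behavior", entry))
    t0 = time.monotonic()
    for at, kind, payload in sorted(events, key=lambda e: e[0]):
        delay = at - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)            # wait for the shared time node
        if kind == "audio":
            play_audio(payload)          # start playing this audio segment
        else:
            apply_behavior(payload)      # drive mouth shape / expression / action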
FIG. 3 is a schematic diagram of the structure of the audio model associated content provided by an embodiment of the present disclosure, where the audio model associated content is the file synthesized by the audio model file generation module in FIG. 2.
Slicing is performed according to the audio content, the text and the models; each slice contains the audio, the text, the behavior model, and the association of these data at time nodes.
The audio model file synthesized in FIG. 2 generally consists of multiple slices, as shown in FIG. 3, where each slice has the same data structure but different data content. Taking slice 1 as an example, it contains:
音频片段1:为音频文件中第一片音频数据,音频片段的分片可以为固定大小,或者按照音频内容进行划分,对于划分方式不做限制;Audio segment 1: It is the first piece of audio data in the audio file. The segment of the audio segment can be a fixed size, or divided according to the audio content, and there is no restriction on the division method;
文本:为音频片段所对应的文本内容;Text: the text content corresponding to the audio clip;
行为模型:为音频片段以及场景所对应的行为模型,包括表情,肢体,嘴型模型数据,包括但不限于这几种行为模型;Behavior model: the behavior model corresponding to audio clips and scenes, including expression, body, and mouth model data, including but not limited to these types of behavior models;
开始时间:该时间为对应上述各个元素开始的时间;对应上述各个元素均有各自的起始时间,此处进行统一介绍说明;Start time: This time is the start time corresponding to each of the above elements; corresponding to each of the above elements has its own start time, here is a unified introduction;
结束时间:该时间为对应上述各个元素结束的时间;对应上述各个元素均有各自的结束时间,此处进行统一介绍说明;End time: This time is the end time corresponding to each of the above elements; corresponding to each of the above elements has its own end time, here is a unified introduction;
持续时间:该时间为对应上述各个元素持续的时间;对应上述各个元素均有各自的持续时间,此处进行统一介绍说明。Duration: This time is the duration corresponding to each of the above elements; corresponding to each of the above elements has its own duration, here is a unified introduction.
FIG. 4 is a schematic diagram of the data format of the audio information provided by an embodiment of the present disclosure.
The data format of the audio information includes: a segment index, a segment size, a start time, an end time, text information, an audio size, audio information, and model content. The model content includes the behavior information and the association relations, and the description here focuses on the model content. Since the composition of the audio model associated content has already been explained with reference to FIG. 3, the definition and slicing of the model content are explained and described in detail below.
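As a hedged illustration of how one such segment record might be serialized in this field order (the byte order, field widths and time unit are assumptions of this sketch, not the claimed format):

import struct

def pack_segment(index, start_ms, end_ms, text, audio, model_content):
    """segment index, segment size, start/end time, text, audio size, audio, model content"""
    text_b = text.encode("utf-8")
    body = (struct.pack("<I", len(text_b)) + text_b    # text information
            + struct.pack("<I", len(audio)) + audio    # audio size + audio information
            + model_content)                           # serialized model content
    header = struct.pack("<IIQQ", index, len(body), start_ms, end_ms)
    return header + body

# e.g. one 0-500 ms segment carrying the character "你", raw audio bytes and empty model content
record = pack_segment(1, 0, 500, "你", b"\x00" * 320, b"{}")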
对于模型内容的定义:For the definition of model content:
(The definition of the model content is given as structured data in figures PCTCN2022084697-appb-000001 to PCTCN2022084697-appb-000003 of the original application.)
其中MoudleData中包含LipSync(嘴型模型),Expression(表情模型),Action(肢体动作模型),其中每个模型当中会进一步说明该模型详细内容:Name(细分动作的名称),Start(起始时间),End(结束时间),以及Data(该细分动作的模型数据)。Among them, MoudleData contains LipSync (mouth model), Expression (expression model), Action (body movement model), and each model will further explain the detailed content of the model: Name (the name of the subdivision action), Start (the start time), End (end time), and Data (model data of the subdivision action).
以图中的LipSync为例,其中“LipSync”为嘴型模型的名称,该嘴型模型中包含了多个不同的嘴型模型元素X,Y等等,每个模型元素包含:Take LipSync in the figure as an example, where "LipSync" is the name of the mouth model, which contains a number of different mouth model elements X, Y, etc., each model element contains:
Name:模型名称,区分不同的模型;Name: model name, to distinguish different models;
Start:对应模型开始的时间;Start: the time corresponding to the start of the model;
End:对应模型结束的时间;End: the time corresponding to the end of the model;
Data:对应的模型数据;Data: the corresponding model data;
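Because the definition itself is only available as figures in the original application, the following is a hedged reconstruction of its likely shape based on the field descriptions above; the element names, time values and time unit are invented for illustration, and the variable name keeps the application's own spelling "MoudleData".

moudle_data = {
    "LipSync": [     # mouth-shape model
        {"Name": "X", "Start": 0,   "End": 120, "Data": "<mouth-shape model data>"},
        {"Name": "Y", "Start": 120, "End": 260, "Data": "<mouth-shape model data>"},
    ],
    "Expression": [  # expression model
        {"Name": "smile", "Start": 0, "End": 500, "Data": "<expression model data>"},
    ],
    "Action": [      # body-movement model
        {"Name": "wave", "Start": 0, "End": 800, "Data": "<body-movement model data>"},
    ],
}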
FIG. 5 shows a schematic diagram of time synchronization among the elements of the audio model associated content provided by an embodiment of the present disclosure. The figure illustrates how the elements of the audio model file are synchronized and includes the time (the working time of the audio model file), the audio (the audio segments), the mouth shapes, the text, the body movements, the expressions, and the distribution of these elements on the time axis.
Taking "你好" ("hello") as an example, the corresponding audio is divided into two segments by character, and each segment has its own time description (start time and duration). The corresponding text is associated with the audio and carries the same time description. The mouth-shape models are related to the corresponding audio content; for example, the first audio segment in FIG. 5 corresponds to mouth-shape models 1, 2 and 3. A waving action is performed during the playback of the first audio segment, and a blinking behavior occurs at the same time. All elements work on the same time axis.
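Putting the "你好" example into the same illustrative structures (every time value below is invented; only the relationships between the elements mirror the figure):

hello_timeline = [
    {   # first character "你"
        "audio":      {"start": 0.0, "duration": 0.4},
        "text":       {"value": "你", "start": 0.0, "duration": 0.4},
        "lip":        [{"Name": "1", "Start": 0.0,  "End": 0.15},
                       {"Name": "2", "Start": 0.15, "End": 0.3},
                       {"Name": "3", "Start": 0.3,  "End": 0.4}],
        "action":     [{"Name": "wave",  "Start": 0.0, "End": 0.4}],
        "expression": [{"Name": "blink", "Start": 0.1, "End": 0.2}],
    },
    {   # second character "好"
        "audio": {"start": 0.4, "duration": 0.4},
        "text":  {"value": "好", "start": 0.4, "duration": 0.4},
        "lip":   [{"Name": "4", "Start": 0.4, "End": 0.8}],
    },
]
# All elements refer to the same time axis, so playback and behavior stay aligned.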
FIG. 6 is a schematic diagram of an apparatus for audio-driven avatar behavior provided by another embodiment of the present disclosure. The apparatus for audio-driven avatar behavior includes: a receiving module 501, a text information generation module 502, a behavior model generation module 503, an audio model association module 504, a driving module 505 and a synchronization module 506, wherein:
所述接收模块501,用于接收音频信息。The receiving module 501 is configured to receive audio information.
智能设备接收音频信息,本实施例中的智能设备以智能机器人为例,其具有拟人的形态,头部的显示屏上具有虚拟人像的五官显示,在智能设备接收到音频信息后,和配合虚拟人像中的口型,同步地将对应的语音播放出来,同时可配合机器人的虚拟人像的拟人表情,比如伤心、大笑、微笑、大哭、无奈、尴尬等表情。另外,该机器人还可实现其它行为,例如摆手、摊手、摇头、点头等,也可根据音频信息,配合虚拟人像口型和表情同步表现出来。本实施例中的接收音频信息,可以采用用户与智能机器人交互时,智能机器人实时采集用户的语音信息,将其作为音频信息的来源,也可调取外部或内部存储设备中的音频信息,音频信息的来源不限于此。The smart device receives audio information. The smart device in this embodiment takes an intelligent robot as an example. It has an anthropomorphic form, and the facial features of a virtual person are displayed on the display screen of the head. After the smart device receives the audio information, it cooperates with the virtual The mouth shape in the portrait will play the corresponding voice synchronously, and at the same time, it can match the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassing and other expressions. In addition, the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information. The receiving audio information in this embodiment can adopt when the user interacts with the intelligent robot, the intelligent robot collects the voice information of the user in real time, and uses it as the source of the audio information, and also can call the audio information in the external or internal storage device, the audio The source of information is not limited to this.
所述文本信息生成模块502,用于根据所述音频信息生成文本信息。The text information generation module 502 is configured to generate text information according to the audio information.
用户与智能机器人对话交互的过程中涉及音频信息的输入和接收,用户与机器人的对话中包括涉及音频的对话信息,对话信息包括对话内容的特征数据,则获取对话信息的过程即为确定对话内容的特征数据的过程:获取原始文本信息,原始文本信息为对话内容所对应的文本信息;从该原始文本信息中提取文本特征数据;将文本特征数据,作为对话内容的特征数据。The process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information. The dialogue between the user and the robot includes dialogue information involving audio, and the dialogue information includes the characteristic data of the dialogue content. The process of obtaining dialogue information is to determine the dialogue content. The process of feature data: obtaining original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; using the text feature data as feature data of the dialogue content.
所述行为模型生成模块503,用于根据所述音频信息和文本信息结合场景信息生成行为模型。The behavior model generation module 503 is configured to generate a behavior model according to the audio information and text information combined with scene information.
根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据音视频内容按照时间生成对应的口型,表情以及动作的行为模型。所述根据所述音频信息和文本信息结合场景信息生成行为模型,具体的,根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及 动作的行为模型。其中场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。According to the incoming audio content and the corresponding text information, analyze the current semantics, context and the context scene of the avatar dialogue, and generate the corresponding mouth shape, expression and action behavior model according to the time according to the audio and video content. The behavior model is generated according to the audio information and text information combined with scene information, specifically, based on the received audio information and the corresponding text information, combined with the scene information, and according to the behavior of the avatar according to time. Behavioral models of corresponding mouth shapes, facial expressions, and actions. The scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
所述音频模型关联模块504,用于将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容。The audio model association module 504 is configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,具体包括:根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容,所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。其中行为模型中的行为按照与音频信息的相关性进行类型排序,最相关的行为动作包含第一行为信息,本实施例中,与音频信息对应的口型动作为第一行为动作,与音频信息的关联性最强。与音频信息次相关的行为动作包含第二行为信息,本实施例中可将虚拟人像的表情动作作为第二行为动作,也可以采用肢体动作作为第二行为动作,其它行为以此按照与音频信息的相关度进行排序,本公开对此不做严格的限定。建立具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系,所述时间节点包括起始时间和持续时间对应的时间节点。Associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content, specifically including: according to the audio information, the text information corresponding to the audio information, and the behavior model, The key time nodes corresponding to the start time and duration are associated to form audio model associated content associated with time nodes, and the audio model associated content includes the audio information, text information, behavior model content, and association relationship. Wherein the behaviors in the behavior model are sorted according to the correlation with the audio information, and the most relevant behavior action includes the first behavior information. In this embodiment, the mouth movement corresponding to the audio information is the first behavior action, which is related to the audio information. the strongest correlation. The behaviors related to the audio information include the second behavior information. In this embodiment, the facial expressions of the avatar can be used as the second behavior, or body movements can be used as the second behavior. Other behaviors are based on the audio information. The correlation degree is sorted, which is not strictly limited in the present disclosure. Establish a time axis with a time node distribution, and associate audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes, so The above time nodes include time nodes corresponding to the start time and the duration.
所述驱动模块505,用于根据所述模型关联内容对虚拟人像行为进行驱动。The driving module 505 is configured to drive the behavior of the avatar according to the associated content of the model.
首先将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的驱动。按照相关度顺序,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始 时间和持续时间对应的时间节点与所述音频信息进行驱动。然后将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;所述N为大于1的自然数。最后将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。Firstly, analyze the associated content of the model after association, obtain the audio information, text information and behavior model information, and the association relationship between the above information; use the behavior model information in the model associated content to virtual Portrait behavior is driven. Through the audio information and the association relationship, the driving between the audio information and the behavior of the avatar is performed. According to the order of correlation, drive the first behavior information in the behavior model information most related to the audio information through the time node corresponding to the start time and duration and the audio information. Then, the second behavior information related to the audio information in the behavior model information is driven through the time node corresponding to the start time and duration and the audio information; thus, the behavior model is The Nth line of information in the information that is sequentially related to the audio information is driven by the time node corresponding to the start time and the duration of the audio information; the N is a natural number greater than 1. Finally, the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
所述同步模块506,用于进行音频信息和所述虚拟人像行为之间的同步。The synchronization module 506 is used for synchronizing the audio information and the behavior of the avatar.
按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;所述时间节点包括起始时间和持续时间对应的时间节点。According to the time nodes on the time axis, audio information, text information and behavior model information are associated through the time nodes to form an association relationship associated with the time nodes; through the audio information and the association relationship , performing synchronization between the audio information and the behavior of the avatar; the time node includes a time node corresponding to a start time and a duration.
所述同步模块506,进一步用于:将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间,将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。The synchronization module 506 is further configured to: divide the audio information into a plurality of segments, each segment has its own start time and duration, and pass the corresponding text information and the audio information through the start time The time node corresponding to the duration is synchronized, and the first behavior information most related to the audio information in the behavior model information is synchronized with the audio information through the time node corresponding to the start time and the duration.
所述同步模块506,进一步用于:将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述 音频信息进行同步;所述N为大于1的自然数。The synchronization module 506 is further configured to: synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and duration; According to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration; the N is a natural number greater than 1 .
所述同步模块506,进一步用于:在所述时间轴上设置非关联时间节点阵列;将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。The synchronization module 506 is further configured to: set a non-associated time node array on the time axis; drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array . The non-associated time nodes are set as equal time intervals or discrete time intervals.
所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
The apparatus shown in FIG. 6 can execute the method of the embodiment shown in FIG. 1; for parts of this embodiment not described in detail, reference may be made to the relevant description of the embodiment shown in FIG. 1. For the execution process and technical effects of this technical solution, refer to the description of the embodiment shown in FIG. 1, which is not repeated here.
FIG. 2 shows a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure. Read together with the apparatus diagram of FIG. 6 and according to the logical relationships between the modules, the system diagram shows the logical relationships among the behavior model generation module, the audio model file generation module, the audio model parsing module and the avatar behavior driving module.
行为模型生成模块,根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据视频内容按照时间生成对应的口型,表情以及动作的行为模型。The behavior model generation module analyzes the current semantics, context and avatar dialogue context scene according to the incoming audio content and corresponding text information, and generates corresponding mouth shapes, expressions and action behavior models according to time according to the video content.
The audio model file generation module associates the audio, the text and the models according to the audio content, the text corresponding to the audio content, and the model data generated by the model generation module, using key time nodes such as the start time and duration of the node corresponding to each element during audio playback, so as to form an audio model file associated by time nodes. The file includes the time nodes (start time, duration), the audio file, the text and the model information. The audio model file generation module corresponds to part of the functions of the text information generation module and the audio model association module in FIG. 6.
The audio model file parsing module parses the audio model file to obtain the audio content, the model content, the text content, and the associations among them, including but not limited to time associations. This audio model file parsing module corresponds to the audio model association module in FIG. 5.
The avatar behavior driving module drives the avatar behavior according to the model content obtained by the audio model parsing module. Synchronization between audio playback and the action model is performed through the association between the audio content and the model. This avatar behavior driving module corresponds to the driving module in FIG. 5.
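One possible realization of this synchronization, sketched under the assumption that the segments follow the hypothetical file layout shown earlier and that apply_behavior is whatever engine-specific call actually animates the avatar, is to fire each behavior when the playback clock reaches its start node:

```python
import time
from typing import Callable, Dict, List


def drive_avatar(parsed_segments: List[Dict],
                 apply_behavior: Callable[[Dict], None],
                 playback_clock: Callable[[], float] = time.monotonic) -> None:
    """Fire each behavior when the audio playback clock reaches its start time node."""
    t0 = playback_clock()
    pending = sorted(
        (b for seg in parsed_segments for b in seg["behaviors"]),
        key=lambda b: b["start_s"],
    )
    for behavior in pending:
        # Wait until the audio timeline reaches this behavior's start node.
        while playback_clock() - t0 < behavior["start_s"]:
            time.sleep(0.005)
        apply_behavior(behavior)  # e.g. set a viseme, blend an expression, start an action


# Hypothetical usage with the structure from the previous sketch:
# drive_avatar(audio_model_file["segments"], apply_behavior=print)
```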
The apparatus for audio-driven avatar behavior further includes:
An audio playing module, which plays the audio content parsed by the above audio model file parsing module.
A text display module, which displays the text content parsed by the above audio model file parsing module.
Referring now to FIG. 7, a schematic structural diagram of an electronic device 700 suitable for implementing another embodiment of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 7 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 700 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another through a communication line 704. An input/output (I/O) interface 705 is also connected to the communication line 704.
Generally, the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 708 including, for example, a magnetic tape and a hard disk; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows the electronic device 700 with various devices, it should be understood that it is not required to implement or provide all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 709, installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing device 701, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, RF (radio frequency), and the like, or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or may exist independently without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to execute the interaction method in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet by using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, and the module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not constitute a limitation on the unit itself under certain circumstances.
The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute any one of the methods in the foregoing first aspect.
According to one or more embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute any one of the methods in the foregoing first aspect.
The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.

Claims (25)

  1. A method for audio-driven avatar behavior, characterized in that it comprises:
    receiving audio information;
    generating text information according to the audio information;
    generating a behavior model according to the audio information and the text information in combination with scene information;
    associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
    driving avatar behavior according to the model associated content; and
    performing synchronization between the audio information and the avatar behavior.
  2. The method according to claim 1, characterized in that the generating a behavior model according to the audio information and the text information in combination with scene information comprises:
    according to the received audio information and the corresponding text information, in combination with the scene information, generating corresponding behavior models of mouth shapes, expressions, and actions over time for the avatar behavior.
  3. The method according to claim 2, characterized in that the scene information comprises the semantics and context of the audio information and the contextual scene of the avatar behavior.
  4. The method according to claim 1, characterized in that the associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content comprises:
    according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through key time nodes corresponding to the start time and the duration, to form audio model associated content associated by time nodes;
    the audio model associated content comprising the audio information, the text information, the behavior model content, and the association relationship.
  5. The method according to claim 1, characterized in that the driving avatar behavior according to the model associated content comprises:
    parsing the associated model associated content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship among the above information;
    driving the avatar behavior through the behavior model information in the model associated content.
  6. The method according to claim 1, characterized in that the performing synchronization between the audio information and the avatar behavior comprises:
    establishing a time axis with a distribution of time nodes;
    associating the audio information, the text information, and the behavior model information through the time nodes on the time axis, to form an association relationship associated by the time nodes;
    performing synchronization between the audio information and the avatar behavior through the audio information and the association relationship;
    the time nodes comprising time nodes corresponding to the start time and the duration.
  7. The method according to claim 6, characterized in that the performing synchronization between the audio information and the avatar behavior further comprises:
    dividing the audio information into a plurality of segments, each segment having its own start time and duration;
    synchronizing the corresponding text information with the audio information through time nodes corresponding to the start time and the duration;
    synchronizing first behavior information in the behavior model information that is most relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration.
  8. The method according to claim 7, characterized in that the performing synchronization between the audio information and the avatar behavior further comprises:
    synchronizing second behavior information in the behavior model information that is secondly relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    in this manner, in order of relevance, synchronizing Nth behavior information in the behavior model information that is sequentially relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    N being a natural number greater than 1.
  9. The method according to claim 8, characterized in that the method further comprises:
    setting a non-associated time node array on the time axis;
    driving behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array.
  10. The method according to claim 9, characterized in that the non-associated time nodes are set at equal time intervals or discrete time intervals.
  11. The method according to claim 8, characterized in that the most relevant first behavior information is mouth shape behavior information; the secondly relevant second behavior information is expression behavior information; and the sequentially relevant Nth behavior information comprises body movement behavior information.
  12. An apparatus for audio-driven avatar behavior, characterized in that it comprises:
    a receiving module, configured to receive audio information;
    a text information generation module, configured to generate text information according to the audio information;
    a behavior model generation module, configured to generate a behavior model according to the audio information and the text information in combination with scene information;
    an audio model association module, configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
    a driving module, configured to drive avatar behavior according to the model associated content; and
    a synchronization module, configured to perform synchronization between the audio information and the avatar behavior.
  13. The apparatus according to claim 12, characterized in that the behavior model generation module is specifically configured to:
    according to the received audio information and the corresponding text information, in combination with the scene information, generate corresponding behavior models of mouth shapes, expressions, and actions over time for the avatar behavior.
  14. The apparatus according to claim 13, characterized in that the scene information comprises the semantics and context of the audio information and the contextual scene of the avatar behavior.
  15. The apparatus according to claim 12, characterized in that the audio model association module is specifically configured to:
    according to the audio information, the text information corresponding to the audio information, and the behavior model, perform association through key time nodes corresponding to the start time and the duration, to form audio model associated content associated by time nodes;
    the audio model associated content comprising the audio information, the text information, the behavior model content, and the association relationship.
  16. The apparatus according to claim 12, characterized in that the driving module is specifically configured to:
    parse the associated model associated content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship among the above information;
    drive the avatar behavior through the behavior model information in the model associated content.
  17. The apparatus according to claim 12, characterized in that the synchronization module is specifically configured to:
    establish a time axis with a distribution of time nodes;
    associate the audio information, the text information, and the behavior model information through the time nodes on the time axis, to form an association relationship associated by the time nodes;
    perform synchronization between the audio information and the avatar behavior through the audio information and the association relationship;
    the time nodes comprising time nodes corresponding to the start time and the duration.
  18. The apparatus according to claim 17, characterized in that the synchronization module is further configured to:
    divide the audio information into a plurality of segments, each segment having its own start time and duration;
    synchronize the corresponding text information with the audio information through time nodes corresponding to the start time and the duration;
    synchronize first behavior information in the behavior model information that is most relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration.
  19. The apparatus according to claim 18, characterized in that the synchronization module is further configured to:
    synchronize second behavior information in the behavior model information that is secondly relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    in this manner, in order of relevance, synchronize Nth behavior information in the behavior model information that is sequentially relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    N being a natural number greater than 1.
  20. The apparatus according to claim 19, characterized in that the synchronization module is further configured to:
    set a non-associated time node array on the time axis;
    drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array.
  21. The apparatus according to claim 20, characterized in that the non-associated time nodes are set at equal time intervals or discrete time intervals.
  22. The apparatus according to claim 19, characterized in that the most relevant first behavior information is mouth shape behavior information; the secondly relevant second behavior information is expression behavior information; and the sequentially relevant Nth behavior information comprises body movement behavior information.
  23. An electronic device, comprising:
    a memory, configured to store computer-readable instructions; and
    a processor, configured to run the computer-readable instructions, so that the electronic device implements the method according to any one of claims 1-11.
  24. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to execute the steps of the method for audio-driven avatar behavior according to any one of claims 1-11.
  25. A computer program, comprising instructions which, when run on a computer, cause the computer to execute the method for audio-driven avatar behavior according to any one of claims 1-11.
PCT/CN2022/084697 2021-08-03 2022-03-31 Method and apparatus for audio driving of avatar, and electronic device WO2023010873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110888459.X 2021-08-03
CN202110888459.XA CN115220682A (en) 2021-08-03 2021-08-03 Method and device for driving virtual portrait by audio and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023010873A1 true WO2023010873A1 (en) 2023-02-09

Family

ID=83605992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084697 WO2023010873A1 (en) 2021-08-03 2022-03-31 Method and apparatus for audio driving of avatar, and electronic device

Country Status (2)

Country Link
CN (1) CN115220682A (en)
WO (1) WO2023010873A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778041B (en) * 2023-08-22 2023-12-12 北京百度网讯科技有限公司 Multi-mode-based face image generation method, model training method and equipment
CN117058286B (en) * 2023-10-13 2024-01-23 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN112667068A (en) * 2019-09-30 2021-04-16 北京百度网讯科技有限公司 Virtual character driving method, device, equipment and storage medium
US20210192824A1 (en) * 2018-07-10 2021-06-24 Microsoft Technology Licensing, Llc Automatically generating motions of an avatar
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210192824A1 (en) * 2018-07-10 2021-06-24 Microsoft Technology Licensing, Llc Automatically generating motions of an avatar
CN112667068A (en) * 2019-09-30 2021-04-16 北京百度网讯科技有限公司 Virtual character driving method, device, equipment and storage medium
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Also Published As

Publication number Publication date
CN115220682A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
US11158102B2 (en) Method and apparatus for processing information
WO2023010873A1 (en) Method and apparatus for audio driving of avatar, and electronic device
WO2022121601A1 (en) Live streaming interaction method and apparatus, and device and medium
US20240107127A1 (en) Video display method and apparatus, video processing method, apparatus, and system, device, and medium
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
WO2023125374A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
JP2022551660A (en) SCENE INTERACTION METHOD AND DEVICE, ELECTRONIC DEVICE AND COMPUTER PROGRAM
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
US20230421716A1 (en) Video processing method and apparatus, electronic device and storage medium
WO2023083142A1 (en) Sentence segmentation method and apparatus, storage medium, and electronic device
CN109600559B (en) Video special effect adding method and device, terminal equipment and storage medium
CN110753238A (en) Video processing method, device, terminal and storage medium
WO2021057740A1 (en) Video generation method and apparatus, electronic device, and computer readable medium
EP4343614A1 (en) Information processing method and apparatus, device, readable storage medium and product
CN112364144B (en) Interaction method, device, equipment and computer readable medium
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
US20240038273A1 (en) Video generation method and apparatus, electronic device, and storage medium
JP6949931B2 (en) Methods and devices for generating information
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
WO2023065963A1 (en) Interactive display method and apparatus, electronic device, and storage medium
CN115222857A (en) Method, apparatus, electronic device and computer readable medium for generating avatar
CN112235183B (en) Communication message processing method and device and instant communication client

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851598

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE