WO2023010873A1 - Method and apparatus for audio driving of avatar, and electronic device - Google Patents

Method and apparatus for audio driving of avatar, and electronic device

Info

Publication number
WO2023010873A1
Authority
WO
WIPO (PCT)
Prior art keywords
information
behavior
audio
model
time
Prior art date
Application number
PCT/CN2022/084697
Other languages
French (fr)
Chinese (zh)
Inventor
祝丰年
张保
Original Assignee
达闼机器人股份有限公司
Priority date
Filing date
Publication date
Application filed by 达闼机器人股份有限公司
Publication of WO2023010873A1 publication Critical patent/WO2023010873A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/16Sound input; Sound output
    • G06F3/167Audio in a user interface, e.g. using voice commands for navigating, audio feedback
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0487Interaction techniques based on graphical user interfaces [GUI] using specific features provided by the input device, e.g. functions controlled by the rotation of a mouse with dual sensing arrangements, or of the nature of the input device, e.g. tap gestures based on pressure sensed by a digitiser

Definitions

  • The present disclosure relates to the field of avatars, and in particular to a method, an apparatus, and an electronic device for audio-driven avatars.
  • The purpose of the embodiments of the present invention is to provide a method for audio-driven avatar behavior, which drives the avatar to perform mouth shapes, facial expressions, and related body movements according to the semantics and context of the current audio information.
  • Behavior model data for mouth shapes, facial expressions, and body movements are generated by preprocessing on a model generation server or module; the audio information, the corresponding text information, and the behavior model are then associated at key time points to form audio model associated content.
  • The behavior of the avatar is driven according to the model associated content, and the audio information and the avatar behavior can be synchronized.
  • an embodiment of the present invention provides a method for audio-driven avatar behavior, including:
  • receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content; driving the behavior of the avatar according to the model associated content; and performing synchronization between the audio information and the behavior of the avatar.
  • the generating a behavior model according to the audio information and text information combined with scene information includes:
  • the corresponding behavior models of mouth shapes, expressions and actions are generated according to time.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content includes:
  • according to the audio information, the text information corresponding to the audio information, and the behavior model, the key time nodes corresponding to the start time and the duration are used for association, to form audio model associated content associated with time nodes;
  • the associated content of the audio model includes the audio information, text information, behavior model content and association relationship.
  • the driving of the behavior of the avatar according to the associated content of the model includes:
  • the behavior of the avatar is driven by the behavior model information in the model-associated content.
  • the synchronization between the audio information and the behavior of the avatar includes:
  • the time nodes include time nodes corresponding to the start time and the duration.
  • the synchronization between the audio information and the behavior of the avatar further includes:
  • each segment having a respective start time and duration
  • the synchronization between the audio information and the behavior of the avatar further includes:
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration;
  • the N is a natural number greater than 1.
  • the method also includes:
  • the behavior information unrelated to the audio information in the behavior model information is driven according to the non-associated time node array.
  • non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information.
  • An apparatus for audio-driven avatar behavior, including:
  • a receiving module configured to receive audio information
  • a text information generating module configured to generate text information according to the audio information
  • a behavior model generation module used to generate a behavior model according to the audio information and text information in combination with scene information
  • the audio model association module is used to associate the audio information, the text information and the behavior model in conjunction with time nodes to form audio model associated content;
  • a driving module configured to drive the behavior of the avatar according to the associated content of the model
  • the synchronization module is used for synchronizing the audio information and the behavior of the avatar.
  • an electronic device, including: a memory configured to store computer-readable instructions; and
  • a processor configured to run the computer-readable instructions, so that the electronic device implements the method described in any one of the above first aspects.
  • an embodiment of the present disclosure provides a non-transitory computer-readable storage medium for storing computer-readable instructions.
  • When the computer-readable instructions are executed by a computer, the computer implements the method described in any one of the above first aspects.
  • The embodiment of the present disclosure discloses a method, an apparatus, an electronic device, and a computer-readable storage medium for audio-driven avatar behavior, wherein the method includes: receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content; driving the behavior of the avatar according to the model associated content; and performing synchronization between the audio information and the behavior of the avatar.
  • With the audio-driven method of avatar behavior disclosed in the present disclosure, the behavior of the avatar can be driven in an associated manner, the audio information, the text information, and the avatar behavior can be synchronized at time nodes, the audio information can be accurately synchronized with the mouth movements of the avatar, and facial expressions and body movements can be combined and synchronized with the current audio content.
  • FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure
  • FIG. 2 is a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure
  • Fig. 3 is a schematic diagram of the audio model associated content structure provided by an embodiment of the present disclosure.
  • FIG. 4 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure.
  • FIG. 5 is a schematic diagram of time synchronization of elements of audio model-related content provided by an embodiment of the present disclosure
  • FIG. 6 is a schematic diagram of an audio-driven avatar behavior device provided by another embodiment of the present disclosure.
  • Fig. 7 is a schematic structural diagram of an electronic device provided by another embodiment of the present disclosure.
  • the term “comprise” and its variations are open-ended, i.e., “including but not limited to”.
  • the term “based on” is “based at least in part on”.
  • the term “one embodiment” means “at least one embodiment”; the term “another embodiment” means “at least one further embodiment”; the term “some embodiments” means “at least some embodiments.” Relevant definitions of other terms will be given in the description below.
  • Fig. 1 is a schematic flow chart of the method for audio-driven avatar behavior provided by an embodiment of the present disclosure.
  • the audio-driven avatar behavior method provided in this embodiment can be executed by an audio-driven avatar behavior apparatus, which can be implemented as software or as a combination of software and hardware; the apparatus may be integrated in a device of the audio-driven avatar behavior system, such as a terminal device.
  • the method includes the following steps:
  • Step S101 Receive audio information.
  • the smart device receives audio information.
  • the smart device can be a smart robot, a smart terminal, or another smart device with a display screen; it can play the avatar animation by itself and can also interact with the user.
  • the smart device in this embodiment takes a smart robot as an example. It has an anthropomorphic form, and the display screen of its head shows the facial features of the virtual portrait.
  • after the smart device receives the audio information, it coordinates the mouth shape of the virtual portrait.
  • the corresponding voice is played synchronously, and at the same time the robot's virtual portrait can show anthropomorphic expressions, such as sad, laughing, smiling, crying, helpless, embarrassed, and other expressions.
  • the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information.
  • the audio information received in this embodiment may be the user's voice collected in real time by the intelligent robot while the user interacts with it, which is used as the source of the audio information; audio information stored in an external or internal storage device may also be called, and the source of the audio information is not limited to this.
  • Step S102 Generate text information according to the audio information.
  • in step S102, the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information. The dialogue between the user and the robot includes dialogue information related to audio, and the dialogue information includes characteristic data of the dialogue content. The process of acquiring the dialogue information is therefore the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; and using the text feature data as the feature data of the dialogue content.
  • the process of dialogue and interaction between the robot and the interactive object is usually: the robot speaks a paragraph and the interactive object replies; or the interactive object speaks a paragraph and the robot replies; the interactive object and the robot may also speak at the same time. Therefore, the original text information may be generated by the robot, by the interactive object, or by both the robot and the interactive object.
  • the process of obtaining the original text information is described for each scenario below:
  • Scenario 1: the original text information is the text information corresponding to the dialogue content generated by the robot.
  • Obtaining the original text information specifically includes: obtaining the text information to be played by the robot, and using the text information to be played as the original text information.
  • Scenario 2: the original text information is the text information corresponding to the dialogue content generated by the interactive object.
  • Obtaining the original text information specifically includes: collecting the audio data emitted by the interactive object when speaking, performing speech recognition on the audio data, and using the speech recognition result as the original text information.
  • Scenario 3: the original text information includes both the text information corresponding to the dialogue content generated by the robot and the text information of the dialogue content generated by the interactive object.
  • the text information to be played by the robot can be obtained according to the method in Scenario 1, the text information of the dialogue content generated by the interactive object can be obtained according to the method in Scenario 2, and the two are taken together as the original text information (see the sketch after these scenarios); the specific acquisition processes of Scenario 1 and Scenario 2 are not repeated here.
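  • As a combined illustration of the three scenarios above (a non-authoritative sketch; the names robot_text, user_audio, and asr_transcribe are assumptions for illustration, not part of the disclosure), the original text information can be assembled as follows:

```python
from typing import Callable, Optional

def get_original_text(robot_text: Optional[str],
                      user_audio: Optional[bytes],
                      asr_transcribe: Callable[[bytes], str]) -> str:
    """Assemble the original text information for the three scenarios."""
    parts = []
    if robot_text:                        # Scenario 1: text the robot is about to play
        parts.append(robot_text)
    if user_audio is not None:            # Scenario 2: speech recognition on the user's audio
        parts.append(asr_transcribe(user_audio))
    return " ".join(parts)                # Scenario 3: both sides taken together
```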
  • the text feature data is extracted from the original text information, which specifically includes: inputting the original text information into a preset text extraction model to obtain the text feature data.
  • the text extraction model is obtained through training based on the original text information stored in the training library and the text feature data corresponding to each piece of original text information.
  • the original text information stored in the training library is used as the input data of the text extraction model, the text feature data corresponding to each piece of original text information is used as the output data, and a Recurrent Neural Network (RNN) model structure is used to train on the input data and output data to determine the text extraction model; a typical recurrent neural network is, for example, the Long Short-Term Memory (LSTM) model architecture.
  • the original text information is input into the text extraction model to obtain the text feature data.
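  • A minimal PyTorch sketch of such a text extraction model is given below, assuming an LSTM that maps tokenized original text to a fixed-size text feature vector; the vocabulary size and dimensions are illustrative assumptions, and the training loop is omitted.

```python
import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """LSTM-based text extraction model: original text tokens -> text feature data."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden_dim=256, feat_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.proj = nn.Linear(hidden_dim, feat_dim)

    def forward(self, token_ids):                   # token_ids: (batch, seq_len)
        x = self.embed(token_ids)
        _, (h_n, _) = self.lstm(x)                   # final hidden state of the LSTM
        return self.proj(h_n[-1])                    # (batch, feat_dim) text feature data

# Example usage with random token ids:
features = TextFeatureExtractor()(torch.randint(0, 10000, (1, 12)))
```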
  • Step S103 Generate a behavior model according to the audio information and text information combined with scene information.
  • in step S103, the current semantics, context, and context scene of the avatar dialogue are analyzed according to the incoming audio content and the corresponding text information, and the corresponding mouth shape, expression, and action behavior models are generated over time according to the audio content.
  • the behavior model is generated according to the audio information and the text information in combination with scene information; specifically, based on the received audio information and the corresponding text information, combined with the scene information, behavior models of the corresponding mouth shapes, facial expressions, and actions are generated over time for the avatar behavior.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • the behavior model is obtained in advance according to the audio information in the sample training library and the behavior actions corresponding to the audio information before acquiring the dialogue information.
  • the dialogue content produced by one party of the interaction will affect the behavior of the other party while the other party listens. Strongly related actions, such as mouth shapes, need to be synchronized with the audio information, and secondarily related actions also need to be synchronized with the audio information, so the behavior model has a corresponding relationship with the audio information.
  • mouth-shape training is particularly important for robot behavior. Usually a large amount of audio and video information is collected, which contains a large amount of audio information and the corresponding mouth movements.
  • big-data training can be carried out based on the relationship between the audio and the mouth shape, and the corresponding mouth movement can be obtained.
  • the mouth movement of the robot can usually be combined with facial expressions and body movements at the same time.
  • the dialogue content that the robot sends to the interactive object will also have an impact on the robot's own behavior.
  • each dialogue information in the sample training library and the behavior actions corresponding to each dialogue information can be obtained in the following ways:
  • A large amount of audio information and audio and video information is obtained from audio and video file data, for example by collecting 4000 audio and video files.
  • Audio and video containing dialogue scenes can be collected, for example audio and video files of talk shows.
  • In a talk show there are usually only two people talking, which is similar to the robot interaction situation, so the audio and video files of talk shows can be used as training data to train the behavior model accurately.
  • Each audio and video file contains two interactive objects, and each audio and video file contains a complete dialogue scene.
  • The processing process for each audio and video file is: the audio data belonging to interactive object A and the audio data belonging to interactive object B are collected separately through speech recognition, the audio data of interactive object A is converted into text data, and the audio data of interactive object B is converted into text data.
  • The behavior of interactive object A and the behavior of interactive object B are collected through image analysis. It can be understood that interactive object A and interactive object B are only used to distinguish the two interactive objects in one audio and video file, and interactive object A (or interactive object B) in different audio and video files may be different individuals.
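  • The per-file processing just described can be sketched as follows; the helpers diarize, transcribe, and analyze_behavior are hypothetical placeholders standing in for the speech recognition and image analysis steps mentioned above.

```python
def process_av_file(av_path, diarize, transcribe, analyze_behavior):
    """Turn one collected audio/video file into (speaker, audio, text, behavior) samples."""
    samples = []
    # Split the file into clips per interactive object (A or B).
    for speaker, audio_clip, video_clip in diarize(av_path):
        text = transcribe(audio_clip)             # convert the speaker's audio data into text data
        behavior = analyze_behavior(video_clip)   # mouth, expression, body collected via image analysis
        samples.append({"speaker": speaker, "audio": audio_clip,
                        "text": text, "behavior": behavior})
    return samples
```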
  • Step S104 Associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
  • associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content specifically includes: according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through the key time nodes corresponding to the start time and the duration, to form audio model associated content associated with time nodes; the audio model associated content includes the audio information, the text information, the behavior model content, and the association relationship.
  • the behaviors in the behavior model are sorted according to their correlation with the audio information, and the most relevant behavior constitutes the first behavior information.
  • the mouth movement corresponding to the audio information is the first behavior, which has the strongest correlation with the audio information.
  • the behavior secondarily related to the audio information constitutes the second behavior information.
  • the facial expression of the avatar can be used as the second behavior, or the body movement can be used as the second behavior.
  • other behaviors are sorted according to their degree of correlation with the audio information, which is not strictly limited in the present disclosure.
  • the above time nodes include time nodes corresponding to the start time and the duration.
  • the most relevant mouth movements are strictly associated according to the time points corresponding to the start time and the duration of the audio information, and the sequentially related behaviors can be associated with the audio information according to an approximate time node.
  • the approximate time node here can be set with a certain time interval, for example [-5s, +5s], [-3s, +3s], [-2s, +2s], [-1s, +1s], [-0.5s, +0.5s], and so on.
  • Unrelated actions can be related by an array of unrelated time points on the time axis.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
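  • The three kinds of time association described above (strict, approximate, and non-associated) can be sketched as follows; the tolerance window and interval values are illustrative assumptions only.

```python
def strictly_associated(behavior_start, behavior_duration, audio_start, audio_duration):
    """Most relevant behavior (e.g. mouth shape): matches the audio time node exactly."""
    return behavior_start == audio_start and behavior_duration == audio_duration

def approximately_associated(behavior_start, audio_start, tolerance=1.0):
    """Sequentially related behavior: allowed within an approximate window, e.g. [-1s, +1s]."""
    return abs(behavior_start - audio_start) <= tolerance

def non_associated_nodes(total_duration, interval=2.0):
    """Unrelated behavior: an array of equally spaced time nodes on the time axis."""
    nodes, t = [], 0.0
    while t < total_duration:
        nodes.append(t)
        t += interval
    return nodes
```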
  • Step S105 Drive the behavior of the avatar according to the associated content of the model.
  • in step S105, the behavior of the avatar is driven by the behavior model information in the model associated content. This specifically includes: first parsing the associated model content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship between the above information; driving the behavior of the avatar through the behavior model information in the model associated content; and performing the driving between the audio information and the avatar behavior through the audio information and the association relationship. According to the order of correlation, the first behavior information most related to the audio information in the behavior model information is driven together with the audio information through the time node corresponding to the start time and the duration.
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is driven through the time node corresponding to the start time and the duration of the audio information; N is a natural number greater than 1.
  • the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • Step S106 Perform synchronization between the audio information and the behavior of the avatar.
  • in step S106, through the time axis with time node distribution established in the previous step, the audio information, the text information, and the behavior model information are associated through the time nodes according to the time nodes on the time axis, to form an association relationship associated with the time nodes; through the audio information and the association relationship, synchronization between the audio information and the avatar behavior is performed; the time nodes include the time nodes corresponding to the start time and the duration.
  • the synchronization between the audio information and the avatar behavior further includes: dividing the audio information into a plurality of segments, each segment having its own start time and duration; synchronizing the corresponding text information with the audio information through the time node corresponding to the start time and the duration; and synchronizing the first behavior information in the behavior model information most related to the audio information with the audio information through the time node corresponding to the start time and the duration.
  • the Nth behavior information sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and the duration; the N is a natural number greater than 1.
  • a non-associated time node array is set on the time axis; behavior information in the behavior model information that is not related to the audio information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
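  • A minimal sketch of the driving and synchronization flow of steps S105 and S106 is given below; the segment fields and the drive_* callbacks are assumptions for illustration, not the literal interface of the disclosure.

```python
def drive_avatar(segments, drive_lip, drive_expression, drive_action, expr_tolerance=1.0):
    """Drive avatar behaviors from parsed audio model associated content, segment by segment."""
    for seg in segments:
        start = seg["start"]                               # start time of this audio segment
        # First behavior (lip): strictly synchronized with the audio time nodes.
        for lip in seg["model"]["LipSync"]:
            drive_lip(lip["Data"], at=lip["Start"], until=lip["End"])
        # Second behavior (expression): synchronized within an approximate window.
        for expr in seg["model"]["Expression"]:
            if abs(expr["Start"] - start) <= expr_tolerance:
                drive_expression(expr["Data"], at=expr["Start"])
        # Nth behavior (body action): driven at its own associated time node.
        for act in seg["model"]["Action"]:
            drive_action(act["Data"], at=act["Start"], until=act["End"])
```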
  • Fig. 2 is a schematic diagram of the audio model related content structure provided by an embodiment of the present disclosure.
  • the audio model associated content is the file synthesized by the audio model file generation module in Fig. 1.
  • Fragmentation is performed according to the audio content, the text, and the models.
  • Each slice contains the audio, the text, the behavior models, and the association of each data element at time nodes.
  • the audio model file synthesized correspondingly in FIG. 1 generally consists of multiple slices as shown in FIG. 2, and each slice has the same data structure with different data content.
  • Take slice 1 as an example for illustration, including:
  • Audio segment 1: the first piece of audio data in the audio file.
  • The audio segments can be of a fixed size or divided according to the audio content; there is no restriction on the division method.
  • Text: the text content corresponding to the audio segment.
  • Behavior model: the behavior model corresponding to the audio segment and the scene, including expression, body, and mouth model data, but not limited to these types of behavior models.
  • Start time: the start time corresponding to each of the above elements; each of the above elements has its own start time, described here collectively.
  • End time: the end time corresponding to each of the above elements; each of the above elements has its own end time, described here collectively.
  • Duration: the duration corresponding to each of the above elements; each of the above elements has its own duration, described here collectively.
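  • The slice just described can be pictured as the following data structure; the field names and types are an illustrative assumption rather than the literal file format (for simplicity the times are stored once per slice, although each element can carry its own).

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Slice:
    index: int                      # slice index within the audio model file
    audio: bytes                    # audio segment data (e.g. audio segment 1)
    text: str                       # text content corresponding to the audio segment
    model: Dict[str, List[dict]]    # behavior model: mouth, expression, body model data
    start_time: float               # start time (seconds)
    end_time: float                 # end time
    duration: float                 # duration

audio_model_file: List[Slice] = [] # the file is a sequence of such slices
```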
  • FIG. 3 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure.
  • the data format of the audio information includes: segment index, segment size, start time, end time, text information, audio size, audio information and model content.
  • the model content includes behavior information and association relationships, and is the focus of the description here. Since the composition of the audio model content has been explained above, the definition and fragmentation of the model content are illustrated and described in detail in FIG. 3.
  • MoudleData contains LipSync (mouth model), Expression (expression model), and Action (body movement model). Each model further specifies its detailed content: Name (the name of the subdivided action), Start (the start time), End (the end time), and Data (the model data of the subdivided action).
  • LipSync is the name of the mouth model, which contains a number of different mouth model elements X, Y, etc.; each model element contains:
  • a model name, to distinguish different models.
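  • As an example of how the model content could be laid out, MoudleData with its three sub-models might look like the following; all names and values are invented for illustration.

```python
moudle_data = {
    "LipSync": [                                   # mouth model elements X, Y, ...
        {"Name": "X", "Start": 0.00, "End": 0.20, "Data": [0.1, 0.7, 0.3]},
        {"Name": "Y", "Start": 0.20, "End": 0.45, "Data": [0.6, 0.2, 0.0]},
    ],
    "Expression": [
        {"Name": "smile", "Start": 0.0, "End": 1.5, "Data": {"blendshape": "smile", "weight": 0.8}},
    ],
    "Action": [
        {"Name": "wave", "Start": 0.3, "End": 1.2, "Data": {"joint_track": "right_arm"}},
    ],
}
```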
  • Fig. 4 shows a schematic diagram of time synchronization of elements of audio model-associated content provided by an embodiment of the present disclosure.
  • This figure is a schematic diagram of the synchronization of each element of the audio model file, which includes time (the working time of the audio model file), audio (each segment of the audio), mouth shape, text, body movement, and expression, and the distribution of each element on the time axis.
  • the corresponding audio is divided into two segments by words, and each segment has its own time description (start and duration); the corresponding text has the same time description associated with the audio; the mouth shape model is related to the corresponding audio content, for example, the first audio segment in Figure 4 corresponds to mouth shape models 1, 2, and 3; the waving action is performed during the playback of the first audio segment, together with a blinking action at the same time. All elements work according to the same time axis.
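  • The timeline in the figure can be illustrated with a small, invented example in which the first audio segment carries three mouth shape models together with a wave and a blink, all referenced to the same time axis; every value below is made up for illustration.

```python
timeline = [
    {"audio": "segment 1", "start": 0.0, "duration": 1.2,
     "text": "hello there",
     "mouth": ["shape 1", "shape 2", "shape 3"],           # mouth shape models 1, 2, 3
     "actions": [{"name": "wave",  "start": 0.2, "duration": 0.8},
                 {"name": "blink", "start": 0.2, "duration": 0.3}]},
    {"audio": "segment 2", "start": 1.2, "duration": 0.9,
     "text": "nice to meet you",
     "mouth": ["shape 4", "shape 5"],
     "actions": []},
]
```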
  • FIG. 5 is a schematic diagram of an audio-driven device for avatar behavior provided by another embodiment of the present disclosure.
  • the apparatus for audio-driven avatar behavior includes: a receiving module 501, a text information generating module 502, a behavior model generating module 503, an audio model association module 504, a driving module 505, and a synchronization module 506, wherein:
  • the receiving module 501 is configured to receive audio information.
  • the smart device receives audio information.
  • the smart device in this embodiment takes an intelligent robot as an example. It has an anthropomorphic form, and the facial features of a virtual person are displayed on the display screen of its head. After the smart device receives the audio information, it coordinates the mouth shape of the virtual portrait, plays the corresponding voice synchronously, and at the same time can match the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassed, and other expressions.
  • the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information.
  • the audio information received in this embodiment may be the user's voice collected in real time by the intelligent robot while the user interacts with it, which is used as the source of the audio information; audio information stored in an external or internal storage device may also be called, and the source of the audio information is not limited to this.
  • the text information generation module 502 is configured to generate text information according to the audio information.
  • the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information.
  • the dialogue between the user and the robot includes dialogue information involving audio, and the dialogue information includes the characteristic data of the dialogue content.
  • the process of obtaining the dialogue information is the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; and using the text feature data as the feature data of the dialogue content.
  • the behavior model generation module 503 is configured to generate a behavior model according to the audio information and text information combined with scene information.
  • the current semantics, context, and context scene of the avatar dialogue are analyzed according to the incoming audio content and the corresponding text information, and the corresponding mouth shape, expression, and action behavior models are generated over time according to the audio content.
  • the behavior model is generated according to the audio information and the text information in combination with scene information; specifically, based on the received audio information and the corresponding text information, combined with the scene information, behavior models of the corresponding mouth shapes, facial expressions, and actions are generated over time for the avatar behavior.
  • the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
  • the audio model association module 504 is configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
  • Associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content specifically includes: according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through the key time nodes corresponding to the start time and the duration to form audio model associated content associated with time nodes; the audio model associated content includes the audio information, the text information, the behavior model content, and the association relationship.
  • the behaviors in the behavior model are sorted according to their correlation with the audio information, and the most relevant behavior action constitutes the first behavior information.
  • the mouth movement corresponding to the audio information is the first behavior action, which has the strongest correlation with the audio information.
  • the behaviors secondarily related to the audio information constitute the second behavior information.
  • the facial expressions of the avatar can be used as the second behavior, or body movements can be used as the second behavior.
  • other behaviors are sorted according to their degree of correlation with the audio information, which is not strictly limited in the present disclosure. A time axis with a time node distribution is established, and the audio information, the text information, and the behavior model information are associated through the time nodes according to the time nodes on the time axis, to form an association relationship associated with the time nodes.
  • the time nodes include time nodes corresponding to the start time and the duration.
  • the driving module 505 is configured to drive the behavior of the avatar according to the associated content of the model.
  • the Nth behavior information in the behavior model information that is sequentially related to the audio information is driven through the time node corresponding to the start time and the duration of the audio information; N is a natural number greater than 1.
  • the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array.
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • the synchronization module 506 is used for synchronizing the audio information and the behavior of the avatar.
  • audio information, text information and behavior model information are associated through the time nodes to form an association relationship associated with the time nodes; through the audio information and the association relationship , performing synchronization between the audio information and the behavior of the avatar; the time node includes a time node corresponding to a start time and a duration.
  • the synchronization module 506 is further configured to: divide the audio information into a plurality of segments, each segment has its own start time and duration, and pass the corresponding text information and the audio information through the start time The time node corresponding to the duration is synchronized, and the first behavior information most related to the audio information in the behavior model information is synchronized with the audio information through the time node corresponding to the start time and the duration.
  • the synchronization module 506 is further configured to: synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and duration; According to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration; the N is a natural number greater than 1 .
  • the synchronization module 506 is further configured to: set a non-associated time node array on the time axis; drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array .
  • the non-associated time nodes are set as equal time intervals or discrete time intervals.
  • the most relevant first behavior information is lip behavior information
  • the second most relevant behavior information is expression behavior information
  • the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
  • the device shown in FIG. 5 can execute the method of the embodiment shown in FIG. 1 .
  • FIG. 6 shows a system schematic diagram of an audio-driven avatar behavior provided by an embodiment of the present disclosure.
  • This is a schematic diagram of the system composition, combining the system diagram with the apparatus diagram of audio-driven avatar behavior in accompanying drawing 5. According to the logical relationship between the modules, the system diagram shows the logical relationship between the behavior model generation module, the audio model file generation module, the audio model analysis module, and the avatar behavior driving module.
  • the behavior model generation module analyzes the current semantics, context and avatar dialogue context scene according to the incoming audio content and corresponding text information, and generates corresponding mouth shapes, expressions and action behavior models according to time according to the video content.
  • the audio model file generation module associates the audio, the text, and the model according to the audio content, the text corresponding to the audio content, and the model data generated by the model generation module, based on the time nodes of the audio data playback, with each element associated at its corresponding node in the audio playback through the start time, the duration, and other key time nodes, to form an audio model file associated with time nodes; the file includes the time nodes (start time, duration), the audio file, the text, and the model information.
  • the audio model file generation module corresponds to some functions of the text information generation module and the audio model association module in FIG. 6 .
  • the audio model file parsing module parses the audio model file to obtain audio content, model content, text content, and associations among the above contents, including but not limited to time associations.
  • the audio model file parsing module corresponds to the audio model association module in FIG. 6 .
  • the avatar behavior driving module drives the avatar behavior according to the model content analyzed in the audio analysis module.
  • the synchronization between the audio playback and the action model is performed through the association between the audio content and the model.
  • the avatar behavior driving module corresponds to the driving module in FIG. 6 .
  • a device for audio-driven avatar behavior further comprising:
  • the audio playing module plays the audio content parsed in the above audio model file parsing module.
  • the text display module displays the text content parsed in the above-mentioned audio model file parsing module.
  • FIG. 7 shows a schematic structural diagram of an electronic device 700 suitable for implementing another embodiment of the present disclosure.
  • the terminal equipment in the embodiments of the present disclosure may include, but is not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (such as car navigation terminals), as well as fixed terminals such as digital TVs and desktop computers.
  • the electronic device shown in FIG. 7 is only an example, and should not limit the functions and application scope of the embodiments of the present disclosure.
  • an electronic device 700 may include a processing device (such as a central processing unit, a graphics processing unit, etc.) 701, which may execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. In the RAM 703, various programs and data necessary for the operation of the electronic device 700 are also stored.
  • the processing device 701, ROM 702, and RAM 703 are connected to each other through a communication line 704.
  • An input/output (I/O) interface 705 is also connected to the communication line 704 .
  • the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touchpad, a keyboard, a mouse, a camera, a microphone, an accelerometer, a gyroscope, etc.; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, a vibrator, etc.; a storage device 708 including, for example, a magnetic tape, a hard disk, etc.; and a communication device 709.
  • the communication means 709 may allow the electronic device 700 to communicate with other devices wirelessly or by wire to exchange data. While FIG. 7 shows electronic device 700 having various means, it should be understood that implementing or having all of the means shown is not a requirement. More or fewer means may alternatively be implemented or provided.
  • embodiments of the present disclosure include a computer program product, which includes a computer program carried on a non-transitory computer readable medium, where the computer program includes program code for executing the method shown in the flowchart.
  • the computer program may be downloaded and installed from a network via communication means 709, or from storage means 708, or from ROM 702.
  • the processing device 701 When the computer program is executed by the processing device 701, the above-mentioned functions defined in the methods of the embodiments of the present disclosure are performed.
  • the above-mentioned computer-readable medium in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium or any combination of the above two.
  • a computer readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media may include, but are not limited to, electrical connections with one or more wires, portable computer diskettes, hard disks, random access memory (RAM), read-only memory (ROM), erasable Programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage device, magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave carrying computer-readable program code therein. Such propagated data signals may take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • a computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, which can transmit, propagate, or transport a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • Program code embodied on a computer readable medium may be transmitted by any appropriate medium, including but not limited to wires, optical cables, RF (radio frequency), etc., or any suitable combination of the above.
  • the client and the server can communicate using any currently known or future developed network protocol such as HTTP (HyperText Transfer Protocol), and can be interconnected with digital data communication (e.g., a communication network) in any form or medium.
  • Examples of communication networks include local area networks (“LANs”), wide area networks (“WANs”), internetworks (e.g., the Internet), and peer-to-peer networks (e.g., ad hoc peer-to-peer networks), as well as any currently known or future developed network of.
  • the above-mentioned computer-readable medium may be included in the above-mentioned electronic device, or may exist independently without being incorporated into the electronic device.
  • the above-mentioned computer-readable medium carries one or more programs, and when the above-mentioned one or more programs are executed by the electronic device, the electronic device: executes the interaction method in the above-mentioned embodiment.
  • Computer program code for carrying out the operations of the present disclosure may be written in one or more programming languages, or combinations thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages.
  • the program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server.
  • the remote computer can be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computer (for example, through the Internet using an Internet service provider).
  • each block in a flowchart or block diagram may represent a module, program segment, or portion of code that contains one or more logical functions for implementing specified executable instructions.
  • the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams and/or flowchart illustrations, and combinations of blocks in the block diagrams and/or flowchart illustrations can be implemented by a dedicated hardware-based system that performs the specified functions or operations , or may be implemented by a combination of dedicated hardware and computer instructions.
  • the units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. Wherein, the name of a unit does not constitute a limitation of the unit itself under certain circumstances.
  • For example and without limitation, exemplary types of hardware logic components that may be used include: Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), and the like.
  • a machine-readable medium may be a tangible medium that may contain or store a program for use by or in conjunction with an instruction execution system, apparatus, or device.
  • a machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium.
  • a machine-readable medium may include, but is not limited to, electronic, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatus, or devices, or any suitable combination of the foregoing.
  • More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
  • An electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute any one of the methods in the foregoing first aspect.
  • A non-transitory computer-readable storage medium, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute any one of the methods in the foregoing first aspect.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Processing Or Creating Images (AREA)

Abstract

A method and an apparatus for audio driving of an avatar, and an electronic device, the method comprising: receiving audio information (S101); on the basis of the audio information, generating text information (S102); on the basis of the audio information and the text information in combination with scene information, generating a behaviour model (S103); associating the audio information, the text information, and the behaviour model with a time node to form audio model associated content (S104); on the basis of the model associated content, driving the behaviour of an avatar (S105); and implementing synchronisation between the audio information and the behaviour of the avatar (S106). The present method can associate and drive the behaviour of the avatar, and implement time node synchronisation of the audio information, text information and behaviour of the avatar, accurately synchronising audio information and avatar mouth movements, and simultaneously combining and synchronising facial expressions and body movements with current audio content.

Description

Method, apparatus, and electronic device for audio-driven avatars
Cross Reference
This application claims the priority of the Chinese patent application filed on August 3, 2021, with the application number 202110888459.X and the invention title "Method, apparatus, and electronic device for audio-driven avatar behavior", the entire contents of which are incorporated by reference into this application.
Technical Field
The present disclosure relates to the field of avatars, and in particular to a method, an apparatus, and an electronic device for audio-driven avatars.
Background
Traditional interactive smart devices, when interacting with users through an avatar, often only have the avatar output voice in a simple way, without coordinating the avatar's mouth shape, and the avatar's facial features show a single expression, without rich expressions of joy, anger, or sorrow. In traditional schemes for driving avatar behavior with audio, even if the avatar's mouth shape changes during voice interaction with the user, it is only a repetitive, simple opening and closing motion; the mouth shape, facial expression, and body behavior of the avatar in the smart device cannot have lip deformation coefficients generated synchronously from the real-time audio stream, so the driven avatar cannot perform precise mouth movements or realistic facial expressions.
Therefore, the prior art usually has the following problems: generating lip deformation coefficients is often time-consuming, so the audio information and the mouth movements of the avatar cannot be accurately synchronized, and facial expressions and body movements cannot be combined and synchronized with the current audio content.
Summary of the Invention
The purpose of the embodiments of the present invention is to provide a method for audio-driven avatar behavior, which drives the avatar to perform mouth shapes, facial expressions, and related body movements according to the semantics and context of the current audio information. Behavior model data for mouth shapes, facial expressions, and body movements are generated by preprocessing on a model generation server or module; the audio information, the corresponding text information, and the behavior model are then associated at key time points to form audio model associated content. The behavior of the avatar is driven according to the model associated content, and the audio information and the avatar behavior can be synchronized.
In order to achieve the above purpose, in the first aspect, an embodiment of the present invention provides a method for audio-driven avatar behavior, including:
receiving audio information;
generating text information according to the audio information;
generating a behavior model according to the audio information and the text information in combination with scene information;
associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
driving the behavior of the avatar according to the model associated content;
performing synchronization between the audio information and the behavior of the avatar.
进一步的,所述根据所述音频信息和文本信息结合场景信息生成行为模型,包括:Further, the generating a behavior model according to the audio information and text information combined with scene information includes:
根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及动作的行为模型。According to the received audio information and the corresponding text information, combined with the scene information, according to the behavior of the avatar, the corresponding behavior models of mouth shapes, expressions and actions are generated according to time.
进一步的,所述场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。Further, the scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
进一步的,所述将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,包括:Further, associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content includes:
根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容;According to the audio information, the text information corresponding to the audio information, and the behavior model, the key time nodes corresponding to the start time and the duration are associated to form audio model associated content associated with time nodes;
所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。The associated content of the audio model includes the audio information, text information, behavior model content and association relationship.
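For illustration only, the audio model associated content described above can be pictured as a record of the following shape; this is a non-limiting sketch written in Python, and the type and field names are illustrative rather than part of the claimed format.

from dataclasses import dataclass, field
from typing import List

@dataclass
class BehaviorEntry:
    name: str          # one lip-sync, expression or action element
    start: float       # time node: start time, in seconds
    duration: float    # time node: duration, in seconds
    data: bytes = b""  # model data used to drive the avatar

@dataclass
class AudioModelSlice:
    audio: bytes       # one audio segment
    text: str          # text corresponding to the segment
    start: float       # start time node of the segment
    duration: float    # duration of the segment
    behaviors: List[BehaviorEntry] = field(default_factory=list)

The association relation is carried by the shared start-time and duration nodes: every element of a slice refers to the same timeline, which is what later allows the audio playback and the avatar behavior to be synchronized.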
进一步的,所述根据所述模型关联内容对虚拟人像行为进行驱动,包括:Further, the driving of the behavior of the avatar according to the associated content of the model includes:
将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;Analyzing the associated content of the associated models to obtain the audio information, text information, and behavior model information, as well as the association relationship between the above information;
通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。The behavior of the avatar is driven by the behavior model information in the model-associated content.
进一步的,所述进行音频信息和所述虚拟人像行为之间的同步,包括:Further, the synchronization between the audio information and the behavior of the avatar includes:
建立具有时间节点分布的时间轴;Establish a time axis with time node distribution;
按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;Associating audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes;
通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;Synchronize the audio information and the behavior of the avatar through the audio information and the association relationship;
所述时间节点包括起始时间和持续时间对应的时间节点。The time nodes include time nodes corresponding to the start time and the duration.
进一步的,所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:Further, the synchronization between the audio information and the behavior of the avatar further includes:
将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间;dividing the audio information into a plurality of segments, each segment having a respective start time and duration;
将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步;Synchronizing the corresponding text information and the audio information through the time nodes corresponding to the start time and duration;
将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。Synchronize the first behavior information most related to the audio information in the behavior model information with the audio information through a time node corresponding to a start time and a duration.
所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:The synchronization between the audio information and the behavior of the avatar further includes:
将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;Synchronizing second behavior information in the behavior model information that is secondarily related to the audio information with the audio information at a time node corresponding to a start time and a duration;
以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;In this way, according to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration;
所述N为大于1的自然数。The N is a natural number greater than 1.
进一步的,所述方法还包括:Further, the method also includes:
在所述时间轴上设置非关联时间节点阵列;setting an array of non-associated time nodes on the time axis;
将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。The behavior information unrelated to the audio information in the behavior model information is driven according to the non-associated time node array.
进一步的,所述非关联时间节点设置成等时间间隔或离散时间间隔。Further, the non-associated time nodes are set as equal time intervals or discrete time intervals.
进一步的,所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。Further, the most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information.
第二方面,本公开实施例提供一种音频驱动虚拟人像行为的装置,包括:In the second aspect, an embodiment of the present disclosure provides an audio-driven device for avatar behavior, including:
接收模块,用于接收音频信息;A receiving module, configured to receive audio information;
文本信息生成模块,用于根据所述音频信息生成文本信息;A text information generating module, configured to generate text information according to the audio information;
行为模型生成模块,用于根据所述音频信息和文本信息结合场景信息生成行为模型;A behavior model generation module, used to generate a behavior model according to the audio information and text information in combination with scene information;
音频模型关联模块,用于将所述音频信息、所述文本信息以及所述行 为模型结合时间节点进行关联,形成音频模型关联内容;The audio model association module is used to associate the audio information, the text information and the behavior model in conjunction with time nodes to form audio model associated content;
驱动模块,用于根据所述模型关联内容对虚拟人像行为进行驱动;A driving module, configured to drive the behavior of the avatar according to the associated content of the model;
同步模块,用于进行音频信息和所述虚拟人像行为之间的同步。The synchronization module is used for synchronizing the audio information and the behavior of the avatar.
第三方面,本公开实施例提供一种电子设备,包括:In a third aspect, an embodiment of the present disclosure provides an electronic device, including:
存储器,用于存储计算机可读指令;以及memory for storing computer readable instructions; and
处理器,用于运行所述计算机可读指令,使得所述电子设备实现上述第一方面中任意一项所述的方法。A processor, configured to run the computer-readable instructions, so that the electronic device implements the method described in any one of the above first aspects.
第四方面,本公开实施例提供一种非暂态计算机可读存储介质,用于存储计算机可读指令,当所述计算机可读指令由计算机执行时,使得所述计算机实现上述第一方面中任意一项所述的方法。In a fourth aspect, an embodiment of the present disclosure provides a non-transitory computer-readable storage medium for storing computer-readable instructions. When the computer-readable instructions are executed by a computer, the computer implements the above-mentioned first aspect. any one of the methods described.
The embodiments of the present disclosure disclose a method, an apparatus, an electronic device and a computer-readable storage medium for audio-driven avatar behavior. The method includes: receiving audio information; generating text information according to the audio information; generating a behavior model according to the audio information and the text information in combination with scene information; associating the audio information, the text information and the behavior model with time nodes to form audio model associated content; driving the avatar behavior according to the model associated content; and synchronizing the audio information and the avatar behavior. With the audio-driven avatar behavior method of the present disclosure, the avatar behavior can be driven in an associated manner, the audio information, the text information and the avatar behavior can be synchronized at time nodes, the audio information can be accurately synchronized with the avatar's mouth movements, and facial expressions and body movements can be combined and synchronized with the current audio content.
上述说明仅是本公开技术方案的概述,为了能更清楚了解本公开的技术手段,而可依照说明书的内容予以实施,并且为让本公开的上述和其他目的、特征和优点能够更明显易懂,以下特举较佳实施例,并配合附图,详细说明如下。The above description is only an overview of the technical solution of the present disclosure. In order to better understand the technical means of the present disclosure, it can be implemented according to the contents of the specification, and in order to make the above and other purposes, features and advantages of the present disclosure more obvious and understandable , the following preferred embodiments are specifically cited below, and are described in detail as follows in conjunction with the accompanying drawings.
Description of the Drawings
通过阅读下文优选实施方式的详细描述,各种其他的优点和益处对于本领域普通技术人员将变得清楚明了。附图仅用于示出优选实施方式的目的,而并不认为是对本公开的限制。而且在整个附图中,用相同的参考符号表示相同的部件。在附图中:Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiment. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered as limiting the present disclosure. Also throughout the drawings, the same reference numerals are used to designate the same components. In the attached picture:
图1为本公开一实施例提供的音频驱动虚拟人像行为的方法的流程示意图;FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure;
图2为本公开一实施例提供的音频驱动虚拟人像行为的系统示意图;FIG. 2 is a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure;
图3为本公开一实施例提供的音频模型关联内容结构示意图;Fig. 3 is a schematic diagram of the audio model associated content structure provided by an embodiment of the present disclosure;
图4为本公开一实施例提供的音频信息的数据格式示意图;FIG. 4 is a schematic diagram of a data format of audio information provided by an embodiment of the present disclosure;
图5为本公开一实施例提供的音频模型关联内容各元素进行时间同步示意图;FIG. 5 is a schematic diagram of time synchronization of elements of audio model-related content provided by an embodiment of the present disclosure;
图6为本公开另一实施例提供的音频驱动虚拟人像行为的装置示意图;FIG. 6 is a schematic diagram of an audio-driven avatar behavior device provided by another embodiment of the present disclosure;
图7为本公开另一实施例提供的电子设备的结构示意图。Fig. 7 is a schematic structural diagram of an electronic device provided by another embodiment of the present disclosure.
Detailed Description
为了能够更清楚地描述本公开的技术内容,下面结合具体实施例来进行进一步的描述。In order to describe the technical content of the present disclosure more clearly, further description will be given below in conjunction with specific embodiments.
以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是与如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。The implementations described in the following exemplary embodiments do not represent all implementations consistent with this application. Rather, they are merely examples of apparatuses and methods consistent with aspects of the present application as recited in the appended claims.
下面将参照附图更详细地描述本公开的实施例。虽然附图中显示了本公开的某些实施例,然而应当理解的是,本公开可以通过各种形式来实现,而 且不应该被解释为限于这里阐述的实施例,相反提供这些实施例是为了更加透彻和完整地理解本公开。应当理解的是,本公开的附图及实施例仅用于示例性作用,并非用于限制本公开的保护范围。Embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. Although certain embodiments of the present disclosure are shown in the drawings, it should be understood that the disclosure may be embodied in various forms and should not be construed as limited to the embodiments set forth herein; A more thorough and complete understanding of the present disclosure. It should be understood that the drawings and embodiments of the present disclosure are for exemplary purposes only, and are not intended to limit the protection scope of the present disclosure.
应当理解,本公开的方法实施方式中记载的各个步骤可以按照不同的顺序执行,和/或并行执行。此外,方法实施方式可以包括附加的步骤和/或省略执行示出的步骤。本公开的范围在此方面不受限制。It should be understood that the various steps described in the method implementations of the present disclosure may be executed in different orders, and/or executed in parallel. Additionally, method embodiments may include additional steps and/or omit performing illustrated steps. The scope of the present disclosure is not limited in this respect.
本文使用的术语“包括”及其变形是开放性包括,即“包括但不限于”。术语“基于”是“至少部分地基于”。术语“一个实施例”表示“至少一个实施例”;术语“另一实施例”表示“至少一个另外的实施例”;术语“一些实施例”表示“至少一些实施例”。其他术语的相关定义将在下文描述中给出。As used herein, the term "comprise" and its variations are open-ended, ie "including but not limited to". The term "based on" is "based at least in part on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one further embodiment"; the term "some embodiments" means "at least some embodiments." Relevant definitions of other terms will be given in the description below.
需要注意,本公开中提及的“第一”、“第二”等概念仅用于对不同的装置、模块或单元进行区分,并非用于限定这些装置、模块或单元所执行的功能的顺序或者相互依存关系。It should be noted that concepts such as "first" and "second" mentioned in this disclosure are only used to distinguish different devices, modules or units, and are not used to limit the sequence of functions performed by these devices, modules or units or interdependence.
需要注意,本公开中提及的“一个”、“多个”的修饰是示意性而非限制性的,本领域技术人员应当理解,除非在上下文另有明确指出,否则应该理解为“一个或多个”。It should be noted that the modifications of "one" and "multiple" mentioned in the present disclosure are illustrative and not restrictive, and those skilled in the art should understand that unless the context clearly indicates otherwise, it should be understood as "one or more" multiple".
在本公开使用的术语是仅仅出于描述特定实施例的目的,而非旨在限制本申请。在本公开和所附权利要求书中所使用的单数形式的“一种”、“所述”和“该”也旨在包括多数形式,除非上下文清楚地表示其他含义。下面参考附图详细描述公开的各实施方式。The terminology used in this disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in this disclosure and the appended claims, the singular forms "a", "the", and "the" are intended to include the plural forms as well, unless the context clearly dictates otherwise. The disclosed embodiments are described in detail below with reference to the accompanying drawings.
FIG. 1 is a schematic flowchart of a method for audio-driven avatar behavior provided by an embodiment of the present disclosure. The method provided by this embodiment may be executed by an apparatus for audio-driven avatar behavior; the apparatus may be implemented as software, or as a combination of software and hardware, and may be integrated into a device of a system for audio-driven avatar behavior, such as a terminal device. As shown in FIG. 1, the method includes the following steps:
步骤S101:接收音频信息。Step S101: Receive audio information.
在步骤S101中,智能设备接收音频信息,此处智能设备可以为智能机器人、智能终端以及其它具有屏幕显示的智能设备等,该智能设备可显示虚拟人像,例如可呈现虚拟人像的动画效果,可自行播放虚拟人像的动画,也可与用户进行交互行为。本实施例中的智能设备以智能机器人为例,其具有拟人的形态,头部的显示屏上具有虚拟人像的五官显示,在智能设备接收到音频信息后,和配合虚拟人像中的口型,同步地将对应的语音播放出来,同时可配合机器人的虚拟人像的拟人表情,比如伤心、大笑、微笑、大哭、无奈、尴尬等表情。另外,该机器人还可实现其它行为,例如摆手、摊手、摇头、点头等,也可根据音频信息,配合虚拟人像口型和表情同步表现出来。本实施例中的接收音频信息,可以采用用户与智能机器人交互时,智能机器人实时采集用户的语音信息,将其作为音频信息的来源,也可调取外部或内部存储设备中的音频信息,音频信息的来源不限于此。In step S101, the smart device receives audio information. Here, the smart device can be a smart robot, a smart terminal, or other smart devices with screen display. Play the animation of the avatar by itself, and also interact with the user. The smart device in this embodiment takes the smart robot as an example. It has an anthropomorphic form, and the display screen of the head has the facial features of the virtual portrait. After the smart device receives the audio information, it cooperates with the mouth shape of the virtual portrait. The corresponding voice is played out synchronously, and at the same time, it can cooperate with the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassing and other expressions. In addition, the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information. The receiving audio information in this embodiment can adopt when the user interacts with the intelligent robot, the intelligent robot collects the voice information of the user in real time, and uses it as the source of the audio information, and also can call the audio information in the external or internal storage device, the audio The source of information is not limited to this.
步骤S102:根据所述音频信息生成文本信息。Step S102: Generate text information according to the audio information.
在步骤S102中,用户与智能机器人对话交互的过程中涉及音频信息的输入和接收,用户与机器人的对话中包括涉及音频的对话信息,对话信息包括对话内容的特征数据,则获取对话信息的过程即为确定对话内容的特征数据的过程:获取原始文本信息,原始文本信息为对话内容所对应的文本信息;从该原始文本信息中提取文本特征数据;将文本特征数据,作为对话内容的特征数据。In step S102, the process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information, the dialogue between the user and the robot includes dialogue information related to audio, and the dialogue information includes characteristic data of the dialogue content, then the process of acquiring dialogue information It is the process of determining the feature data of the dialogue content: obtaining the original text information, which is the text information corresponding to the dialogue content; extracting the text feature data from the original text information; using the text feature data as the feature data of the dialogue content .
具体的说,通常机器人与交互对象进行对话交互的过程为:机器人说一段话,交互对象回复该段话;或者,交互对象说一段话,而机器人对交互对 象的话进行回复;还可以是交互对象和机器人可能同时说出第一段话。因而该原始文本信息可以为该机器人产生,也可以是交互对象产生,还可以同时包括机器人和交互对象同时产生。本实施方式中,按照上述三种情况,分别介绍确定获取原始文本信息的过程:Specifically, the process of dialogue and interaction between the robot and the interactive object is usually: the robot speaks a paragraph, and the interactive object replies to the paragraph; or, the interactive object speaks a paragraph, and the robot replies to the interactive object; it can also be an interactive object and the robot may speak the first paragraph at the same time. Therefore, the original text information may be generated by the robot, or by the interactive object, or may be generated by both the robot and the interactive object. In this embodiment, according to the above three situations, the process of determining to obtain the original text information is introduced respectively:
情境一:当原始文本信息为机器人产生的对话内容所对应的文本信息。Scenario 1: When the original text information is the text information corresponding to the dialogue content generated by the robot.
获取原始文本信息,具体包括:获取机器人待播放的文本信息;并将待播放的文本信息作为原始文本信息。Obtaining the original text information specifically includes: obtaining the text information to be played by the robot; and using the text information to be played as the original text information.
情境二:当原始文本信息为交互对象产生的对话内容所对应的文本信息。Scenario 2: When the original text information is the text information corresponding to the dialogue content generated by the interactive object.
获取原始文本信息,具体包括:采集交互对象说话时发出的音频数据;对该音频数据进行语音识别,并将该语音识别结果作为原始文本信息。Obtaining the original text information specifically includes: collecting audio data emitted by the interactive object when speaking; performing speech recognition on the audio data, and using the speech recognition result as the original text information.
情境三:当原始文本信息包括机器人产生的对话内容所对应的文本信息,以及交互对象产生的对话内容的文本信息。Scenario 3: When the original text information includes the text information corresponding to the dialog content generated by the robot, and the text information of the dialog content generated by the interactive object.
The text information to be played by the robot may be obtained in the manner of Scenario 1, and the text information of the dialogue content generated by the interaction object may be obtained in the manner of Scenario 2; the text information to be played and the text information of the dialogue content generated by the interaction object are together used as the original text information. The specific acquisition processes of Scenario 1 and Scenario 2 are not repeated here.
一个具体的实现中,从原始文本信息中提取文本特征数据,具体包括:将原始文本信息输入预设的文本提取模型,获得文本特征数据,文本提取模型是根据训练库中存储的各原始文本信息,以及与各原始文本信息对应的文本特征数据训练获得。In a specific implementation, the text feature data is extracted from the original text information, which specifically includes: inputting the original text information into a preset text extraction model to obtain the text feature data. The text extraction model is based on the original text information stored in the training library. , and the text feature data corresponding to each original text information are obtained through training.
Specifically, each piece of original text information stored in the training library is used as input data of the text extraction model, and the text feature data corresponding to each piece of original text information is used as output data. A Recurrent Neural Network ("RNN") model structure can be used to train on the input data and output data to determine the text extraction model; a typical recurrent neural network is, for example, the Long Short-Term Memory ("LSTM") model architecture.
构建完该文本提取模型后,将原始文本信息输入该文本提取模型,即可得到该文本特征数据。After the text extraction model is constructed, the original text information is input into the text extraction model to obtain the text feature data.
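As a rough, non-authoritative sketch of such a text extraction model, assuming a PyTorch environment and an already-built vocabulary (the class name, dimensions and vocabulary size below are illustrative assumptions):

import torch
import torch.nn as nn

class TextFeatureExtractor(nn.Module):
    """LSTM-based extractor: original text information in, text feature data out."""
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=256):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):
        embedded = self.embedding(token_ids)  # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.lstm(embedded)  # final hidden state of the LSTM
        return hidden[-1]                     # (batch, hidden_dim) text feature data

# After training on (original text, text feature) pairs from the training library,
# a tokenized sentence is fed in to obtain its feature vector:
model = TextFeatureExtractor(vocab_size=10000)
features = model(torch.randint(0, 10000, (1, 12)))  # one sentence of 12 token ids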
步骤S103:根据所述音频信息和文本信息结合场景信息生成行为模型。Step S103: Generate a behavior model according to the audio information and text information combined with scene information.
在步骤S103中,根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据音视频内容按照时间生成对应的口型,表情以及动作的行为模型。所述根据所述音频信息和文本信息结合场景信息生成行为模型,具体的,根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及动作的行为模型。其中场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。In step S103, according to the incoming audio content and corresponding text information, analyze the current semantics, context, and context scene of the avatar dialogue, and generate corresponding mouth shapes, expressions, and action behavior models according to time according to the audio and video content . The behavior model is generated according to the audio information and text information combined with scene information, specifically, based on the received audio information and the corresponding text information, combined with the scene information, and according to the behavior of the avatar according to time. Behavioral models of corresponding mouth shapes, facial expressions, and actions. The scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
The behavior model is trained in advance, before the dialogue information is acquired, from the audio information in a sample training library and the behaviors corresponding to that audio information. During dialogue interaction between the user and the robot, the dialogue content produced by one party usually affects the behavior of the other party while listening; strongly related actions, such as mouth shapes, need to be synchronized with the audio information, and secondarily related actions likewise need to be synchronized with the audio information, so the behavior model has a correspondence with the audio information. Mouth-shape training is particularly important for robot behavior: a large amount of audio-video material is usually collected, containing a large amount of audio information and the corresponding mouth-shape movements, and big-data training is performed on the correspondence between the audio and the mouth shapes of the figures in the audio-video material to obtain the corresponding mouth-shape movements. In addition, the robot's mouth-shape movements can usually be combined with expressions and body movements at the same time. Alternatively, the dialogue content produced by the robot itself also affects its own behavior.
其中,样本训练库中的各对话信息,以及与各对话信息对应的行为动作 可以采用如下方式获取:Among them, each dialogue information in the sample training library, and the behavior actions corresponding to each dialogue information can be obtained in the following ways:
A large amount of audio-video file data is collected to obtain a large amount of audio information and audio-video information, for example, 4000 audio-video files. To ensure the accuracy of the audio information in the sample training library, audio-video containing dialogue scenes may be collected, for example, audio-video files of talk shows. A talk show usually involves only two people in conversation, which is similar to the dialogue situation between the robot and the interaction object; therefore, using the talk-show audio-video files as training data allows the behavior model to be trained accurately.
Since each audio-video file contains two interaction objects and a complete dialogue scene, each file is processed as follows: the audio data belonging to interaction object A and the audio data belonging to interaction object B are collected separately by means of speech recognition, and the audio data of interaction object A and of interaction object B are each converted into text data. At the same time, the behaviors of interaction object A and the behaviors of interaction object B are collected through image analysis. It should be understood that interaction object A and interaction object B are only used to distinguish the two interaction objects within one audio-video file, and the interaction object A (or interaction object B) in different audio-video files may be different individuals.
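A minimal sketch of this data-collection step, where speech_to_text and extract_behaviors stand in for an unspecified speech-recognition engine and image-analysis pipeline (both names are assumptions of this sketch, not components named by the application):

def build_training_pairs(files, speech_to_text, extract_behaviors):
    """Assemble (audio, text, behaviors) samples from dialogue audio-video files."""
    pairs = []
    for f in files:  # e.g. the ~4000 collected talk-show files
        for speaker in ("A", "B"):  # the two interaction objects in one file
            audio = f["audio"][speaker]           # audio data of this interaction object
            text = speech_to_text(audio)          # speech recognition -> text data
            behaviors = extract_behaviors(f["video"], speaker)  # mouth/expression/action labels
            pairs.append({"audio": audio, "text": text, "behaviors": behaviors})
    return pairs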
步骤S104:将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容。Step S104: Associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
在步骤S104中,本公开实施例中,将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,具体包括:根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容,所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。其中行为模型中的行为按照与音频信息 的相关性进行类型排序,最相关的行为动作包含第一行为信息,本实施例中,与音频信息对应的口型动作为第一行为动作,与音频信息的关联性最强。与音频信息次相关的行为动作包含第二行为信息,本实施例中可将虚拟人像的表情动作作为第二行为动作,也可以采用肢体动作作为第二行为动作,其它行为以此按照与音频信息的相关度进行排序,本公开对此不做严格的限定。In step S104, in the embodiment of the present disclosure, the audio information, the text information, and the behavior model are associated with time nodes to form audio model associated content, which specifically includes: according to the audio information, the audio The text information corresponding to the information and the behavior model are associated through key time nodes corresponding to the start time and duration to form audio model associated content associated with time nodes, and the audio model associated content includes the audio information , text information, behavior model content and association relationship. Wherein, the behaviors in the behavior model are sorted according to the correlation with the audio information, and the most relevant behavior includes the first behavior information. In this embodiment, the mouth movement corresponding to the audio information is the first behavior, which is related to the audio information. the strongest correlation. The behavior related to the audio information contains the second behavior information. In this embodiment, the facial expression of the avatar can be used as the second behavior, or the body movement can be used as the second behavior. Other behaviors are based on the audio information. The correlation degree is sorted, which is not strictly limited in the present disclosure.
建立具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系,所述时间节点包括起始时间和持续时间对应的时间节点。其中,最相关的口型动作根据音频信息对应的起始时间和持续时间对应的时间点进行严格关联,依次相关的行为动作可以按照大致的时间节点与音频信息进行关联。该处的大致的时间节点可以设定一定的时间区间,例如在[-5s,+5s]、[-3s,+3s]、[-2s,+2s]、[-1s,+1s]、[-0.5s,+0.5s]等等。而不相关的行为动作可按照时间轴上的非关联时间点阵列进行关联。其中,所述非关联时间节点设置成等时间间隔或离散时间间隔。Establish a time axis with a time node distribution, and associate audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes, so The above time nodes include time nodes corresponding to the start time and the duration. Among them, the most relevant mouth movements are strictly related according to the time points corresponding to the start time and duration of the audio information, and the sequentially related behaviors can be related to the audio information according to the approximate time node. The approximate time node here can set a certain time interval, for example, in [-5s, +5s], [-3s, +3s], [-2s, +2s], [-1s, +1s], [ -0.5s,+0.5s] and so on. Unrelated actions can be related by an array of unrelated time points on the time axis. Wherein, the non-associated time nodes are set as equal time intervals or discrete time intervals.
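A hedged sketch of this time-node association, where audio segments and behavior entries are assumed to be simple dictionaries carrying "start" and "duration" values in seconds; the 0.5 s tolerance and the idle-node spacing are example values rather than prescribed ones.

def associate(audio_segments, lip_entries, related_entries, tolerance=0.5):
    """Attach behavior entries to audio segments by start-time/duration nodes."""
    timeline = []
    for seg in audio_segments:
        seg_end = seg["start"] + seg["duration"]
        linked = {"audio": seg}
        # most related (mouth shapes): strictly tied to the segment's own time span
        linked["lip"] = [e for e in lip_entries
                         if seg["start"] <= e["start"] < seg_end]
        # secondarily related (e.g. expressions): may start within a tolerance
        # window such as [-0.5 s, +0.5 s] around the segment start
        linked["related"] = [e for e in related_entries
                             if abs(e["start"] - seg["start"]) <= tolerance]
        timeline.append(linked)
    # behaviors unrelated to the audio are driven from a separate node array,
    # here at equal 2 s intervals; irregular (discrete) instants would also work
    idle_nodes = [i * 2.0 for i in range(10)]
    return timeline, idle_nodes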
步骤S105:根据所述模型关联内容对虚拟人像行为进行驱动。Step S105: Drive the behavior of the avatar according to the associated content of the model.
在步骤S105中,通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。具体包括:首先将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的驱动。按照相关度顺序,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动。然后将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息 通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;所述N为大于1的自然数。最后将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。In step S105, the behavior of the avatar is driven by the behavior model information in the model-associated content. It specifically includes: first analyzing the associated content of the model after association, obtaining the audio information, text information and behavior model information, and the association relationship between the above information; through the behavior model in the model association content The information drives the behavior of the avatar. Through the audio information and the association relationship, the driving between the audio information and the behavior of the avatar is performed. According to the order of correlation, the first behavior information most related to the audio information in the behavior model information is driven with the audio information through the time node corresponding to the start time and the duration. Then, the second behavior information related to the audio information in the behavior model information is driven through the time node corresponding to the start time and duration and the audio information; thus, the behavior model is The Nth line of information in the information that is sequentially related to the audio information is driven by the time node corresponding to the start time and the duration of the audio information; the N is a natural number greater than 1. Finally, the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
步骤S106:进行音频信息和所述虚拟人像行为之间的同步。Step S106: Perform synchronization between the audio information and the behavior of the avatar.
在步骤S106中,通过上一步骤中建立的具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;所述时间节点包括起始时间和持续时间对应的时间节点。所述进行音频信息和所述虚拟人像行为之间的同步,进一步包括:将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间,将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;所述N为大于1的自然数。另外,在所述时间轴上设置非关联时间节点阵列;将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息; 所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。In step S106, through the time axis with time node distribution established in the previous step, the audio information, text information and behavior model information are associated according to the time nodes on the time axis through the time nodes to form the following The association relationship associated with the time node; through the audio information and the association relationship, the synchronization between the audio information and the behavior of the avatar is performed; the time node includes the time node corresponding to the start time and duration . The synchronization between the audio information and the behavior of the avatar further includes: dividing the audio information into a plurality of segments, each segment has its own start time and duration, and the corresponding text information Synchronize with the audio information through the time node corresponding to the start time and duration, and combine the first behavior information in the behavior model information most related to the audio information through the time node corresponding to the start time and duration with The audio information is synchronized. Synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and the duration; thus, according to the relevant order, the behavior model information The Nth behavior information sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and the duration; the N is a natural number greater than 1. In addition, a non-associated time node array is set on the time axis; behavior information in the behavior model information that is not related to the audio information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
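Purely as an illustration of the synchronization idea, reusing the dictionary-based timeline from the sketch above; play_audio and apply_behavior are placeholders for the playback and avatar back ends, and nothing here is the claimed implementation.

import time

def drive(timeline, play_audio, apply_behavior):
    """Walk the shared timeline and trigger each element at its own time node."""
    events = []
    for item in timeline:
        seg = item["audio"]
        events.append((seg["start"], "audio", seg))
        for entry in item.get("lip", []) + item.get("related", []):
            events.append((entry["start"], "behavior", entry))
    t0 = time.monotonic()
    for at, kind, payload in sorted(events, key=lambda e: e[0]):
        delay = at - (time.monotonic() - t0)
        if delay > 0:
            time.sleep(delay)            # wait for the shared time node
        if kind == "audio":
            play_audio(payload)          # start playing this audio segment
        else:
            apply_behavior(payload)      # drive mouth shape / expression / action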
FIG. 3 is a schematic diagram of the structure of the audio model associated content provided by an embodiment of the present disclosure, where the audio model associated content is the file synthesized by the audio model file generation module in FIG. 2.
Slicing is performed according to the audio content, the text and the models; each slice contains the audio, the text, the behavior model, and the association of these data at time nodes.
The audio model file synthesized in FIG. 2 generally consists of multiple slices, as shown in FIG. 3, where each slice has the same data structure but different data content. Taking slice 1 as an example, it contains:
音频片段1:为音频文件中第一片音频数据,音频片段的分片可以为固定大小,或者按照音频内容进行划分,对于划分方式不做限制;Audio segment 1: It is the first piece of audio data in the audio file. The segment of the audio segment can be a fixed size, or divided according to the audio content, and there is no restriction on the division method;
文本:为音频片段所对应的文本内容;Text: the text content corresponding to the audio clip;
行为模型:为音频片段以及场景所对应的行为模型,包括表情,肢体,嘴型模型数据,包括但不限于这几种行为模型;Behavior model: the behavior model corresponding to audio clips and scenes, including expression, body, and mouth model data, including but not limited to these types of behavior models;
开始时间:该时间为对应上述各个元素开始的时间;对应上述各个元素均有各自的起始时间,此处进行统一介绍说明;Start time: This time is the start time corresponding to each of the above elements; corresponding to each of the above elements has its own start time, here is a unified introduction;
结束时间:该时间为对应上述各个元素结束的时间;对应上述各个元素均有各自的结束时间,此处进行统一介绍说明;End time: This time is the end time corresponding to each of the above elements; corresponding to each of the above elements has its own end time, here is a unified introduction;
持续时间:该时间为对应上述各个元素持续的时间;对应上述各个元素均有各自的持续时间,此处进行统一介绍说明。Duration: This time is the duration corresponding to each of the above elements; corresponding to each of the above elements has its own duration, here is a unified introduction.
FIG. 4 is a schematic diagram of the data format of the audio information provided by an embodiment of the present disclosure.
The data format of the audio information includes: a segment index, a segment size, a start time, an end time, text information, an audio size, audio information, and model content. The model content includes the behavior information and the association relations, and the description here focuses on the model content. Since the composition of the audio model associated content has already been explained with reference to FIG. 3, the definition and slicing of the model content are explained and described in detail below.
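As a hedged illustration of how one such segment record might be serialized in this field order (the byte order, field widths and time unit are assumptions of this sketch, not the claimed format):

import struct

def pack_segment(index, start_ms, end_ms, text, audio, model_content):
    """segment index, segment size, start/end time, text, audio size, audio, model content"""
    text_b = text.encode("utf-8")
    body = (struct.pack("<I", len(text_b)) + text_b    # text information
            + struct.pack("<I", len(audio)) + audio    # audio size + audio information
            + model_content)                           # serialized model content
    header = struct.pack("<IIQQ", index, len(body), start_ms, end_ms)
    return header + body

# e.g. one 0-500 ms segment carrying the character "你", raw audio bytes and empty model content
record = pack_segment(1, 0, 500, "你", b"\x00" * 320, b"{}")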
对于模型内容的定义:For the definition of model content:
(The definition of the model content is given as structured data in figures PCTCN2022084697-appb-000001 to PCTCN2022084697-appb-000003 of the original application.)
其中MoudleData中包含LipSync(嘴型模型),Expression(表情模型),Action(肢体动作模型),其中每个模型当中会进一步说明该模型详细内容:Name(细分动作的名称),Start(起始时间),End(结束时间),以及Data(该细分动作的模型数据)。Among them, MoudleData contains LipSync (mouth model), Expression (expression model), Action (body movement model), and each model will further explain the detailed content of the model: Name (the name of the subdivision action), Start (the start time), End (end time), and Data (model data of the subdivision action).
以图中的LipSync为例,其中“LipSync”为嘴型模型的名称,该嘴型模型中包含了多个不同的嘴型模型元素X,Y等等,每个模型元素包含:Take LipSync in the figure as an example, where "LipSync" is the name of the mouth model, which contains a number of different mouth model elements X, Y, etc., each model element contains:
Name:模型名称,区分不同的模型;Name: model name, to distinguish different models;
Start:对应模型开始的时间;Start: the time corresponding to the start of the model;
End:对应模型结束的时间;End: the time corresponding to the end of the model;
Data:对应的模型数据;Data: the corresponding model data;
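Because the definition itself is only available as figures in the original application, the following is a hedged reconstruction of its likely shape based on the field descriptions above; the element names, time values and time unit are invented for illustration, and the variable name keeps the application's own spelling "MoudleData".

moudle_data = {
    "LipSync": [     # mouth-shape model
        {"Name": "X", "Start": 0,   "End": 120, "Data": "<mouth-shape model data>"},
        {"Name": "Y", "Start": 120, "End": 260, "Data": "<mouth-shape model data>"},
    ],
    "Expression": [  # expression model
        {"Name": "smile", "Start": 0, "End": 500, "Data": "<expression model data>"},
    ],
    "Action": [      # body-movement model
        {"Name": "wave", "Start": 0, "End": 800, "Data": "<body-movement model data>"},
    ],
}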
FIG. 5 shows a schematic diagram of time synchronization among the elements of the audio model associated content provided by an embodiment of the present disclosure. The figure illustrates how the elements of the audio model file are synchronized and includes the time (the working time of the audio model file), the audio (the audio segments), the mouth shapes, the text, the body movements, the expressions, and the distribution of these elements on the time axis.
Taking "你好" ("hello") as an example, the corresponding audio is divided into two segments by character, and each segment has its own time description (start time and duration). The corresponding text is associated with the audio and carries the same time description. The mouth-shape models are related to the corresponding audio content; for example, the first audio segment in FIG. 5 corresponds to mouth-shape models 1, 2 and 3. A waving action is performed during the playback of the first audio segment, and a blinking behavior occurs at the same time. All elements work on the same time axis.
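Putting the "你好" example into the same illustrative structures (every time value below is invented; only the relationships between the elements mirror the figure):

hello_timeline = [
    {   # first character "你"
        "audio":      {"start": 0.0, "duration": 0.4},
        "text":       {"value": "你", "start": 0.0, "duration": 0.4},
        "lip":        [{"Name": "1", "Start": 0.0,  "End": 0.15},
                       {"Name": "2", "Start": 0.15, "End": 0.3},
                       {"Name": "3", "Start": 0.3,  "End": 0.4}],
        "action":     [{"Name": "wave",  "Start": 0.0, "End": 0.4}],
        "expression": [{"Name": "blink", "Start": 0.1, "End": 0.2}],
    },
    {   # second character "好"
        "audio": {"start": 0.4, "duration": 0.4},
        "text":  {"value": "好", "start": 0.4, "duration": 0.4},
        "lip":   [{"Name": "4", "Start": 0.4, "End": 0.8}],
    },
]
# All elements refer to the same time axis, so playback and behavior stay aligned.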
FIG. 6 is a schematic diagram of an apparatus for audio-driven avatar behavior provided by another embodiment of the present disclosure. The apparatus for audio-driven avatar behavior includes: a receiving module 501, a text information generation module 502, a behavior model generation module 503, an audio model association module 504, a driving module 505 and a synchronization module 506, wherein:
所述接收模块501,用于接收音频信息。The receiving module 501 is configured to receive audio information.
智能设备接收音频信息,本实施例中的智能设备以智能机器人为例,其具有拟人的形态,头部的显示屏上具有虚拟人像的五官显示,在智能设备接收到音频信息后,和配合虚拟人像中的口型,同步地将对应的语音播放出来,同时可配合机器人的虚拟人像的拟人表情,比如伤心、大笑、微笑、大哭、无奈、尴尬等表情。另外,该机器人还可实现其它行为,例如摆手、摊手、摇头、点头等,也可根据音频信息,配合虚拟人像口型和表情同步表现出来。本实施例中的接收音频信息,可以采用用户与智能机器人交互时,智能机器人实时采集用户的语音信息,将其作为音频信息的来源,也可调取外部或内部存储设备中的音频信息,音频信息的来源不限于此。The smart device receives audio information. The smart device in this embodiment takes an intelligent robot as an example. It has an anthropomorphic form, and the facial features of a virtual person are displayed on the display screen of the head. After the smart device receives the audio information, it cooperates with the virtual The mouth shape in the portrait will play the corresponding voice synchronously, and at the same time, it can match the anthropomorphic expressions of the robot's virtual portrait, such as sad, laughing, smiling, crying, helpless, embarrassing and other expressions. In addition, the robot can also realize other behaviors, such as waving hands, spreading hands, shaking the head, nodding, etc., and can also perform synchronously with the mouth shape and expression of the virtual portrait according to the audio information. The receiving audio information in this embodiment can adopt when the user interacts with the intelligent robot, the intelligent robot collects the voice information of the user in real time, and uses it as the source of the audio information, and also can call the audio information in the external or internal storage device, the audio The source of information is not limited to this.
所述文本信息生成模块502,用于根据所述音频信息生成文本信息。The text information generation module 502 is configured to generate text information according to the audio information.
用户与智能机器人对话交互的过程中涉及音频信息的输入和接收,用户与机器人的对话中包括涉及音频的对话信息,对话信息包括对话内容的特征数据,则获取对话信息的过程即为确定对话内容的特征数据的过程:获取原始文本信息,原始文本信息为对话内容所对应的文本信息;从该原始文本信息中提取文本特征数据;将文本特征数据,作为对话内容的特征数据。The process of dialogue and interaction between the user and the intelligent robot involves the input and reception of audio information. The dialogue between the user and the robot includes dialogue information involving audio, and the dialogue information includes the characteristic data of the dialogue content. The process of obtaining dialogue information is to determine the dialogue content. The process of feature data: obtaining original text information, which is the text information corresponding to the dialogue content; extracting text feature data from the original text information; using the text feature data as feature data of the dialogue content.
所述行为模型生成模块503,用于根据所述音频信息和文本信息结合场景信息生成行为模型。The behavior model generation module 503 is configured to generate a behavior model according to the audio information and text information combined with scene information.
根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据音视频内容按照时间生成对应的口型,表情以及动作的行为模型。所述根据所述音频信息和文本信息结合场景信息生成行为模型,具体的,根据接收的所述音频信息和对应的所述文本信息,结合所述场景信息,根据所述虚拟人像行为按照时间生成对应的口型、表情以及 动作的行为模型。其中场景信息包括所述音频信息的语义、语境以及所述虚拟人像行为的上下文场景。According to the incoming audio content and the corresponding text information, analyze the current semantics, context and the context scene of the avatar dialogue, and generate the corresponding mouth shape, expression and action behavior model according to the time according to the audio and video content. The behavior model is generated according to the audio information and text information combined with scene information, specifically, based on the received audio information and the corresponding text information, combined with the scene information, and according to the behavior of the avatar according to time. Behavioral models of corresponding mouth shapes, facial expressions, and actions. The scene information includes the semantics and context of the audio information and the context scene of the avatar's behavior.
所述音频模型关联模块504,用于将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容。The audio model association module 504 is configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content.
将所述音频信息、所述文本信息以及所述行为模型结合时间节点进行关联,形成音频模型关联内容,具体包括:根据所述音频信息、所述音频信息对应的文本信息以及所述行为模型,通过起始时间和持续时间对应的关键时间节点进行关联,形成以时间节点进行关联的音频模型关联内容,所述音频模型关联内容中包括所述音频信息、文本信息、行为模型内容以及关联关系。其中行为模型中的行为按照与音频信息的相关性进行类型排序,最相关的行为动作包含第一行为信息,本实施例中,与音频信息对应的口型动作为第一行为动作,与音频信息的关联性最强。与音频信息次相关的行为动作包含第二行为信息,本实施例中可将虚拟人像的表情动作作为第二行为动作,也可以采用肢体动作作为第二行为动作,其它行为以此按照与音频信息的相关度进行排序,本公开对此不做严格的限定。建立具有时间节点分布的时间轴,按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系,所述时间节点包括起始时间和持续时间对应的时间节点。Associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content, specifically including: according to the audio information, the text information corresponding to the audio information, and the behavior model, The key time nodes corresponding to the start time and duration are associated to form audio model associated content associated with time nodes, and the audio model associated content includes the audio information, text information, behavior model content, and association relationship. Wherein the behaviors in the behavior model are sorted according to the correlation with the audio information, and the most relevant behavior action includes the first behavior information. In this embodiment, the mouth movement corresponding to the audio information is the first behavior action, which is related to the audio information. the strongest correlation. The behaviors related to the audio information include the second behavior information. In this embodiment, the facial expressions of the avatar can be used as the second behavior, or body movements can be used as the second behavior. Other behaviors are based on the audio information. The correlation degree is sorted, which is not strictly limited in the present disclosure. Establish a time axis with a time node distribution, and associate audio information, text information, and behavior model information through the time nodes according to the time nodes on the time axis to form an association relationship associated with the time nodes, so The above time nodes include time nodes corresponding to the start time and the duration.
所述驱动模块505,用于根据所述模型关联内容对虚拟人像行为进行驱动。The driving module 505 is configured to drive the behavior of the avatar according to the associated content of the model.
首先将关联后的所述模型关联内容进行解析,获取所述音频信息、文本信息和行为模型信息,以及上述信息之间的关联关系;通过所述模型关联内容中的所述行为模型信息对虚拟人像行为进行驱动。通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的驱动。按照相关度顺序,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始 时间和持续时间对应的时间节点与所述音频信息进行驱动。然后将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行驱动;所述N为大于1的自然数。最后将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。Firstly, analyze the associated content of the model after association, obtain the audio information, text information and behavior model information, and the association relationship between the above information; use the behavior model information in the model associated content to virtual Portrait behavior is driven. Through the audio information and the association relationship, the driving between the audio information and the behavior of the avatar is performed. According to the order of correlation, drive the first behavior information in the behavior model information most related to the audio information through the time node corresponding to the start time and duration and the audio information. Then, the second behavior information related to the audio information in the behavior model information is driven through the time node corresponding to the start time and duration and the audio information; thus, the behavior model is The Nth line of information in the information that is sequentially related to the audio information is driven by the time node corresponding to the start time and the duration of the audio information; the N is a natural number greater than 1. Finally, the behavior information that is not related to the audio information in the behavior model information is driven according to the non-associated time node array. The non-associated time nodes are set as equal time intervals or discrete time intervals. The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
所述同步模块506,用于进行音频信息和所述虚拟人像行为之间的同步。The synchronization module 506 is used for synchronizing the audio information and the behavior of the avatar.
按照所述时间轴上的时间节点将音频信息,文本信息以及行为模型信息,通过所述时间节点进行关联,形成以所述时间节点相关联的关联关系;通过所述音频信息和所述关联关系,进行音频信息和所述虚拟人像行为之间的同步;所述时间节点包括起始时间和持续时间对应的时间节点。According to the time nodes on the time axis, audio information, text information and behavior model information are associated through the time nodes to form an association relationship associated with the time nodes; through the audio information and the association relationship , performing synchronization between the audio information and the behavior of the avatar; the time node includes a time node corresponding to a start time and a duration.
所述同步模块506,进一步用于:将所述音频信息划分为多个片段,每个片段有各自的起始时间和持续时间,将对应的所述文本信息与所述音频信息通过起始时间和持续时间对应的时间节点进行同步,将所述行为模型信息中与所述音频信息最相关的第一行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步。The synchronization module 506 is further configured to: divide the audio information into a plurality of segments, each segment has its own start time and duration, and pass the corresponding text information and the audio information through the start time The time node corresponding to the duration is synchronized, and the first behavior information most related to the audio information in the behavior model information is synchronized with the audio information through the time node corresponding to the start time and the duration.
所述同步模块506,进一步用于:将所述行为模型信息中与所述音频信息次相关的第二行为信息通过起始时间和持续时间对应的时间节点与所述音频信息进行同步;以此按照相关顺序,将所述行为模型信息中与所述音频信息依次相关的第N行为信息通过起始时间和持续时间对应的时间节点与所述 音频信息进行同步;所述N为大于1的自然数。The synchronization module 506 is further configured to: synchronize the second behavior information in the behavior model information that is secondarily related to the audio information with the audio information through the time node corresponding to the start time and duration; According to the relevant order, the Nth behavior information in the behavior model information that is sequentially related to the audio information is synchronized with the audio information through the time node corresponding to the start time and duration; the N is a natural number greater than 1 .
所述同步模块506,进一步用于:在所述时间轴上设置非关联时间节点阵列;将所述行为模型信息中与所述音频信息不相关的行为信息按照所述非关联时间节点阵列进行驱动。所述非关联时间节点设置成等时间间隔或离散时间间隔。The synchronization module 506 is further configured to: set a non-associated time node array on the time axis; drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array . The non-associated time nodes are set as equal time intervals or discrete time intervals.
所述最相关的第一行为信息为口型行为信息;所述次相关的第二行为信息为表情行为信息;所述依次相关的第N行为信息包括肢体动作行为信息。需要说明的是,本实施例的相关顺序并非严格的相关顺序,本领域技术人员可对相关的行为顺序进行调整,本公开对此不做严格限定。The most relevant first behavior information is lip behavior information; the second most relevant behavior information is expression behavior information; and the sequentially related Nth behavior information includes body movement behavior information. It should be noted that the related sequence in this embodiment is not a strict related sequence, and those skilled in the art can adjust the related behavior sequence, which is not strictly limited in the present disclosure.
The apparatus shown in FIG. 6 can execute the method of the embodiment shown in FIG. 1; for parts of this embodiment not described in detail, reference may be made to the relevant description of the embodiment shown in FIG. 1. For the execution process and technical effects of this technical solution, refer to the description of the embodiment shown in FIG. 1, which is not repeated here.
FIG. 2 shows a schematic diagram of a system for audio-driven avatar behavior provided by an embodiment of the present disclosure. Read together with the apparatus diagram of FIG. 6 and according to the logical relationships between the modules, the system diagram shows the logical relationships among the behavior model generation module, the audio model file generation module, the audio model parsing module and the avatar behavior driving module.
行为模型生成模块,根据传入的音频内容和对应的文本信息,分析当前的语义、语境以及虚拟人像对话的上下文场景,根据视频内容按照时间生成对应的口型,表情以及动作的行为模型。The behavior model generation module analyzes the current semantics, context and avatar dialogue context scene according to the incoming audio content and corresponding text information, and generates corresponding mouth shapes, expressions and action behavior models according to time according to the video content.
The audio model file generation module associates the audio, the text and the models according to the audio content, the text corresponding to the audio content, and the model data generated by the model generation module, using key time nodes such as the start time and duration of the node corresponding to each element during audio playback, so as to form an audio model file associated by time nodes. The file includes the time nodes (start time, duration), the audio file, the text and the model information. The audio model file generation module corresponds to part of the functions of the text information generation module and the audio model association module in FIG. 6.
The audio model file parsing module parses the audio model file to obtain the audio content, the model content, the text content, and the associations among them, including but not limited to time associations. This audio model file parsing module corresponds to the audio model association module in FIG. 5.
The avatar behavior driving module drives the avatar behavior according to the model content obtained by the audio model parsing module. Synchronization between audio playback and the action model is performed through the association between the audio content and the model. This avatar behavior driving module corresponds to the driving module in FIG. 5.
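One possible realization of this synchronization, sketched under the assumption that the segments follow the hypothetical file layout shown earlier and that apply_behavior is whatever engine-specific call actually animates the avatar, is to fire each behavior when the playback clock reaches its start node:

```python
import time
from typing import Callable, Dict, List


def drive_avatar(parsed_segments: List[Dict],
                 apply_behavior: Callable[[Dict], None],
                 playback_clock: Callable[[], float] = time.monotonic) -> None:
    """Fire each behavior when the audio playback clock reaches its start time node."""
    t0 = playback_clock()
    pending = sorted(
        (b for seg in parsed_segments for b in seg["behaviors"]),
        key=lambda b: b["start_s"],
    )
    for behavior in pending:
        # Wait until the audio timeline reaches this behavior's start node.
        while playback_clock() - t0 < behavior["start_s"]:
            time.sleep(0.005)
        apply_behavior(behavior)  # e.g. set a viseme, blend an expression, start an action


# Hypothetical usage with the structure from the previous sketch:
# drive_avatar(audio_model_file["segments"], apply_behavior=print)
```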
The apparatus for audio-driven avatar behavior further includes:
An audio playing module, which plays the audio content parsed by the above audio model file parsing module.
A text display module, which displays the text content parsed by the above audio model file parsing module.
Referring now to FIG. 7, a schematic structural diagram of an electronic device 700 suitable for implementing another embodiment of the present disclosure is shown. The terminal devices in the embodiments of the present disclosure may include, but are not limited to, mobile terminals such as mobile phones, notebook computers, digital broadcast receivers, PDAs (personal digital assistants), PADs (tablet computers), PMPs (portable multimedia players), and vehicle-mounted terminals (for example, vehicle-mounted navigation terminals), as well as fixed terminals such as digital TVs and desktop computers. The electronic device shown in FIG. 7 is merely an example and should not impose any limitation on the functions and scope of use of the embodiments of the present disclosure.
As shown in FIG. 7, the electronic device 700 may include a processing device (for example, a central processing unit, a graphics processing unit, etc.) 701, which may perform various appropriate actions and processes according to a program stored in a read-only memory (ROM) 702 or a program loaded from a storage device 708 into a random access memory (RAM) 703. The RAM 703 also stores various programs and data required for the operation of the electronic device 700. The processing device 701, the ROM 702, and the RAM 703 are connected to one another through a communication line 704. An input/output (I/O) interface 705 is also connected to the communication line 704.
Generally, the following devices may be connected to the I/O interface 705: an input device 706 including, for example, a touch screen, a touch pad, a keyboard, a mouse, a camera, a microphone, an accelerometer, and a gyroscope; an output device 707 including, for example, a liquid crystal display (LCD), a speaker, and a vibrator; a storage device 708 including, for example, a magnetic tape and a hard disk; and a communication device 709. The communication device 709 may allow the electronic device 700 to communicate wirelessly or by wire with other devices to exchange data. Although FIG. 7 shows the electronic device 700 with various devices, it should be understood that it is not required to implement or provide all of the devices shown; more or fewer devices may alternatively be implemented or provided.
In particular, according to the embodiments of the present disclosure, the processes described above with reference to the flowcharts may be implemented as computer software programs. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a non-transitory computer-readable medium, the computer program containing program code for executing the method shown in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network via the communication device 709, installed from the storage device 708, or installed from the ROM 702. When the computer program is executed by the processing device 701, the above functions defined in the methods of the embodiments of the present disclosure are performed.
It should be noted that the computer-readable medium described above in the present disclosure may be a computer-readable signal medium or a computer-readable storage medium, or any combination of the two. A computer-readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples of computer-readable storage media may include, but are not limited to: an electrical connection with one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device. In the present disclosure, a computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, in which computer-readable program code is carried. Such a propagated data signal may take many forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination of the above. A computer-readable signal medium may also be any computer-readable medium other than a computer-readable storage medium, and the computer-readable signal medium can send, propagate, or transmit a program for use by or in combination with an instruction execution system, apparatus, or device. The program code contained on the computer-readable medium may be transmitted by any appropriate medium, including but not limited to: a wire, an optical cable, RF (radio frequency), and the like, or any suitable combination of the above.
In some embodiments, the client and the server may communicate using any currently known or future-developed network protocol such as HTTP (HyperText Transfer Protocol), and may be interconnected with digital data communication (for example, a communication network) in any form or medium. Examples of communication networks include a local area network ("LAN"), a wide area network ("WAN"), an internetwork (for example, the Internet), and a peer-to-peer network (for example, an ad hoc peer-to-peer network), as well as any currently known or future-developed network.
The above computer-readable medium may be included in the above electronic device, or may exist independently without being assembled into the electronic device.
The above computer-readable medium carries one or more programs, and when the one or more programs are executed by the electronic device, the electronic device is caused to execute the interaction method in the above embodiments.
Computer program code for performing the operations of the present disclosure may be written in one or more programming languages or a combination thereof, including but not limited to object-oriented programming languages such as Java, Smalltalk, and C++, as well as conventional procedural programming languages such as the "C" language or similar programming languages. The program code may be executed entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on a remote computer or server. In the case involving a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computer (for example, through the Internet by using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate possible architectures, functions, and operations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in a flowchart or block diagram may represent a module, a program segment, or a portion of code, and the module, program segment, or portion of code contains one or more executable instructions for implementing the specified logical function. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur out of the order noted in the drawings. For example, two blocks shown in succession may in fact be executed substantially concurrently, or they may sometimes be executed in the reverse order, depending upon the functions involved. It should also be noted that each block of the block diagrams and/or flowcharts, and combinations of blocks in the block diagrams and/or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions.
The units involved in the embodiments described in the present disclosure may be implemented by software or by hardware. The name of a unit does not constitute a limitation on the unit itself under certain circumstances.
The functions described herein above may be performed at least in part by one or more hardware logic components. For example, without limitation, exemplary types of hardware logic components that may be used include: field-programmable gate arrays (FPGAs), application-specific integrated circuits (ASICs), application-specific standard products (ASSPs), systems on chips (SOCs), complex programmable logic devices (CPLDs), and so on.
In the context of the present disclosure, a machine-readable medium may be a tangible medium that may contain or store a program for use by or in combination with an instruction execution system, apparatus, or device. A machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the above. More specific examples of machine-readable storage media would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
According to one or more embodiments of the present disclosure, an electronic device is provided, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor, so that the at least one processor can execute any one of the methods in the foregoing first aspect.
According to one or more embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided, wherein the non-transitory computer-readable storage medium stores computer instructions, and the computer instructions are used to cause a computer to execute any one of the methods in the foregoing first aspect.
The above description is merely a description of the preferred embodiments of the present disclosure and of the technical principles applied. Those skilled in the art should understand that the scope of disclosure involved in the present disclosure is not limited to technical solutions formed by the specific combinations of the above technical features, but also covers other technical solutions formed by any combination of the above technical features or their equivalent features without departing from the above disclosed concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present disclosure.

Claims (25)

  1. A method for audio-driven avatar behavior, characterized in that it comprises:
    receiving audio information;
    generating text information according to the audio information;
    generating a behavior model according to the audio information and the text information in combination with scene information;
    associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
    driving avatar behavior according to the model associated content; and
    performing synchronization between the audio information and the avatar behavior.
  2. The method according to claim 1, characterized in that the generating a behavior model according to the audio information and the text information in combination with scene information comprises:
    according to the received audio information and the corresponding text information, in combination with the scene information, generating corresponding behavior models of mouth shapes, expressions, and actions over time for the avatar behavior.
  3. The method according to claim 2, characterized in that the scene information comprises the semantics and context of the audio information and the contextual scene of the avatar behavior.
  4. The method according to claim 1, characterized in that the associating the audio information, the text information, and the behavior model with time nodes to form audio model associated content comprises:
    according to the audio information, the text information corresponding to the audio information, and the behavior model, performing association through key time nodes corresponding to the start time and the duration, to form audio model associated content associated by time nodes;
    the audio model associated content comprising the audio information, the text information, the behavior model content, and the association relationship.
  5. The method according to claim 1, characterized in that the driving avatar behavior according to the model associated content comprises:
    parsing the associated model associated content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship among the above information;
    driving the avatar behavior through the behavior model information in the model associated content.
  6. The method according to claim 1, characterized in that the performing synchronization between the audio information and the avatar behavior comprises:
    establishing a time axis with a distribution of time nodes;
    associating the audio information, the text information, and the behavior model information through the time nodes on the time axis, to form an association relationship associated by the time nodes;
    performing synchronization between the audio information and the avatar behavior through the audio information and the association relationship;
    the time nodes comprising time nodes corresponding to the start time and the duration.
  7. The method according to claim 6, characterized in that the performing synchronization between the audio information and the avatar behavior further comprises:
    dividing the audio information into a plurality of segments, each segment having its own start time and duration;
    synchronizing the corresponding text information with the audio information through time nodes corresponding to the start time and the duration;
    synchronizing first behavior information in the behavior model information that is most relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration.
  8. The method according to claim 7, characterized in that the performing synchronization between the audio information and the avatar behavior further comprises:
    synchronizing second behavior information in the behavior model information that is secondly relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    in this manner, in order of relevance, synchronizing Nth behavior information in the behavior model information that is sequentially relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    N being a natural number greater than 1.
  9. The method according to claim 8, characterized in that the method further comprises:
    setting a non-associated time node array on the time axis;
    driving behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array.
  10. The method according to claim 9, characterized in that the non-associated time nodes are set at equal time intervals or discrete time intervals.
  11. The method according to claim 8, characterized in that the most relevant first behavior information is mouth shape behavior information; the secondly relevant second behavior information is expression behavior information; and the sequentially relevant Nth behavior information comprises body movement behavior information.
  12. An apparatus for audio-driven avatar behavior, characterized in that it comprises:
    a receiving module, configured to receive audio information;
    a text information generation module, configured to generate text information according to the audio information;
    a behavior model generation module, configured to generate a behavior model according to the audio information and the text information in combination with scene information;
    an audio model association module, configured to associate the audio information, the text information, and the behavior model with time nodes to form audio model associated content;
    a driving module, configured to drive avatar behavior according to the model associated content; and
    a synchronization module, configured to perform synchronization between the audio information and the avatar behavior.
  13. The apparatus according to claim 12, characterized in that the behavior model generation module is specifically configured to:
    according to the received audio information and the corresponding text information, in combination with the scene information, generate corresponding behavior models of mouth shapes, expressions, and actions over time for the avatar behavior.
  14. The apparatus according to claim 13, characterized in that the scene information comprises the semantics and context of the audio information and the contextual scene of the avatar behavior.
  15. The apparatus according to claim 12, characterized in that the audio model association module is specifically configured to:
    according to the audio information, the text information corresponding to the audio information, and the behavior model, perform association through key time nodes corresponding to the start time and the duration, to form audio model associated content associated by time nodes;
    the audio model associated content comprising the audio information, the text information, the behavior model content, and the association relationship.
  16. The apparatus according to claim 12, characterized in that the driving module is specifically configured to:
    parse the associated model associated content to obtain the audio information, the text information, and the behavior model information, as well as the association relationship among the above information;
    drive the avatar behavior through the behavior model information in the model associated content.
  17. The apparatus according to claim 12, characterized in that the synchronization module is specifically configured to:
    establish a time axis with a distribution of time nodes;
    associate the audio information, the text information, and the behavior model information through the time nodes on the time axis, to form an association relationship associated by the time nodes;
    perform synchronization between the audio information and the avatar behavior through the audio information and the association relationship;
    the time nodes comprising time nodes corresponding to the start time and the duration.
  18. The apparatus according to claim 17, characterized in that the synchronization module is further configured to:
    divide the audio information into a plurality of segments, each segment having its own start time and duration;
    synchronize the corresponding text information with the audio information through time nodes corresponding to the start time and the duration;
    synchronize first behavior information in the behavior model information that is most relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration.
  19. The apparatus according to claim 18, characterized in that the synchronization module is further configured to:
    synchronize second behavior information in the behavior model information that is secondly relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    in this manner, in order of relevance, synchronize Nth behavior information in the behavior model information that is sequentially relevant to the audio information with the audio information through time nodes corresponding to the start time and the duration;
    N being a natural number greater than 1.
  20. The apparatus according to claim 19, characterized in that the synchronization module is further configured to:
    set a non-associated time node array on the time axis;
    drive behavior information in the behavior model information that is not related to the audio information according to the non-associated time node array.
  21. The apparatus according to claim 20, characterized in that the non-associated time nodes are set at equal time intervals or discrete time intervals.
  22. The apparatus according to claim 19, characterized in that the most relevant first behavior information is mouth shape behavior information; the secondly relevant second behavior information is expression behavior information; and the sequentially relevant Nth behavior information comprises body movement behavior information.
  23. An electronic device, comprising:
    a memory, configured to store computer-readable instructions; and
    a processor, configured to run the computer-readable instructions, so that the electronic device implements the method according to any one of claims 1-11.
  24. A computer storage medium, wherein at least one executable instruction is stored in the storage medium, and the executable instruction causes a processor to execute the steps of the method for audio-driven avatar behavior according to any one of claims 1-11.
  25. A computer program, comprising instructions which, when run on a computer, cause the computer to execute the method for audio-driven avatar behavior according to any one of claims 1-11.
PCT/CN2022/084697 2021-08-03 2022-03-31 Method and apparatus for audio driving of avatar, and electronic device WO2023010873A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110888459.X 2021-08-03
CN202110888459.XA CN115220682A (en) 2021-08-03 2021-08-03 Method and device for driving virtual portrait by audio and electronic equipment

Publications (1)

Publication Number Publication Date
WO2023010873A1 true WO2023010873A1 (en) 2023-02-09

Family

ID=83605992

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/084697 WO2023010873A1 (en) 2021-08-03 2022-03-31 Method and apparatus for audio driving of avatar, and electronic device

Country Status (2)

Country Link
CN (1) CN115220682A (en)
WO (1) WO2023010873A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116778041B (en) * 2023-08-22 2023-12-12 北京百度网讯科技有限公司 Multi-mode-based face image generation method, model training method and equipment
CN117058286B (en) * 2023-10-13 2024-01-23 北京蔚领时代科技有限公司 Method and device for generating video by using word driving digital person

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN112667068A (en) * 2019-09-30 2021-04-16 北京百度网讯科技有限公司 Virtual character driving method, device, equipment and storage medium
US20210192824A1 (en) * 2018-07-10 2021-06-24 Microsoft Technology Licensing, Llc Automatically generating motions of an avatar
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20210192824A1 (en) * 2018-07-10 2021-06-24 Microsoft Technology Licensing, Llc Automatically generating motions of an avatar
CN112667068A (en) * 2019-09-30 2021-04-16 北京百度网讯科技有限公司 Virtual character driving method, device, equipment and storage medium
CN112669846A (en) * 2021-03-16 2021-04-16 深圳追一科技有限公司 Interactive system, method, device, electronic equipment and storage medium
CN113192161A (en) * 2021-04-22 2021-07-30 清华珠三角研究院 Virtual human image video generation method, system, device and storage medium

Also Published As

Publication number Publication date
CN115220682A (en) 2022-10-21

Similar Documents

Publication Publication Date Title
US11158102B2 (en) Method and apparatus for processing information
WO2023010873A1 (en) Method and apparatus for audio driving of avatar, and electronic device
WO2022121601A1 (en) Live streaming interaction method and apparatus, and device and medium
US20240107127A1 (en) Video display method and apparatus, video processing method, apparatus, and system, device, and medium
JP6936298B2 (en) Methods and devices for controlling changes in the mouth shape of 3D virtual portraits
JP2021192222A (en) Video image interactive method and apparatus, electronic device, computer readable storage medium, and computer program
EP4099709A1 (en) Data processing method and apparatus, device, and readable storage medium
WO2023125374A1 (en) Image processing method and apparatus, electronic device, and storage medium
WO2022170848A1 (en) Human-computer interaction method, apparatus and system, electronic device and computer medium
JP2022551660A (en) SCENE INTERACTION METHOD AND DEVICE, ELECTRONIC DEVICE AND COMPUTER PROGRAM
CN111145777A (en) Virtual image display method and device, electronic equipment and storage medium
US20230421716A1 (en) Video processing method and apparatus, electronic device and storage medium
WO2023083142A1 (en) Sentence segmentation method and apparatus, storage medium, and electronic device
CN109600559B (en) Video special effect adding method and device, terminal equipment and storage medium
CN110753238A (en) Video processing method, device, terminal and storage medium
WO2021057740A1 (en) Video generation method and apparatus, electronic device, and computer readable medium
EP4343614A1 (en) Information processing method and apparatus, device, readable storage medium and product
CN112364144B (en) Interaction method, device, equipment and computer readable medium
CN115691544A (en) Training of virtual image mouth shape driving model and driving method, device and equipment thereof
US20240038273A1 (en) Video generation method and apparatus, electronic device, and storage medium
JP6949931B2 (en) Methods and devices for generating information
CN112785669B (en) Virtual image synthesis method, device, equipment and storage medium
WO2023065963A1 (en) Interactive display method and apparatus, electronic device, and storage medium
CN115222857A (en) Method, apparatus, electronic device and computer readable medium for generating avatar
CN112235183B (en) Communication message processing method and device and instant communication client

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22851598

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE