WO2021057424A1 - Text-based avatar behavior control method, device, and medium


Info

Publication number
WO2021057424A1
Authority
WO
WIPO (PCT)
Prior art keywords
behavior
network
text
coding
content
Prior art date
Application number
PCT/CN2020/113147
Other languages
English (en)
French (fr)
Inventor
解静
李丕绩
段弘
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP20867870.6A (published as EP3926525A4)
Priority to JP2021564427A (published as JP7210774B2)
Publication of WO2021057424A1
Priority to US17/480,112 (published as US11714879B2)

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/011Arrangements for interaction with the human body, e.g. for user immersion in virtual reality
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2203/00Indexing scheme relating to G06F3/00 - G06F3/048
    • G06F2203/01Indexing scheme relating to G06F3/01
    • G06F2203/011Emotion or mood input determined on the basis of sensed human body parameters such as pulse, heart rate or beat, temperature of skin, facial expressions, iris, voice pitch, brain activity patterns
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/43Editing text-bitmaps, e.g. alignment, spacing; Semantic analysis of bitmaps of text without OCR

Definitions

  • The present disclosure relates to the technical field of artificial intelligence (AI), and more specifically, to a text-based avatar behavior control method, device, and medium.
  • An avatar (virtual image) refers to a digitization of the human body structure by computer technology, so that a visible and controllable virtual body form appears on a computer screen. The avatar can be based on a real person or on a cartoon character. Both academia and industry are trying different ways to construct avatars that can serve and entertain the public 24 hours a day.
  • The embodiments of the present application provide a text-based avatar behavior control method, device, and medium, which can control an avatar to make expressions and actions that suit the text and resemble a real person, without being driven by a real person.
  • According to one aspect, a text-based avatar behavior control method is provided, including: inserting a specific symbol into the text, and generating multiple input vectors corresponding to the specific symbol and to each element in the text, where the specific symbol is a symbol used to represent text classification; inputting the multiple input vectors into a first coding network, where the first coding network includes at least one layer of network nodes, and determining the behavior trigger position in the text based on the attention vector of the network node corresponding to the specific symbol, where each element in the attention vector indicates the attention weight from the network node corresponding to the specific symbol to each network node in the same layer as that network node; determining the behavior content based on the first coding vector that is output from the first coding network and corresponds to the specific symbol; and playing the audio corresponding to the text and, when playback reaches the behavior trigger position, controlling the avatar to present the behavior content.
  • According to another aspect, a text-based avatar behavior control device is provided, including: a vectorization device for inserting a specific symbol into the text, and generating multiple input vectors corresponding to the specific symbol and to each element in the text, where the specific symbol is a symbol used to represent text classification; a behavior trigger position determining device for inputting the multiple input vectors into a first coding network, where the first coding network includes at least one layer of network nodes, and determining the behavior trigger position in the text based on the attention vector of the network node corresponding to the specific symbol, where each element in the attention vector indicates the attention weight from the network node corresponding to the specific symbol to each network node in the same layer as that network node; a behavior content determining device for determining the behavior content based on the first coding vector that is output from the first coding network and corresponds to the specific symbol; and a behavior presentation device for playing the audio corresponding to the text and, when playback reaches the behavior trigger position, controlling the avatar to present the behavior content.
  • the behavior trigger position determining device is further configured to: for each layer of the first coding network, calculate the attention vector of the node corresponding to the specific symbol in the layer , Determine the average value of the attention vectors of all layers to obtain the average attention vector; and determine the behavior trigger position based on the index position of the element with the largest value in the average attention vector.
  • the first encoding network outputs a plurality of first encoding vectors corresponding to each input vector and fused with the semantics of each element of the context
  • the behavior content determining device is further configured To: input the first coding vector output from the first coding network and corresponding to the specific symbol into the first classification network; determine the behavior category corresponding to the text based on the output of the first classification network; And based on at least the behavior category, the behavior content is determined through a specific behavior mapping.
  • the specific behavior mapping includes a behavior mapping table, and determining the behavior content through a specific behavior mapping at least based on the behavior category further includes: , Find the behavior content corresponding to the behavior category, and determine it as the behavior content.
  • the specific behavior mapping is different for different application scenarios of the avatar.
  • In some embodiments, the output of the first classification network is a behavior prediction vector whose dimension is the same as the number of behavior categories, and each element of the behavior prediction vector represents the probability that the text corresponds to the respective behavior category. The behavior content determining device is further configured to determine the behavior category corresponding to the text based on the output of the first classification network by executing the following processing: determining the maximum probability value in the behavior prediction vector; when the maximum probability value is greater than a predetermined threshold, taking the behavior category corresponding to the maximum probability value as the behavior category corresponding to the text; and otherwise, determining a specific category, different from the behavior category corresponding to the maximum probability value, as the behavior category corresponding to the text.
  • In some embodiments, the behavior content determining device is further configured to: input the multiple input vectors into a second coding network; input the second coding vector that is output from the second coding network and corresponds to the specific symbol into a second classification network; and determine the emotion category corresponding to the text based on the output of the second classification network. In this case, the behavior content determining device is further configured to determine the behavior content through the specific behavior mapping based on both the behavior category and the emotion category. The behavior content includes at least one of action content and expression content.
  • In some embodiments, the first coding network includes a third coding sub-network and a fourth coding sub-network, and the behavior trigger position determining device is further configured to: input the multiple input vectors into the third coding sub-network, where the third coding sub-network includes at least one layer of network nodes, and determine the expression trigger position in the text based on the attention vector of the network node in the third coding sub-network corresponding to the specific symbol; and input the multiple input vectors into the fourth coding sub-network, where the fourth coding sub-network includes at least one layer of network nodes, and determine the action trigger position in the text based on the attention vector of the network node in the fourth coding sub-network corresponding to the specific symbol.
  • In some embodiments, the behavior presentation device is further configured to adjust behavior change parameters of the avatar based on the behavior content, so that the avatar changes continuously from not presenting the behavior content to presenting the behavior content. The behavior change parameters include at least one of the following: behavior appearance time, behavior end time, and behavior change coefficient.
  • According to another aspect, a computer device is provided, including: a processor; and a memory connected to the processor. The memory stores machine-readable instructions which, when executed by the processor, cause the processor to execute the method described above.
  • According to another aspect, a computer-readable storage medium is provided, having machine-readable instructions stored thereon. The machine-readable instructions, when executed by a processor, cause the processor to execute the method described above.
  • FIG. 1 is a flowchart illustrating the specific process of a text-based avatar behavior control method according to an embodiment of the present disclosure;
  • FIG. 2 is a schematic diagram of the internal structure of the first coding network in some embodiments of the present application;
  • FIG. 3 is a schematic diagram of the attention mechanism in some embodiments of the present application;
  • FIG. 4 is a schematic diagram of the input and output of the first coding network and the first classification network in some embodiments of the present application;
  • FIG. 5 is a flowchart showing the specific process of S103 in FIG. 1;
  • FIG. 6 is a product flowchart showing the behavior control of an avatar according to an embodiment of the present disclosure;
  • FIG. 7 shows an example of an expression mapping table in some embodiments of the present application;
  • FIG. 8 is a schematic diagram of a behavior generation process according to an embodiment of the present disclosure;
  • FIG. 9 is a functional block diagram illustrating the configuration of a text-based avatar behavior control device according to an embodiment of the present disclosure; and
  • FIG. 10 is a schematic diagram showing the architecture of an exemplary computing device according to an embodiment of the present disclosure.
  • Artificial intelligence (AI) is a theory, method, technology, and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. Artificial intelligence software technology mainly includes computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
  • Machine learning (ML) is a multi-field interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, and other subjects. It specializes in studying how computers can simulate or realize human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to make computers intelligent, and its applications cover all fields of artificial intelligence. Machine learning and deep learning usually include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
  • At present, the technical solutions for constructing avatars are mainly divided into two categories. The first category is driven by real people: through a motion capture device, the body and expression data of a real actor is captured, and this data is then used to drive a 3D or 2D avatar to display those actions and expressions. The second category is data-driven: through TTS (Text To Speech), the avatar reads the input text content aloud. However, such an avatar does not show any expressions or actions, so it can only be applied to scenes that rarely require expressions and actions, such as news hosting.
  • In the embodiments of the present application, the avatar is driven by data, rather than by a real person, to present the corresponding behaviors, so it can run uninterruptedly and be personalized for a large number of users. Different categories of data are extracted based on the text and then mapped to the behavior of the avatar, so that the triggered behavior suits the current text and, compared with other technologies, the behavior is rich. Because the behavior of the avatar is determined based on predetermined mapping rules, the scalability is strong and the behavior content can be continuously enriched; at the same time, only the mapping rules need to be updated to enable the avatar to present new behaviors.
  • The specific form of the avatar can be a stand-in image identical to a real person, or a completely virtual cartoon image. For example, in a news broadcast scenario, the avatar may be a stand-in image of a real announcer. Such an avatar can not only generate news broadcast videos in a short time based on text while guaranteeing "zero error" in the news content, but can also quickly start work in various scenarios and broadcast 24 hours a day, helping to increase efficiency in the media industry. As another example, cartoon characters acting as different game characters can show rich behaviors based on text and can perform their role tasks 24 hours a day, such as 24-hour game commentary and 24-hour companion chat.
  • The method can be executed by an electronic device and includes the following operations.
  • S101: Insert a specific symbol into the text, and generate multiple input vectors corresponding to the specific symbol and to each element in the text.
  • In the present disclosure, the text is usually a sentence. The specific symbol may be a CLS (classification) symbol used to represent text classification; what is actually inserted in S101 may be an original vector corresponding to the CLS symbol. The insertion position of the specific symbol in the text may be arbitrary: the specific symbol may be inserted before the text, after the text, or in the middle of the text.
  • First, each element contained in the text is segmented. An element can be a character or a word; that is, the text can be segmented in units of characters or in units of words. Then, the specific symbol and each element in the text are converted into a series of vectors that can express the semantics of the text; that is, the specific symbol and each element in the text are mapped, or embedded, into another numerical vector space to generate the corresponding multiple input vectors. An illustrative sketch of this vectorization follows.
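  • As a purely illustrative sketch (not part of the patent text), S101 can be pictured as inserting a [CLS] token and looking up an embedding vector for the token and for each character of the text; the toy vocabulary, example sentence, embedding dimension, and function name below are assumptions chosen only for illustration.

      import numpy as np

      # Minimal sketch of S101: insert the [CLS] symbol and map each element of
      # the text to an input vector via an embedding lookup. The vocabulary,
      # embedding dimension, and random initialization are illustrative only.
      rng = np.random.default_rng(0)
      EMBED_DIM = 8
      vocab = {"[CLS]": 0, "今": 1, "天": 2, "真": 3, "开": 4, "心": 5}
      embedding_table = rng.normal(size=(len(vocab), EMBED_DIM))

      def vectorize(text: str) -> np.ndarray:
          """Insert [CLS] before the text, segment it into characters,
          and return one input vector per element (specific symbol included)."""
          elements = ["[CLS]"] + list(text)   # character-level segmentation
          ids = [vocab[e] for e in elements]  # element -> vocabulary index
          return embedding_table[ids]         # shape: (len(elements), EMBED_DIM)

      input_vectors = vectorize("今天真开心")
      print(input_vectors.shape)  # (6, 8): [CLS] plus five characters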
  • FIG. 2 shows a schematic diagram of the internal structure of the first coding network in some embodiments of the present application. The input of the first coding network is the original vector of each character/word/specific symbol obtained in S101, and the output is the vector representation of each character/word/specific symbol fused with the full-text semantic information. For example, the first network node in the first layer computes, as its coding vector, the weighted sum of the input vector of the first element (the element corresponding to that node) and the input vectors of each element of its context; this coding vector is then provided as input to the first network node in the second layer, and so on until the first network node in the last layer, so as to obtain the final first coding output fused with the semantic information of the full text.
  • In FIG. 2, the first coding network includes multiple layers of network nodes. Of course, the present disclosure is not limited to this; the first coding network may also include only one layer of network nodes.
  • For example, the first coding network may be implemented by a BERT (Bidirectional Encoder Representations from Transformers) model. The goal of the BERT model is to obtain, by training on a large-scale unlabeled corpus, a semantic representation of the text that contains rich semantic information, then fine-tune this semantic representation for a specific natural language processing (NLP) task, and finally apply it to that NLP task. The input of the BERT model is the original vector of each character/word in the text obtained in S101, and the output is the vector representation of each character/word in the text fused with the semantic information of the full text.
  • The BERT model is a model based on the attention mechanism. The main function of the attention mechanism is to allow the neural network to focus its "attention" on a part of the input, that is, to distinguish the effects of different parts of the input on the output. Here, the attention mechanism is described from the perspective of enhancing the semantic representation of characters/words.
  • FIG. 3 shows a schematic diagram of the attention mechanism in some embodiments of the present application. The first element of the input (a character, a word, or the specific symbol) is taken as an example to describe the calculation process of the attention mechanism: the first element of the input is taken as the target element, and the first network node in the first-layer coding network, corresponding to the first element, is taken as the target network node.
  • The attention mechanism takes the semantic vector representations of the target element and of each element of the context as input. First, through specific matrix transformations, it obtains the Query vector of the target element, the Key vector of each element of the context, and the Value vectors of the target element and of each element of the context. Specifically, the Query vector is created based on a trained transformation matrix W_Q, and the Key vector and the Value vector are created based on trained transformation matrices W_K and W_V, respectively; that is, these vectors are obtained by multiplying the input vectors by the three trained transformation matrices W_Q, W_K, and W_V. The attention weight a_1i from the target element to the i-th element of the input is then computed from the Query vector of the target element and the Key vector of the i-th element (for example, by a scaled dot product followed by softmax normalization), where i is an integer from 1 to n and n is the number of elements of the input.
  • These attention weights form the attention vector of the target network node in the first-layer coding network. Each element in the attention vector of the target network node indicates the attention weight from the target network node to one of the network nodes of the context (that is, to one of the network nodes in the same layer). Based on this, the attention output of the target element is obtained: the attention vector corresponding to the target network node is used as the weights, and the Value vector of the target element input to the target network node and the Value vectors of each element of the context are weighted and merged as the coding output of the target network node, namely the enhanced semantic vector representation of the target element. In other words, the attention output of the target network node can be calculated as the weighted sum Σ_i a_1i·v_i, where a_1i is the attention weight from the target network node to the i-th node and v_i is the Value vector of the i-th element. A runnable illustration of this single-node computation is sketched below.
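  • The following is a minimal sketch of the single-node attention computation described above, using the standard scaled-dot-product formulation; the matrix shapes, random toy inputs, and function name are assumptions rather than the patent's exact implementation.

      import numpy as np

      def attention_output_for_target(X, W_Q, W_K, W_V):
          """X: (n, d) input vectors for the target element (row 0) and its context.
          Returns the coding output of the target network node and its attention vector."""
          Q = X @ W_Q                          # Query vectors, one per element
          K = X @ W_K                          # Key vectors
          V = X @ W_V                          # Value vectors
          d_k = K.shape[1]
          scores = Q[0] @ K.T / np.sqrt(d_k)   # similarity of the target Query to every Key
          a = np.exp(scores - scores.max())
          a = a / a.sum()                      # attention vector a_1i (softmax-normalized)
          output = a @ V                       # weighted merge of the Value vectors
          return output, a

      # Toy example: 4 elements (e.g. [CLS] plus 3 characters), dimension 8.
      rng = np.random.default_rng(0)
      X = rng.normal(size=(4, 8))
      W_Q, W_K, W_V = (rng.normal(size=(8, 8)) for _ in range(3))
      out, attn = attention_output_for_target(X, W_Q, W_K, W_V)
      print(attn.round(3), out.shape)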
  • The attention output shown in FIG. 3 corresponds to the coding output of the first network node in the first-layer coding network in FIG. 2. When the coding network has only one layer, the attention output shown in FIG. 3 is the final coding output corresponding to the first element of the input. When the coding network has multiple layers, the attention output of the first network node of the first layer shown in FIG. 3 is provided as input to the first network node of the second-layer coding network and, by a similar method, the coding output of the first network node of the second-layer coding network is obtained; this process is repeated layer by layer until the last layer. The coding output of the first network node in the last layer of the coding network is then the final coding output corresponding to the first element of the input.
  • In other words, in a multi-layer coding network, the attention vector of the network node corresponding to the target element is calculated in each layer. In each layer, this attention vector is used as the weights, all the vectors input to that layer are weighted and summed, and the resulting weighted sum is used as the coding vector output by the current layer, which incorporates contextual semantics. The output of the current layer is then further used as the input of the next layer, and the same processing is repeated. Assuming the coding network has L layers, L attention vectors corresponding to the target element will be obtained, respectively corresponding to the L layers of the coding network.
  • Similarly, each element in the attention vector of the network node corresponding to the specific symbol indicates the attention weight from the network node corresponding to the specific symbol to each network node in the same layer. When the specific symbol is inserted before the text, the network node corresponding to the specific symbol is the first network node in each layer of the coding network, and the attention vectors of the network nodes corresponding to the specific symbol include the attention vector of the first network node in each layer.
  • In the present disclosure, the behavior may include at least one of an action and an expression. Since the avatar makes corresponding expressions or actions based on the text, it is necessary not only to determine, based on the text, the specific content of the behavior that the avatar should present, but also to determine at which element of the text, that is, while the audio corresponding to which character/word is being played, the corresponding behavior should be presented. The element position in the text corresponding to the moment when the avatar presents the corresponding behavior is the behavior trigger position.
  • As described above, contextual character/word information is used to enhance the semantic representation of the target character/word, and a CLS (classification) symbol used to represent text classification is further inserted. The inserted CLS symbol has no obvious semantic information of its own, so this symbol integrates the semantic information of each character/word in the text more "fairly". Therefore, the weight value of each element in the attention vector of the network node corresponding to the CLS symbol can reflect the importance of each character/word in the text: a larger attention weight value indicates a more important character/word.
  • For the avatar, it is considered appropriate to present the corresponding behavior at the most important character/word position in the text, so the most important character/word position in the text is used as the behavior trigger position. Since the attention vector of the network node corresponding to the specific symbol can reflect the importance of each character/word in the text, the behavior trigger position in the text can be determined based on the attention vector of the network node corresponding to the specific symbol. That is, the behavior trigger position p can be determined as p = argmax_i a_1i; in other words, the index i at which a_1i (the attention weight from the network node corresponding to the specific symbol to the i-th network node) takes its maximum value is assigned to p.
  • In some embodiments, when the first coding network has multiple layers of network nodes, determining the behavior trigger position in the text based on the attention vector of the network node corresponding to the specific symbol in S102 further includes: for each layer of the first coding network, calculating the attention vector from the node corresponding to the specific symbol to each node in that layer; averaging the attention vectors of all layers to obtain an average attention vector; and determining the behavior trigger position based on the index position of the element with the largest value in the average attention vector. Specifically, since there is a network node corresponding to the specific symbol in each layer, the attention vector of that node is calculated in each layer. Assuming the first coding network has L layers, L attention vectors of the L network nodes corresponding to the specific symbol are obtained. These L attention vectors are first averaged to obtain the average attention vector, and the behavior trigger position p is then the index of the largest element of this average attention vector; that is, the index i at which the averaged weight takes its maximum value is assigned to p. A sketch of this computation with a pretrained model is given below.
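  • As one possible realization, which is an assumption and not prescribed by the patent, the averaged attention of the [CLS] token and the resulting trigger position can be obtained from a pretrained Chinese BERT via the Hugging Face transformers library as sketched below; the model name, the additional averaging over attention heads, and the zeroing of the [CLS]-to-[CLS] weight are illustrative choices.

      import torch
      from transformers import BertModel, BertTokenizer

      tokenizer = BertTokenizer.from_pretrained("bert-base-chinese")
      model = BertModel.from_pretrained("bert-base-chinese", output_attentions=True)

      def behavior_trigger_position(text: str) -> int:
          """Average the [CLS] attention vectors over all layers (and heads) and
          return the index of the most-attended token as the behavior trigger position."""
          inputs = tokenizer(text, return_tensors="pt")  # tokenizer inserts [CLS] itself
          with torch.no_grad():
              outputs = model(**inputs)
          # outputs.attentions: tuple of L tensors, each (batch, heads, seq_len, seq_len)
          cls_rows = torch.stack([layer[0, :, 0, :] for layer in outputs.attentions])
          avg_attention = cls_rows.mean(dim=(0, 1))      # average over layers and heads
          avg_attention[0] = 0.0                         # extra heuristic: ignore [CLS] itself
          return int(avg_attention.argmax())             # p = argmax of the averaged weights

      p = behavior_trigger_position("今天天气真不错")
      print(p)  # token index within the tokenized sequence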
  • The above describes how to determine the behavior trigger position of the avatar based on the first coding network. After the behavior trigger position is determined, the behavior content that the avatar needs to present must also be determined.
  • In some embodiments, the behavior content corresponding to the text is determined based on the coding vector that is output from the first coding network and corresponds to the specific symbol. As described above, the first coding network outputs a plurality of first coding vectors, each corresponding to one input vector and incorporating the semantics of each element of its context. Since the specific symbol CLS, which has no obvious semantic information of its own, is inserted into the input provided to the first coding network and integrates the semantic information of each character/word in the text more "fairly", the first coding vector corresponding to the specific symbol is used as the semantic representation of the whole sentence for text classification.
  • FIG. 4 shows a schematic diagram of the input and output of the first coding network and the first classification network in some embodiments of the present application, and FIG. 5 shows the specific process of S103 in FIG. 1. In some embodiments, determining the behavior content includes the following operations.
  • S501: Input the first coding vector that is output from the first coding network and corresponds to the specific symbol into the first classification network.
  • S502: Determine the behavior category corresponding to the text based on the output of the first classification network.
  • The first classification network may be a single-layer neural network or a multi-layer neural network. Moreover, when there are multiple categories to be classified, the first classification network can be adjusted to have a corresponding number of output neurons, whose values are then normalized to the range 0 to 1 through a softmax function. Specifically, the output of the first classification network is a behavior prediction vector with the same dimension as the number of behavior categories, where each element represents the probability that the text corresponds to the respective behavior category. The coding vector h_CLS corresponding to the specific symbol is provided as the input vector to the first classification network, and the first classification network outputs the probability of the text corresponding to each behavior category, for example p = softmax(W·h_CLS + b), where W represents the weights of the network nodes in the first classification network and b is a bias constant. The category i corresponding to the highest probability in p is the behavior category to which the text belongs.
  • In some embodiments, determining the behavior category may include: determining the maximum probability value in the behavior prediction vector; when the maximum probability value is greater than a predetermined threshold, taking the behavior category corresponding to the maximum probability value as the behavior category corresponding to the text; and otherwise, determining a specific category, different from the behavior category corresponding to the maximum probability value, as the behavior category corresponding to the text. In other words, the confidence of the behavior prediction result of the first classification network is further judged. If the maximum probability value is less than the predetermined threshold, the confidence of the behavior prediction result output by the first classification network is considered low; in this case, the prediction result of the first classification network is not used, and the behavior category to which the text belongs is determined as a specific category different from the behavior category corresponding to the maximum probability value, for example a neutral category. If, on the other hand, the maximum probability value is greater than the predetermined threshold, the confidence of the behavior prediction result is considered high, and the prediction result of the first classification network is used. This is sketched in code below.
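  • A minimal sketch of the first classification network and the thresholding logic described above (a single linear layer followed by softmax); the number of categories, the threshold value, and the use of index 0 as the "neutral" fallback category are assumptions for illustration only.

      import numpy as np

      NUM_CATEGORIES = 5   # e.g. neutral, smile, frown, laugh, disdain (illustrative)
      NEUTRAL = 0          # assumed index of the fallback "neutral" category
      THRESHOLD = 0.5      # assumed predetermined confidence threshold

      rng = np.random.default_rng(0)
      W = rng.normal(size=(NUM_CATEGORIES, 768))  # trained weights of the classification layer
      b = np.zeros(NUM_CATEGORIES)                # bias constant

      def classify_behavior(h_cls: np.ndarray) -> int:
          """p = softmax(W @ h_cls + b); fall back to the neutral category
          when the most likely category is below the confidence threshold."""
          logits = W @ h_cls + b
          p = np.exp(logits - logits.max())
          p /= p.sum()                            # behavior prediction vector
          best = int(p.argmax())
          return best if p[best] > THRESHOLD else NEUTRAL

      h_cls = rng.normal(size=768)                # first coding vector for the [CLS] symbol
      print(classify_behavior(h_cls))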
  • S503: Determine the behavior content through a specific behavior mapping, based at least on the behavior category.
  • In some embodiments, the specific behavior mapping includes a behavior mapping table, and the behavior content can be determined from the behavior category by looking up a preset mapping table. That is, determining the behavior content through the specific behavior mapping based at least on the behavior category further includes: looking up, in the behavior mapping table, the behavior content corresponding to the behavior category, and determining it as the behavior content. In addition, for different application scenarios of the avatar, the specific behavior mapping is different. For example, the mapping table corresponding to a news scene will not trigger exaggerated behavior content, as the sketch below illustrates.
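  • The scenario-dependent mapping-table lookup might look like the following sketch; the scenario names, category labels, and behavior contents are invented placeholders and are not taken from the patent's actual tables.

      # Hypothetical scenario-specific behavior mapping tables.
      BEHAVIOR_MAPPING = {
          "news": {
              "smile": "slight_smile",
              "laugh": "slight_smile",     # the news scene avoids exaggerated behavior
              "neutral": "no_behavior",
          },
          "game_companion": {
              "smile": "big_smile",
              "laugh": "laugh_with_head_tilt",
              "neutral": "idle_blink",
          },
      }

      def map_behavior(behavior_category: str, scenario: str) -> str:
          """Look up the behavior content for a category in the scenario's mapping table."""
          table = BEHAVIOR_MAPPING[scenario]
          return table.get(behavior_category, table["neutral"])

      print(map_behavior("laugh", "news"))            # slight_smile
      print(map_behavior("laugh", "game_companion"))  # laugh_with_head_tilt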
  • In summary, the text is provided to the first coding network, and the behavior trigger position is estimated based on the attention mechanism of the first coding network. The output vector of the first coding network is further input into the first classification network, which yields the prediction of the behavior category to which the text belongs. As described above, the BERT model can be used to implement the first coding network.
  • Both the first coding network and the first classification network mentioned above need to be trained. For the first coding network, a large-scale text corpus unrelated to any specific NLP task is usually used for pre-training. The goal is to learn what the language itself should look like, much as language courses teach how to choose and combine known vocabulary to generate fluent text. The pre-training process gradually adjusts the model parameters so that the semantic representation of the text output by the model captures the essence of the language, which facilitates subsequent fine-tuning on specific NLP tasks. For example, a Chinese news corpus of about 200 GB can be used to pre-train a character-based Chinese BERT model.
  • In the present disclosure, the specific NLP task is a text classification task, so the pre-trained BERT model and the first classification network are trained jointly. The focus is on the training of the first classification network, and the BERT model changes very little. This training process is called fine-tuning.
  • Since this is a classification task, supervised learning in machine learning is involved, which means that a labeled data set is needed to train such a model. For example, microblog (Weibo) posts containing Emoji can be crawled as the labeled data set, because the text posted by a user usually contains corresponding Emoji. For instance, for a post accompanied by a smiling Emoji, the smiling expression category can be used as the correct expression category of the text; for a post accompanied by a fist Emoji, the fist action category can be used as the correct action category of the text. The first classification network can then be optimized by minimizing a cross-entropy loss function. A sketch of such a fine-tuning step follows.
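  • A fine-tuning step of this kind could be sketched as follows, assuming PyTorch and the Hugging Face transformers library, which the patent does not prescribe; the learning rates, category count, and batch format are likewise illustrative assumptions.

      import torch
      from torch import nn
      from transformers import BertModel

      class BehaviorClassifier(nn.Module):
          """Pre-trained BERT encoder plus a single-layer classification head."""
          def __init__(self, num_categories: int):
              super().__init__()
              self.encoder = BertModel.from_pretrained("bert-base-chinese")
              self.head = nn.Linear(self.encoder.config.hidden_size, num_categories)

          def forward(self, input_ids, attention_mask):
              out = self.encoder(input_ids=input_ids, attention_mask=attention_mask)
              h_cls = out.last_hidden_state[:, 0]   # coding vector of the [CLS] symbol
              return self.head(h_cls)               # behavior prediction logits

      model = BehaviorClassifier(num_categories=5)
      criterion = nn.CrossEntropyLoss()
      # Smaller learning rate for the encoder: fine-tuning changes BERT very little.
      optimizer = torch.optim.AdamW([
          {"params": model.encoder.parameters(), "lr": 2e-5},
          {"params": model.head.parameters(), "lr": 1e-3},
      ])

      def training_step(batch):
          """batch: dict with input_ids, attention_mask, and Emoji-derived labels."""
          logits = model(batch["input_ids"], batch["attention_mask"])
          loss = criterion(logits, batch["labels"])  # cross-entropy loss
          optimizer.zero_grad()
          loss.backward()
          optimizer.step()
          return loss.item()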
  • As described above, the behavior content may include at least one of action content and expression content; that is, the behavior content may include only action content, only expression content, or both action content and expression content. For example, the action content may include, but is not limited to: making a finger heart, gesturing, mouth curling, yawning, nose picking, and so on. The expression content may include, but is not limited to: smiling, frowning, disdain, laughing, and so on.
  • In the case where the behavior content includes both action content and expression content, the first coding network described above may further include a third coding sub-network corresponding to actions and a fourth coding sub-network corresponding to expressions. In this case, inputting the multiple input vectors into the first coding network and determining the behavior trigger position in the text based on the attention vector of the network node corresponding to the specific symbol further includes: inputting the multiple input vectors into the third coding sub-network, where the third coding sub-network includes at least one layer of network nodes, and determining the action trigger position in the text based on the attention vector of the network node in the third coding sub-network corresponding to the specific symbol; and inputting the multiple input vectors into the fourth coding sub-network, where the fourth coding sub-network includes at least one layer of network nodes, and determining the expression trigger position in the text based on the attention vector of the network node in the fourth coding sub-network corresponding to the specific symbol.
  • The two coding sub-networks have the same number of parameters, but the values of the parameters are different; their specific structure and configuration are similar to the first coding network described above and are not repeated here. Therefore, for the same text, the action trigger position and the expression trigger position obtained from the different coding sub-networks may be different.
  • Correspondingly, the first classification network further includes a third classification sub-network corresponding to actions and a fourth classification sub-network corresponding to expressions. The two classification sub-networks likewise have the same number of parameters but different parameter values, and their specific structure and configuration are similar to the first classification network described above and are not repeated here. In addition, an expression mapping table and an action mapping table can be set in advance; the expression mapping table is then looked up, based on the determined categories, to obtain the corresponding expression content, and the action mapping table is looked up, based on the determined categories, to obtain the corresponding action content.
  • In addition, to determine the behavior content more accurately, the emotion category to which the text belongs can be further determined. In this case, the method according to the present disclosure may further include: inputting the multiple input vectors into a second coding network; inputting the second coding vector that is output from the second coding network and corresponds to the specific symbol into a second classification network; and determining the emotion category based on the output of the second classification network. For example, emotion categories may include, but are not limited to: angry, happy, and so on.
  • The second coding network is similar to the first coding network, and the two networks have the same number of parameters, but the parameter values may be the same or different depending on the situation. For example, when the behavior content includes only expression content, the parameters of the first coding network and the second coding network may be the same; when the behavior content includes only action content, the parameters of the first coding network and the second coding network may be different. The second coding network and the second classification network also need to be trained, and the training method is similar to that described above; for example, the Weibo data with Emoji can likewise be used as labeled data for training the emotion categories.
  • In this case, determining the behavior content through a specific behavior mapping based at least on the behavior category further includes: determining the behavior content through the specific behavior mapping based on both the behavior category and the emotion category. In other words, the emotion category can be regarded as a further independent-variable dimension, in addition to the behavior category, for determining the final behavior content, as sketched below.
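  • One way to realize this, given here only as an assumed sketch, is to run a second encoder/classifier pair for emotion and key the behavior mapping on the (behavior category, emotion category) pair; the labels and the default value are invented for illustration.

      # Hypothetical mapping keyed on (behavior category, emotion category).
      BEHAVIOR_EMOTION_MAPPING = {
          ("smile", "happy"): "big_smile",
          ("smile", "neutral"): "slight_smile",
          ("frown", "angry"): "frown_with_folded_arms",
      }

      def determine_behavior_content(text, behavior_pipeline, emotion_pipeline):
          """behavior_pipeline and emotion_pipeline each wrap their own coding network
          and classification network (the first and second coding/classification
          networks in the terms used above) and return a category label for the text."""
          behavior_category = behavior_pipeline(text)  # e.g. "smile"
          emotion_category = emotion_pipeline(text)    # e.g. "happy"
          return BEHAVIOR_EMOTION_MAPPING.get(
              (behavior_category, emotion_category), "no_behavior")

      # Example with trivial stand-in pipelines:
      print(determine_behavior_content(
          "今天真开心", lambda t: "smile", lambda t: "happy"))  # big_smile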
  • FIG. 6 shows a product flowchart of avatar behavior control according to an embodiment of the present disclosure. In the example of FIG. 6, the behavior content includes both action content and expression content, and the action category, the expression category, and the emotion category, together with the corresponding action trigger position and expression trigger position, are extracted from the text.
  • Specifically, the text is processed by the algorithms described above to obtain the expression, action, and emotion corresponding to each sentence of the text. The expressions and actions can be chosen from the Emoji set, which is widely used at present, and the emotion is the classification of the emotion contained in the text, such as angry or happy. The triggering of expressions and actions is accurate to a character or word; that is, a certain character or word in the text triggers the prescribed action and expression.
  • FIG. 7 shows an example of an expression mapping table. The example shown in FIG. 7 corresponds to a situation with three parameters: action, expression, and emotion. The corresponding existing live-broadcast expression ID represents the expression to be presented by the avatar, and the action ID, expression ID, and emotion ID respectively correspond to the action, expression, and emotion determined based on the text. An illustrative placeholder for the structure of such a table follows.
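  • The structure of such a table can be illustrated as follows; the IDs and the resulting live-broadcast expression IDs are placeholders, since the actual entries of FIG. 7 are not reproduced in this text.

      # Placeholder rows of an expression mapping table in the spirit of FIG. 7:
      # (action ID, expression ID, emotion ID) -> existing live-broadcast expression ID.
      EXPRESSION_MAPPING_TABLE = {
          ("act_wave", "expr_smile", "emo_happy"): "live_expr_012",
          ("act_none", "expr_frown", "emo_angry"): "live_expr_034",
          ("act_fist", "expr_laugh", "emo_happy"): "live_expr_027",
      }

      def lookup_live_expression(action_id, expression_id, emotion_id,
                                 default="live_expr_neutral"):
          """Return the live-broadcast expression ID for the determined categories."""
          return EXPRESSION_MAPPING_TABLE.get(
              (action_id, expression_id, emotion_id), default)

      print(lookup_live_expression("act_wave", "expr_smile", "emo_happy"))  # live_expr_012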
  • FIG. 8 shows a schematic diagram of a behavior generation process according to an embodiment of the present disclosure. In FIG. 8, the behavior includes both actions and expressions, and the action category, the expression category, and the emotion category, together with the corresponding action trigger position and expression trigger position, are extracted from the text. Then, based on the action category, the expression category, and the emotion category, the action content and the expression content that the avatar should present are determined through specific mapping rules. Both the action model and the expression model in FIG. 8 can be implemented by the first coding network and the first classification network described above; the action model, the expression model, and the emotion model differ only in their specific network parameters.
  • In addition, the mapping rules can be combined with the current scene of the avatar for further screening. For example, the mapping rules corresponding to a news scene will not trigger exaggerated actions and expressions. Moreover, although FIG. 8 shows an action model, an expression model, and an emotion model, the present disclosure is not limited thereto: extracting from the text only action categories, only expression categories, action categories and emotion categories, expression categories and emotion categories, action categories and expression categories, and other combination variants are also included in the scope of the present disclosure.
  • Finally, the audio corresponding to the text is played and, when playback reaches the behavior trigger position, the avatar is controlled to present the behavior content. In addition, the triggered behavior can be further fine-tuned. Specifically, controlling the avatar to present the behavior content further includes: adjusting behavior change parameters of the avatar based on the behavior content, so that the avatar changes continuously from not presenting the behavior content to presenting the behavior content. In other words, each behavior change parameter can be adjusted. The adjustable behavior change parameters include, but are not limited to, the behavior appearance time, the behavior end time, and the behavior change coefficient, so as to ensure that each behavior change is naturally coherent and lifelike.
  • The following illustrates the kind of program code used to adjust the behavior change parameters. Taking an expression as an example, it covers the specific adjustable parameter settings, including waiting a predetermined period of time before making the expression, expression fade-in, expression holding time, and expression fade-out, to ensure that every expression change is naturally coherent and anthropomorphic. One of the parameters in the original listing specifies, for example, that the ratio of the degree of expression recovery to the pinched expression coefficient is between 0.1 and 0.3.
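  • Since the original code listing is not reproduced here, the following is a hedged sketch of what such expression change parameter settings and their use might look like; all field names and default values, other than the 0.1 to 0.3 recovery ratio mentioned above, are assumptions.

      from dataclasses import dataclass

      @dataclass
      class ExpressionChangeParams:
          """Illustrative expression change parameters (names are assumptions)."""
          delay_before_start: float = 0.2  # wait a predetermined time before the expression (s)
          fade_in_duration: float = 0.3    # expression fades in gradually (s)
          hold_duration: float = 1.0       # expression holding time (s)
          fade_out_duration: float = 0.4   # expression fades out gradually (s)
          recovery_ratio: float = 0.2      # ratio of expression recovery to the pinched
                                           # expression coefficient, between 0.1 and 0.3

      def expression_weight(t: float, p: ExpressionChangeParams) -> float:
          """Blend weight of the expression at time t (seconds) after the trigger position,
          rising from 0 to 1 and back so the change is continuous rather than abrupt."""
          t0 = p.delay_before_start
          if t < t0:
              return 0.0
          if t < t0 + p.fade_in_duration:
              return (t - t0) / p.fade_in_duration
          if t < t0 + p.fade_in_duration + p.hold_duration:
              return 1.0
          t_out = t - (t0 + p.fade_in_duration + p.hold_duration)
          return max(0.0, 1.0 - t_out / p.fade_out_duration)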
  • The text-based avatar behavior control method has been described in detail above with reference to FIGS. 1 to 8. As can be seen, in the method according to the present disclosure, the avatar is driven by data rather than by a real person to present the corresponding behavior, so it can run uninterruptedly and be personalized for a large number of users. Different categories of data are extracted based on the text and then mapped to the behavior of the avatar, so that the triggered behavior suits the current text and, compared with other technologies, the behavior is rich. Because the behavior of the avatar is determined based on predetermined mapping rules, the scalability is strong and the behavior content can be continuously enriched; at the same time, only the mapping rules need to be updated to enable the avatar to present new behaviors. Furthermore, using the BERT model to implement the first coding network not only makes it possible to estimate the behavior trigger position based on the attention mechanism but also improves the accuracy of text classification. Table 1 below shows the accuracy of the BERT-based text classification model and a CNN-based text classification model in the classification of actions, expressions, and emotions.
  • As shown in FIG. 9, the text-based avatar behavior control device 1000 includes: a vectorization device 1001, a behavior trigger position determining device 1002, a behavior content determining device 1003, and a behavior presentation device 1004. The vectorization device 1001 is used for inserting a specific symbol into the text and generating multiple input vectors corresponding to the specific symbol and to each element in the text, where the specific symbol is a symbol used to represent text classification.
  • As described above, the text is usually a sentence, and the specific symbol may be a CLS (classification) symbol used to represent text classification. The insertion position of the specific symbol in the text may be arbitrary: the specific symbol may be inserted before the text, after the text, or in the middle of the text.
  • First, the vectorization device 1001 segments each element contained in the text. An element can be a character or a word; that is, the text can be segmented in units of characters or in units of words. Then, the vectorization device 1001 converts the specific symbol and each element in the text into a series of vectors capable of expressing the semantics of the text; that is, it maps, or embeds, the specific symbol and each element in the text into another numerical vector space, thereby generating the corresponding multiple input vectors.
  • The behavior trigger position determining device 1002 is configured to input the multiple input vectors into a first coding network, where the first coding network includes at least one layer of network nodes, and to determine the behavior trigger position in the text based on the attention vector of the network node corresponding to the specific symbol, where each element in the attention vector indicates the attention weight from the network node corresponding to the specific symbol to each network node in the same layer as that network node. As described above, the first coding network can be implemented by the BERT model.
  • Since the avatar makes corresponding expressions or actions based on the text, it is necessary not only to determine, based on the text, the specific behavior content that the avatar should present, but also to determine at which element of the text, that is, while the audio corresponding to which character/word is being played, the corresponding behavior should be presented. The element position in the text corresponding to the moment when the avatar presents the corresponding behavior is the behavior trigger position.
  • As described above, contextual character/word information is used to enhance the semantic representation of the target character/word, and a CLS (classification) symbol used to represent text classification is further inserted. The inserted CLS symbol has no obvious semantic information of its own, so it integrates the semantic information of each character/word in the text more "fairly". Therefore, the weight value of each element in the attention vector of the network node corresponding to the CLS symbol can reflect the importance of each character/word in the text: a larger attention weight value indicates a more important character/word.
  • The behavior trigger position determining device 1002 uses the most important character/word position in the text as the behavior trigger position. Since the attention vector of the network node corresponding to the specific symbol can reflect the importance of each character/word in the text, the behavior trigger position determining device 1002 can determine the behavior trigger position in the text based on that attention vector. In some embodiments, the behavior trigger position determining device 1002 is further configured to: for each layer of the first coding network, calculate the attention vector of the node corresponding to the specific symbol in that layer; determine the average of the attention vectors over all layers to obtain an average attention vector; and determine the behavior trigger position based on the index position of the element with the largest value in the average attention vector.
  • The behavior content determining device 1003 is configured to determine the behavior content based on the first coding vector that is output from the first coding network and corresponds to the specific symbol. As described above, the first coding network outputs multiple first coding vectors, each corresponding to one input vector and incorporating the semantics of each element of its context. Since the specific symbol CLS, which has no obvious semantic information of its own, is inserted into the input provided to the first coding network and integrates the semantic information of each character/word in the text more "fairly", the output first coding vector corresponding to the specific symbol is used as the semantic representation of the whole sentence for text classification.
  • Specifically, the behavior content determining device 1003 is further configured to: input the first coding vector that is output from the first coding network and corresponds to the specific symbol into a first classification network; determine the behavior category corresponding to the text based on the output of the first classification network; and determine the behavior content through a specific behavior mapping based at least on the behavior category.
  • The first classification network may be a single-layer neural network or a multi-layer neural network. When there are multiple categories to be classified, the first classification network can be adjusted to have a corresponding number of output neurons, whose values are then normalized to the range 0 to 1 through a softmax function. Specifically, the output of the first classification network is a behavior prediction vector with the same dimension as the number of behavior categories, where each element represents the probability that the text corresponds to the respective behavior category. The behavior content determining device 1003 takes the category corresponding to the maximum probability in the behavior prediction vector as the behavior category to which the text belongs.
  • In some embodiments, the behavior content determining device 1003 is further configured to determine the behavior category based on the output of the first classification network by performing the following processing: determining the maximum probability value in the behavior prediction vector; when the maximum probability value is greater than a predetermined threshold, taking the behavior category corresponding to the maximum probability value as the behavior category corresponding to the text; and otherwise, determining a specific category, different from the behavior category corresponding to the maximum probability value, as the behavior category corresponding to the text. In other words, the behavior content determining device 1003 further judges the confidence of the behavior prediction result of the first classification network. If the maximum probability value is less than the predetermined threshold, the behavior content determining device 1003 considers the confidence of the prediction result to be low; in this case, it does not use the prediction result of the first classification network, but determines the behavior category to which the text belongs as a specific category different from the behavior category corresponding to the maximum probability value, for example a neutral category. If the maximum probability value is greater than the predetermined threshold, the behavior content determining device 1003 considers the confidence of the prediction result to be high and uses the prediction result of the first classification network.
  • Then, the behavior content determining device 1003 determines the behavior content based at least on the behavior category through a specific behavior mapping; for example, the behavior content can be determined from the behavior category by looking up a preset mapping table. As described above, the behavior content may include at least one of action content and expression content, that is, only action content, only expression content, or both. For example, the action content may include, but is not limited to: making a finger heart, gesturing, mouth curling, yawning, nose picking, and so on; the expression content may include, but is not limited to: smiling, frowning, disdain, laughing, and so on.
  • In the case where the behavior content includes both action content and expression content, the first coding network described above may further include a third coding sub-network corresponding to actions and a fourth coding sub-network corresponding to expressions. The two coding sub-networks have the same number of parameters, but the values of the parameters are different; their specific structure and configuration are similar to the coding network described above and are not repeated here. Therefore, for the same text, the action trigger position and the expression trigger position obtained from the different coding sub-networks may be different. Correspondingly, the first classification network further includes a third classification sub-network corresponding to actions and a fourth classification sub-network corresponding to expressions; the two classification sub-networks likewise have the same number of parameters but different parameter values, and their specific structure and configuration are similar to the first classification network described above and are not repeated here. In addition, an expression mapping table and an action mapping table may be set in advance; the behavior content determining device 1003 then looks up the expression mapping table, based on the determined categories, to obtain the corresponding expression content, and looks up the action mapping table, based on the determined categories, to obtain the corresponding action content.
  • the emotional category to which the text belongs may be further determined based on the text.
  • the behavior content determining device 1003 is further configured to: input the multiple input vectors to a second coding network respectively; and output from the second coding network corresponding to the specific symbol The second coding vector of is input to a second classification network; and based on the output of the second classification network, the sentiment category to which the text belongs is determined.
  • the behavior content determining device 1003 is further configured to perform the following processing to determine the behavior content at least based on the behavior category and through a specific behavior mapping: based on the behavior category and the emotion category, through Specific behavior mapping determines the content of the behavior.
  • the emotional category can be regarded as a further dimension of independent variable based on the behavior category to determine the final behavior content.
  • the behavior presentation device 1004 is used to play the audio corresponding to the text and, when playback reaches the behavior trigger position, to control the avatar to present the behavior content (a sketch of this synchronization is given after this list).
  • considering that the behaviors (for example, expressions) of a real person change continuously and naturally while speaking, the behavior presentation device 1004 can further make fine adjustments to the triggered behavior when controlling the avatar to present the behavior content.
  • the behavior presentation device 1004 may be further configured to adjust the behavior change parameters of the avatar based on the behavior content, so that the avatar changes continuously from not presenting the behavior content to presenting the behavior content.
  • adjustable behavior change parameters include, but are not limited to, the behavior appearance time, the behavior end time, and the behavior change coefficient, so as to ensure that each behavior change is naturally coherent and lifelike (a sketch of such parameters is given after this list).
  • the avatar is driven by data rather than by a real person to present the corresponding behaviors, so it can run without interruption and be personalized for different users ("a thousand faces for a thousand people").
  • different categories of data are extracted from the text and then mapped to the behaviors of the avatar, so that the triggered behavior suits the current text and, compared with other technologies, the behaviors are rich.
  • because the behavior presented by the avatar is determined by predetermined mapping rules, the scheme is highly extensible: the behavior content can be continuously enriched, and only the mapping rules need to be updated for the avatar to present newly added behaviors.
  • the use of the BERT model to implement the coding network can not only estimate the behavior trigger position based on the attention mechanism, but also improve the accuracy of text classification.
  • since the avatar behavior control device corresponds exactly to the avatar behavior control method described above, many details are not repeated in the description of the device. Those skilled in the art can understand that all the details of the avatar behavior control method described above apply similarly to the avatar behavior control device.
  • the method or device according to the embodiment of the present disclosure may also be implemented with the aid of the architecture of the computing device 1100 shown in FIG. 10.
  • the computing device 1100 may include a bus 1110, one or more CPUs 1120, a read-only memory (ROM) 1130, a random access memory (RAM) 1140, a communication port 1150 connected to a network, input/output components 1160, a hard disk 1170, and so on.
  • the storage device in the computing device 1100, such as the ROM 1130 or the hard disk 1170, may store various data or files used in the processing and/or communication of the avatar behavior control method provided by the present disclosure, as well as the program instructions executed by the CPU.
  • the architecture shown in FIG. 10 is only exemplary. When implementing different devices, one or more components in the computing device shown in FIG. 10 may be omitted according to actual needs.
  • the embodiments of the present disclosure can also be implemented as a computer-readable storage medium.
  • Computer-readable instructions are stored on the computer-readable storage medium according to an embodiment of the present disclosure.
  • when the computer-readable instructions are run by a processor, the avatar behavior control method according to the embodiments of the present disclosure, described with reference to the above drawings, can be executed.
  • the computer-readable storage medium includes, but is not limited to, for example, volatile memory and/or non-volatile memory.
  • the volatile memory may include random access memory (RAM) and/or cache memory (cache), for example.
  • the non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, and the like.
  • the avatar behavior control method and device have been described in detail with reference to FIGS. 1 to 10.
  • in the avatar behavior control method and device according to the embodiments of the present disclosure, the avatar is driven to present the corresponding behaviors by data rather than by a real person, so that it can run without interruption and be personalized for different users.
  • different categories of data are extracted from the text and then mapped to the behaviors of the avatar, so that the triggered behavior suits the current text and, compared with other technologies, the behaviors are rich.
  • because the behavior presented by the avatar is determined by predetermined mapping rules, the scheme is highly extensible: the behavior content can be continuously enriched, and only the mapping rules need to be updated for the avatar to present newly added behaviors.
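The following is a minimal sketch, in Python, of how a behavior trigger position can be derived from the attention vectors of the network node corresponding to the inserted special symbol, as referenced in the list above: that node's attention vectors are averaged over all layers and the index of the largest element is taken. The tensor layout, the masking of the symbol's own position, and the toy numbers are illustrative assumptions, not part of the disclosure.

import numpy as np

def trigger_position(cls_attention: np.ndarray) -> int:
    """Estimate the behavior trigger position from attention weights.

    cls_attention has shape (num_layers, seq_len): row l holds the attention
    weights from the node corresponding to the special symbol to every node
    in layer l (position 0 is assumed to be the special symbol itself).
    """
    avg = cls_attention.mean(axis=0)   # average attention vector over all layers
    avg[0] = -np.inf                   # assumption: ignore the symbol's own position
    return int(np.argmax(avg)) - 1     # index of the most important text element

# Toy example: 2 layers, an input of [special symbol] + 5 text elements.
attn = np.array([[0.4, 0.1, 0.1, 0.2, 0.1, 0.1],
                 [0.3, 0.1, 0.1, 0.3, 0.1, 0.1]])
print(trigger_position(attn))  # -> 2, i.e. the third text element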
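The confidence check described in the list above (comparing the maximum probability in the behavior prediction vector with a predetermined threshold and falling back to a neutral category) could look roughly as follows. The category names and the threshold value of 0.5 are illustrative assumptions.

import numpy as np

# Hypothetical behavior categories; "neutral" plays the role of the specific
# category used when the prediction confidence is low.
BEHAVIOR_CATEGORIES = ["smile", "frown", "laugh", "disdain", "neutral"]

def select_behavior_category(prediction: np.ndarray, threshold: float = 0.5) -> str:
    """Return the behavior category for a softmax prediction vector.

    If the maximum probability does not exceed the threshold, the prediction
    is treated as low-confidence and the neutral category is returned.
    """
    best = int(np.argmax(prediction))
    if prediction[best] > threshold:
        return BEHAVIOR_CATEGORIES[best]
    return "neutral"

print(select_behavior_category(np.array([0.70, 0.10, 0.10, 0.05, 0.05])))  # -> smile
print(select_behavior_category(np.array([0.30, 0.25, 0.20, 0.15, 0.10])))  # -> neutral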
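The mapping-table lookup referenced in the list above, which turns a behavior category (optionally combined with an emotion category) into concrete behavior content and which differs per application scene, could be sketched as follows. All scene names, categories and table entries are made-up placeholders.

# Hypothetical per-scene mapping tables:
# (behavior category, emotion category) -> behavior content.
BEHAVIOR_MAPPING = {
    "news": {   # a news scene maps to relatively restrained content
        ("laugh", "happy"): {"expression": "smile", "action": None},
        ("frown", "angry"): {"expression": "slight_frown", "action": None},
    },
    "game": {   # a game scene may trigger richer, more exaggerated content
        ("laugh", "happy"): {"expression": "laugh", "action": "finger_heart"},
        ("frown", "angry"): {"expression": "frown", "action": "pout"},
    },
}

def map_behavior(scene: str, behavior_category: str, emotion_category: str) -> dict:
    """Look up the behavior content for the given scene and categories,
    falling back to an empty behavior when no entry exists."""
    table = BEHAVIOR_MAPPING.get(scene, {})
    return table.get((behavior_category, emotion_category),
                     {"expression": None, "action": None})

print(map_behavior("news", "laugh", "happy"))  # restrained content for the news scene
print(map_behavior("game", "laugh", "happy"))  # richer content for the game scene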
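The playback step referenced in the list above (play the audio for the text and present the behavior content when playback reaches the trigger position) is sketched below. The assumption that per-element start times are available from the text-to-speech engine, and the use of a simple polling loop, are illustrative simplifications.

import time

def present_with_behavior(element_times, trigger_index, behavior_content, total_duration):
    """Simulate audio playback and trigger the behavior at the right moment.

    element_times[i] is the assumed start time (in seconds) of the i-th text
    element in the synthesized audio; the behavior content is presented when
    playback reaches the trigger element.
    """
    start = time.monotonic()
    triggered = False
    while time.monotonic() - start < total_duration:
        elapsed = time.monotonic() - start
        if not triggered and elapsed >= element_times[trigger_index]:
            print(f"t={elapsed:.2f}s: present behavior {behavior_content}")
            triggered = True
        time.sleep(0.01)

present_with_behavior(element_times=[0.0, 0.3, 0.6, 0.9, 1.2],
                      trigger_index=2,
                      behavior_content={"expression": "smile", "action": None},
                      total_duration=1.5)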
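Finally, a sketch of the adjustable behavior change parameters referenced in the list above (behavior appearance time, end time and change coefficient). The ranges mirror those in the configuration snippet shown later in this document but remain illustrative; sampling them randomly makes each presentation of a behavior slightly different and therefore more natural.

import random
from dataclasses import dataclass

@dataclass
class BehaviorTimeline:
    delay: float      # wait before the behavior appears (s)
    fade_in: float    # time to go from no behavior to full behavior (s)
    peak: float       # behavior change coefficient at the peak (0..1)
    hold: float       # how long the behavior is held (s)
    fade_out: float   # time to fade the behavior back out (s)

def sample_timeline() -> BehaviorTimeline:
    """Randomly sample behavior change parameters inside illustrative ranges."""
    return BehaviorTimeline(
        delay=random.uniform(0.0, 0.5),
        fade_in=random.uniform(0.3, 0.5),
        peak=random.uniform(0.75, 1.0),
        hold=random.uniform(0.5, 1.0),
        fade_out=random.uniform(0.3, 0.5),
    )

print(sample_timeline())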

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Multimedia (AREA)
  • Probability & Statistics with Applications (AREA)
  • Computational Linguistics (AREA)
  • Mathematical Physics (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Processing Or Creating Images (AREA)
  • Information Transfer Between Computers (AREA)
  • Machine Translation (AREA)
  • User Interface Of Digital Computer (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

Disclosed are a text-based avatar behavior control method, device, and medium. The method includes: inserting a specific symbol into a text and generating a plurality of input vectors corresponding to the specific symbol and to each element of the text; inputting the plurality of input vectors into a first coding network, and determining a behavior trigger position in the text based on an attention vector of the network node corresponding to the specific symbol; determining behavior content based on a first coding vector that is output from the first coding network and corresponds to the specific symbol; and playing audio corresponding to the text and, when playback reaches the behavior trigger position, controlling the avatar to present the behavior content.

Description

基于文本的虚拟形象行为控制方法、设备和介质
本申请要求于2019年9月23日提交中国专利局、申请号为201910898521.6,发明名称为“基于文本的虚拟形象行为控制方法、设备和介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本公开涉及人工智能的技术领域,更具体地说,涉及基于文本的虚拟形象行为控制方法、设备和介质。
背景技术
随着人工智能(Artificial Intelligence,AI)各方向不同能力的发展,大众已渐渐不满足于在实际场景中只应用某个AI能力,因此对于AI综合能力应用场景的探索也在不断推进。近些年,虚拟形象作为AI综合能力的一种展示方式,不断被大众提及。虚拟形象是指通过计算机技术,将人体结构数字化,在电脑屏幕上出现看得见的、能够调控的虚拟形象体形态。虚拟形象可以是基于真实人得到的形象,也可以是基于卡通人物得到的形象。学术界和工业界都在尝试用不同的方式构造一个能够24小时服务大众和娱乐大众的虚拟形象。
技术内容
本申请实施例提供了一种基于文本的虚拟形象行为控制方法、设备和介质,其能够在无真人驱动的情况下,控制虚拟形象做出与文本相适应的、类似真人的表情和动作。
根据本公开的一个方面,提供了一种基于文本的虚拟形象行为控制 方法,包括:在文本中插入特定符号,并生成与所述特定符号和文本中的各个元素对应的多个输入向量;所述特定符号为用于表示文本分类的符号;将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置,其中,所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到与该网络节点同一层中的每一个网络节点的注意力权重;基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容;以及播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
根据本公开的另一方面,提供了一种基于文本的虚拟形象行为控制设备,包括:向量化装置,用于在文本中插入特定符号,并生成与所述特定符号和文本中的各个元素对应的多个输入向量,所述特定符号为用于表示文本分类的符号;行为触发位置确定装置,用于将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置,其中,所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到与该网络节点同一层中的每一个网络节点的注意力权重;行为内容确定装置,用于基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容;以及行为呈现装置,用于播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
另外,在根据本公开的设备中,所述行为触发位置确定装置进一步被配置为:针对所述第一编码网络的每一层,计算该层中与所述特定符号对应的节点的注意力向量,确定所有层的注意力向量的平均值,以得 到平均注意力向量;以及基于所述平均注意力向量中数值最大的元素的索引位置,确定所述行为触发位置。
另外,在根据本公开的设备中,所述第一编码网络输出与各输入向量对应的、融合了上下文各个元素的语义的多个第一编码向量,并且其中所述行为内容确定装置进一步被配置为:将从所述第一编码网络输出的、与所述特定符号对应的第一编码向量输入至第一分类网络;基于所述第一分类网络的输出,确定所述文本对应的行为类别;以及至少基于所述行为类别,通过特定的行为映射,确定所述行为内容。
另外,在根据本公开的设备中,所述特定的行为映射包括行为映射表,并且其中至少基于所述行为类别,通过特定的行为映射,确定所述行为内容进一步包括:在所述行为映射表中,查找与所述行为类别对应的行为内容,并将其确定为所述行为内容。
另外,在根据本公开的设备中,针对所述虚拟形象的不同应用场景,所述特定的行为映射是不同的。
另外,在根据本公开的设备中,所述第一分类网络的输出为行为预测向量,所述行为预测向量的维度与行为类别的数目相同,其中所述行为预测向量的每一个元素表示所述文本对应于相应的行为类别的概率值。
另外,在根据本公开的设备中,所述行为内容确定装置进一步被配置为通过执行以下处理来实现基于所述第一分类网络的输出,确定所述文本对应的行为类别:确定所述行为预测向量中的最大概率值;以及当所述最大概率值大于预定阈值时,将所述最大概率值对应的行为类别作为与所述文本对应的行为类别,否则,将与所述最大概率值对应的行为类别不同的特定类别确定为与所述文本对应的行为类别。
另外,在根据本公开的设备中,所述行为内容确定装置进一步被配 置为:将所述多个输入向量分别输入至第二编码网络;将从所述第二编码网络输出的、与所述特定符号对应的第二编码向量输入至第二分类网络;以及基于所述第二分类网络的输出,确定所述文本对应的情感类别,其中所述行为内容确定装置进一步被配置为通过执行以下处理来实现至少基于所述行为类别,通过特定的行为映射,确定所述行为内容:基于所述行为类别和所述情感类别,通过特定的行为映射,确定所述行为内容。
另外,在根据本公开的设备中,所述行为内容包括动作内容和表情内容中的至少一个。
另外,在根据本公开的设备中,当所述行为内容包括动作内容和表情内容二者时,所述第一编码网络包括第三编码子网络和第四编码子网络,并且其中所述行为触发位置确定装置进一步被配置为:将所述多个输入向量分别输入至第三编码子网络,其中所述第三编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第三编码子网络中的网络节点的注意力向量,确定所述文本中的表情触发位置;以及将所述多个输入向量分别输入至第四编码子网络,其中所述第四编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第四编码子网络中的网络节点的注意力向量,确定所述文本中的动作触发位置。
另外,在根据本公开的设备中,所述行为呈现装置进一步被配置为:基于所述行为内容,调整所述虚拟形象的行为变化参数,使得所述虚拟形象从不呈现行为内容连贯地变化到呈现所述行为内容。
另外,在根据本公开的设备中,所述行为变化参数至少包括以下之一:行为出现时间、行为结束时间、行为变化系数。
根据本公开的再一方面,公开了一种计算机设备,包括:
处理器;
与所述处理器相连接的存储器;所述存储器中存储有机器可读指令;所述机器可读指令在被处理器执行时,使得所述处理器执行如上文中所述的方法。
根据本公开的再一方面,公开了一种计算机可读存储介质,其上存储有机器可读指令,所述机器可读指令在被处理器执行时,使得所述处理器执行如上文中所述的方法。
附图简要说明
图1是图示根据本公开实施例的、基于文本的虚拟形象行为控制方法的具体过程的流程图;
图2是本申请一些实施例中所述第一编码网络的内部结构的示意图;
图3是本申请一些实施例中注意力机制的示意图;
图4示出了本申请一些实施例中第一编码网络和第一分类网络的输入输出示意图;
图5是示出了图1中的S103的具体过程的流程图;
图6是示出了根据本公开的一种实施例的虚拟形象行为控制的产品流程图;
图7示出了本申请一些实施例中表情映射表的一种示例;
图8示出了根据本公开的一种实施例的行为生成流程的示意图;
图9是图示根据本公开的实施例的基于文本的虚拟形象行为控制设备的配置的功能性框图;以及
图10是示出了根据本公开实施例的一种示例性的计算设备的架构的示意图。
具体实施方式
下面将参照附图对本申请的各个实施方式进行描述。提供以下参照附图的描述,以帮助对由权利要求及其等价物所限定的本申请的示例实施方式的理解。其包括帮助理解的各种具体细节,但它们只能被看作是示例性的。因此,本领域技术人员将认识到,可对这里描述的实施方式进行各种改变和修改,而不脱离本申请的范围和精神。而且,为了使说明书更加清楚简洁,将省略对本领域熟知功能和构造的详细描述。
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。人工智能软件技术主要包括计算机视觉技术、语音处理技术、自然语言处理技术以及机器学习/深度学习等几大方向。
机器学习(Machine Learning,ML)是一门多领域交叉学科,涉及概率论、统计学、逼近论、凸分析、算法复杂度理论等多门学科。专门研究计算机怎样模拟或实现人类的学习行为,以获取新的知识或技能,重新组织已有的知识结构使之不断改善自身的性能。机器学习是人工智能的核心,是使计算机具有智能的根本途径,其应用遍及人工智能的各个领域。机器学习和深度学习通常包括人工神经网络、置信网络、强化学习、迁移学习、归纳学习、式教学习等技术。
本申请实施例提供的方案涉及人工智能的机器学习等技术,具体通过以下实施例进行说明。
通常,构造虚拟形象的技术方案主要分为两大类。一类是真人驱动的方法。具体来说,通过动作捕获设备,捕捉真人演员的身体和表情的数据,然后使用该数据去驱动一个3D或2D虚拟形象对这些动作和表情进行展示。第二类是数据驱动的方法。具体来说,通过TTS(Text To  Speech)的方式,使虚拟形象朗读输入的文本内容。然而,虚拟形象并没有任何的表情和动作展示,这仅能适用于新闻主持等极少需要表情和动作的场景。
这些虚拟形象驱动方式或者是有明显的人为驱动痕迹,或者避免动作表情等较为个性化的行为部分,都难以达到在背后无人驱动的情况下,基于文本控制虚拟形象呈现类似真人的行为。
在根据本公开的虚拟形象行为控制方法和设备中,通过数据驱动而非真人来驱动虚拟形象呈现相应的行为,因此可不间断运行且做到千人千面。并且基于文本提取不同的类别数据,再映射到虚拟形象的行为上,使得触发的行为是适合当前文本的,且与其他技术相比,该行为是丰富的。此外,由于基于预定的映射规则来确定虚拟形象呈现的行为,因此可拓展性强,可以不断地丰富行为内容,同时只需要更新映射规则就能使得虚拟形象呈现新增的行为。
将参照图1描述根据本公开的实施例的、基于文本的虚拟形象行为控制方法的具体过程。例如,虚拟形象的具体表现形式可以是与真人相同的替身形象,也可以是完全虚拟的卡通形象。举例而言,在新闻播报的应用场景中,虚拟形象是与真实播音员相同的替身形象。作为新闻主播的虚拟形象不仅可以基于文本在短时间内生成新闻播报视频,并且能保证播报新闻内容的“零失误”,无论各种场景都能快速上岗,还能24小时不间断播报,助力媒体行业效率提升。或者,在虚拟游戏的应用场景中,作为不同游戏角色的卡通形象可以基于文本而展现丰富的行为,并且能够24小时不间断地执行其角色任务,如24小时的游戏讲解、24小时的陪聊等。
如图1所示,所述方法可以由电子设备执行,包括以下操作。
S101,在文本中插入特定符号,并生成与所述特定符号和文本中的 各个元素对应的多个输入向量。
这里,文本通常为一句话。在一些实施例中,所述特定符号可以是用于表示文本分类的CLS(Classification)符号,这里,S101中插入的特定符号可以是CLS符号对应的原始向量。并且,所述特定符号在所述文本中的插入位置可以是任意的。例如,可以将所述特定符号插入在所述文本之前,也可以将所述特定符号插入在所述文本之后,或者也可以将所述特定符号插入在所述文本的中间。
在插入特定符号之后,分割所述文本中包含的各个元素。例如,所述元素可以是字,也可以是词。也就是说,可以以字为单位,对文本进行分割。或者,也可以以词为单位,对文本进行分割。然后,所述特定符号和文本中的各个元素转换为一系列能够表达文本语义的向量,即:将所述特定符号和文本中的各个元素映射或嵌入到另一个数值向量空间,从而生成对应的多个输入向量。
S102,将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。其中,与所述特定符号对应的网络节点的所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到同一层中的每一个网络节点的注意力权重。
图2示出了本申请一些实施例中所述第一编码网络的内部结构的示意图。所述第一编码网络的输入是在S101中得到的各个字/词/特定符号的原始向量,输出是各个字/词/特定符号融合了全文语义信息后的向量表示。例如,对于第一层中的第一个网络节点而言,计算与该网络节点对应的第一个元素的输入向量与其上下文各个元素的输入向量的加权和,作为该网络节点的编码向量,并且将该编码向量作为输入提供至第 二层中的第一个网络节点,直至最后一层的第一个网络节点,以得到最终的融合了全文语义信息后的第一编码输出。在图2中,所述第一编码网络包括多层网络节点。当然,本公开并不仅限于此。所述第一编码网络也可以仅包括一层网络节点。
例如,作为一种可能的实施方式,所述第一编码网络可以通过BERT(Bidirectional Encoder Representations from Transformer)模型来实现。BERT模型的目标是利用大规模无标注语料训练、获得文本的包含丰富语义信息的语义表示(Representation),然后将文本的语义表示在特定自然语言处理(Natural Language Processing,NLP)任务中作微调,最终应用于该NLP任务。
因此,BERT模型的输入是在S101中得到的文本中各个字/词的原始词向量,输出是文本中的各个字/词融合了全文语义信息后的向量表示。
BERT模型是基于注意力(attention)机制的模型。注意力机制的主要作用是让神经网络把“注意力”放在一部分输入上,即:区分输入的不同部分对输出的影响。这里,将从增强字/词的语义表示的角度来理解注意力机制。
一个字/词在一句文本中表达的意思通常与它的上下文有关。比如:光看“鹄”字,我们可能会觉得很陌生,而看到它的上下文“鸿鹄之志”后,就对它马上熟悉了起来。因此,字/词的上下文信息有助于增强其语义表示。同时,上下文中的不同字/词对增强语义表示所起的作用往往不同。比如在上面这个例子中,“鸿”字对理解“鹄”字的作用最大,而“之”字的作用则相对较小。为了有区分地利用上下文的字/词信息增强目标字/词的语义表示,就可以用到注意力机制。
图3示出了本申请一些实施例中注意力机制的示意图。在图3中,以输入的第一个元素(字、词、或特定符号)为例,描述注意力机制的 计算过程。
如图3所示,将输入的第一个元素作为目标元素,并且将与第一个元素对应的第一层编码网络中的第一个网络节点作为目标网络节点。注意力机制将目标元素和上下文各个元素的语义向量表示作为输入,首先通过特定的矩阵变换获得目标元素的Query向量、上下文各个元素的Key向量以及目标元素与上下文各个元素的原始Value。具体来说,对于目标元素,基于训练后的变换矩阵W Q创建Query向量,并且对于目标元素与上下文各个元素,分别基于训练后的变换矩阵W K和W V创建Key向量和Value向量。例如,这些向量是通过将输入向量与3个训练后的变换矩阵W Q、W K、W V相乘得到的。假设提供至第一编码网络的输入为X=(x 1,x 2,……,x n),其中第一个元素的向量为x 1,那么与x 1对应的Query向量q 1、上下文各个元素的Key向量k i以及目标元素与上下文各个元素的原始Value向量v i可以按照以下公式来计算:
q_1 = x_1 × W_Q
k_i = x_i × W_K
v_i = x_i × W_V
其中i为从1到n的整数。
然后,基于Query向量和Key向量,计算第一层编码网络中的第一个网络节点(即,目标网络节点)的注意力向量A_1=(a_11, a_12, ……, a_1n)。其中,目标网络节点的注意力向量A_1中的每一个元素分别指示从目标网络节点到上下文各个网络节点(即,同一层中的每一个网络节点)的注意力权重。例如,a_1i表示在第一层编码网络中从第一个网络节点到同一层中第i个网络节点的注意力权重。a_1i可以通过将q_1乘以k_i,然后再通过softmax函数归一化而得到,即:
a_1i = softmax_i(q_1·k_i)
最后,基于注意力向量A_1与Value向量V,得到目标元素的注意力输出。例如,目标网络节点的注意力输出o_1可以按照以下公式计算:
o_1 = Σ_{i=1}^{n} a_1i·v_i
也就是说,以与目标网络节点对应的注意力向量作为权重,加权融合向所述目标网络节点输入的目标元素的Value向量和上下文各个元素的Value向量,作为目标网络节点的编码输出,即:目标元素的增强语义向量表示。
图3中所示的注意力输出对应于图2中的第一层编码网络中的第一个网络节点的编码输出。在所述第一编码网络仅具有一层网络节点的情况下,图3中所示的注意力输出即为与输入的第一个元素对应的最终编码输出。在所述第一编码网络具有多层网络节点的情况下,将图3中所示的第一层的第一个网络节点的注意力输出作为输入提供至第二层编码网络的第一个网络节点,并且按照类似的方法,得到第二层编码网络的第一个网络节点的编码输出。然后,逐层地重复类似的处理,直至最后一层。在最后一层编码网络中的第一个网络节点的编码输出即为与输入的第一个元素对应的最终编码输出。
可见,在所述第一编码网络具有多层网络节点的情况下,对于输入的目标元素,在每一层中都计算与目标元素对应的网络节点的注意力向量。在当前层中,以与目标元素对应的网络节点的注意力向量作为权重,对输入到该层的所有向量进行加权求和,并将得到的加权和作为融合了上下文语义的、当前层的输出编码向量。然后,当前层的输出进一步作为下一层的输入,并重复相同的处理。也就是说,假设第一编码网络共有L层,且目标元素为输入的第一个元素,那么将得到与目标元素对应的L个注意力向量
A_1^(1), A_1^(2), ……, A_1^(L)。所述L个注意力向量分别对应于L层编码网络。
然后,基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。其中,与所述特定符号对应的网络节点的所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到同一层中的每一个网络节点的注意力权重。
例如,假设将所述特定符号插入在所述文本之前,那么与所述特定符号对应的网络节点即为每一层编码网络中的第一个网络节点,并且与所述特定符号对应的网络节点的注意力向量包括每一层中第一个网络节点的注意力向量。
这里,需要说明的是,如将要在下文中描述的那样,行为可以包括动作和表情中的至少一个。由于虚拟形象是基于文本来做出对应的表情或动作,因此不仅需要基于文本,确定虚拟形象应该呈现的行为的具体内容,而且还需要确定虚拟形象应该在播放至文本的哪一个元素(字/词)所对应的音频时呈现相应的行为。与虚拟形象呈现相应行为的时刻对应的、文本中的元素位置就是行为触发位置。
如上文中所述,在BERT模型中,基于注意力机制,利用上下文的字/词信息增强目标字/词的语义表示。并且,在根据本公开的BERT模型中,还进一步插入了用于表示文本分类的CLS(Classification)符号。与文本中包括的其他字/词相比,插入的CLS符号不具有明显的语义信息。从而,这个无明显语义信息的符号将会更“公平”地融合文本中各个字/词的语义信息。因此,与CLS符号对应的网络节点的注意力向量中各元素的权重值可以体现文本中各个字/词的重要性。如果注意力权重值越大,则表明对应的字/词的重要性越高。
在根据本公开的方法中,认为在文本中重要性最高的字/词位置处,控制虚拟形象呈现相应的行为是合适的。因此,将文本中重要性最高的 字/词位置作为行为触发位置。由于与所述特定符号对应的网络节点的注意力向量能够体现文本中各个字/词的重要性,因此可以基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。
具体来说,当第一编码网络仅具有一层网络节点时,基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。假设所述特定符号对应于第一个输入向量,因此与所述特定符号对应的网络节点为第一个网络节点。并且,假设第一个网络节点的注意力向量A_1=(a_11, a_12, ……, a_1n),那么可以按照以下公式计算行为触发位置p:
p = argmax_i(a_1i)
其中,该公式表示将a_1i取得最大值时的索引i赋予p。
当第一编码网络具有多层网络节点时,S102中基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置进一步包括:计算所述第一编码网络的所有层中与所述特定符号对应的节点到每一个节点的注意力向量的平均值,以得到平均注意力向量;以及基于所述平均注意力向量中数值最大的元素的索引位置,确定所述行为触发位置。
具体来说,如上文中所述,当第一编码网络具有多层网络节点时,在每一层中都存在一个与所述特定符号对应的网络节点,并且在每一层中都计算与所述特定符号对应的网络节点的注意力向量。假设第一编码网络共有L层,那么将得到与所述特定符号对应的L个网络节点的L个注意力向量
A_1^(1), A_1^(2), ……, A_1^(L)。在这种情况下,首先对这L个注意力向量求平均,以获得平均注意力向量Ā_1=(ā_11, ā_12, ……, ā_1n):
Ā_1 = (1/L)·Σ_{l=1}^{L} A_1^(l)
然后,按照如下公式确定行为触发位置:
p = argmax_i(ā_1i)
其中,该公式表示将ā_1i取得最大值时的索引i赋予p。
在上文中描述了如何基于第一编码网络确定虚拟形象的行为触发位置。在确定出虚拟形象的行为触发位置之后,还需要确定虚拟形象需要呈现的行为内容。
在S103,基于从所述第一编码网络输出的、与所述特定符号对应的编码向量,确定所述文本对应的行为内容。
如上文中所述,所述第一编码网络输出与各输入向量对应的、融合了上下文各个元素的语义的多个第一编码向量。由于在提供至第一编码网络的输入中插入了无明显语义信息的特定符号CLS,并且这个无明显语义信息的符号会更“公平”地融合文本中各个字/词的语义信息,因此将与该特定符号对应的第一编码向量作为整句文本的语义表示,以便用于文本分类。
图4示出了本申请一些实施例中第一编码网络和第一分类网络的输入输出示意图。并且,图5示出了图1中的S103的具体过程。
如图5所示,基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容进一步包括以下操作。
S501,如图4所示,将从所述第一编码网络输出的、与所述特定符号对应的第一编码向量h_CLS输入至第一分类网络(前馈神经网络+softmax)。所述第一分类网络可以是单层的神经网络,也可以是多层的神经网络。并且,当需要分类的类别有多种时,可以调整第一分类网络,使其具有更多的输出神经元,然后通过softmax函数归一化为取值范围从0到1的数值。具体地,所述第一分类网络的输出ŷ为与行为的类别数目相同维度的行为预测向量,其中每一个元素表示所述文本对应于相应的行为类别的概率值。
假设文本序列为X=(x_1, x_2, …, x_n),其中x_i为句子X中的第i个元素(字/词),并且在文本之前插入CLS符号,那么将CLS符号和文本所对应的向量输入到BERT模型中,可以获得与CLS符号对应的输出向量:
h_CLS = BERT(X)[0]
S502,基于所述第一分类网络的输出ŷ,确定行为类别。具体地,将h_CLS作为输入向量提供至第一分类网络,并且第一分类网络可以输出文本对应于每一类行为类别的概率值:
ŷ = softmax(W·h_CLS + b)
其中,W表示第一分类网络中的网络节点权重,b为偏移常数。ŷ中最大概率对应的类别i即为文本所属的行为类别。在图4中,示出了第5个元素的概率值最大的情况,即:i=5。
或者,作为另一种可能的实施方式,基于所述第一分类网络的输出,确定行为类别可以包括:确定所述行为预测向量中的最大概率值;当所述最大概率值大于预定阈值时,将所述最大概率值对应的行为类别作为与所述文本对应的行为类别,否则,将与所述最大概率值对应的行为类别不同的特定类别确定为与所述文本对应的行为类别。
也就是说,在确定文本所属的行为类别时,进一步判断第一分类网络的行为预测结果的置信度。如果最大概率值max_i(ŷ_i)小于预定阈值,则认为第一分类网络输出的行为预测结果的置信度低。在这种情况下,不采用第一分类网络的预测结果,而是将文本所属的行为类别确定为与所述最大概率值对应的行为类别不同的特定类别。例如,所述特定类别可以是中性类别。另一方面,如果最大概率值max_i(ŷ_i)大于预定阈值,则认为第一分类网络输出的行为预测结果的置信度高。在这种情况下,采用第一分类网络的预测结果。
S503,至少基于所述行为类别,通过特定的行为映射,确定所述行为内容。例如,所述特定的行为映射包括行为映射表。可以通过查找预先设置的映射表,基于行为类别,确定所述行为内容。具体来说,至少基于所述行为类别,通过特定的行为映射,确定所述行为内容进一步包括:在所述行为映射表中,查找与所述行为类别对应的行为内容,并将其确定为所述行为内容。
其中,针对所述虚拟形象的不同应用场景,所述特定的行为映射是不同的。例如,与新闻场景对应的映射表将不会触发较为夸张的行为内容。
在上文中,详细描述了将文本提供至第一编码网络,并且基于第一编码网络的注意力机制,估计行为触发位置。同时,进一步将第一编码网络的输出向量输入至第一分类网络,并从第一分类网络得到文本所属的行为类别的预测结果。例如,可以采用BERT模型来实现所述第一编码网络。
上述第一编码网络、第一分类网络都是需要训练的。
对于BERT模型而言,通常采用大规模、与特定NLP任务无关的文本语料进行预训,其目标是学习语言本身应该是什么样的。这就好比我们学习语文、英语等语言课程时,都需要学习如何选择并组合我们已经掌握的词汇来生成一篇通顺的文本。回到BERT模型上,其预训过程就 是逐渐调整模型参数,使得模型输出的文本语义表示能够刻画语言的本质,便于后续针对具体NLP任务作微调。例如,可以采用200G左右的中文新闻语料进行基于字的中文BERT模型的预训。
在本公开中,具体NLP任务为文本分类任务。在这种情况下,完成预训的BERT模型和第一分类网络进行联合训练。在该联合训练阶段,重点在于第一分类网络的训练,而对BERT模型的改动非常小,这种训练过程成为微调(fine-tuning)。在第一分类网络的训练过程中,涉及到的是机器学习中的监督学习。这意味着需要一个标记好的数据集来训练这样的模型。作为一种可能的实施方式,可以抓取带有Emoji标记的微博数据作为标记好的数据集。具体来说,在微博数据中,用户发布的文本中通常会带有对应的Emoji表情。例如,如果一句文本中带有微笑的Emoji表情,那么可以将微笑的Emoji表情类别作为该文本的正解表情类别。又如,如果一句文本中带有抱拳的Emoji动作,那么可以将抱拳的Emoji动作类别作为该文本的正解表情类别。此外,与其他分类网络的训练类似地,第一分类网络的优化可以通过最小化交叉熵损失函数获得。
这里,需要指出的是,所述行为内容可以包括动作内容和表情内容中的至少一个。例如,所述行为内容可以仅包括动作内容,也可以仅包括表情内容,或者可以既包括动作内容也包括表情内容。例如,动作内容可以包括但不限于:比心、作揖、撇嘴、打哈欠、挖鼻等。表情内容可以包括但不限于:微笑、皱眉、不屑、大笑等。
在所述行为内容既包括动作内容也包括表情内容的情况下,上文中所述的第一编码网络可以进一步包括对应于动作的第三编码子网络和对应于表情的第四编码子网络。将所述多个输入向量分别输入至第一编码网络,并且基于与所述特定符号对应的网络节点的注意力向量,确定 所述文本中的行为触发位置进一步包括:将所述多个输入向量分别输入至第三编码子网络,其中所述第三编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第三编码子网络中的网络节点的注意力向量,确定所述文本中的动作触发位置;以及将所述多个输入向量分别输入至第四编码子网络,其中所述第四编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第四编码子网络中的网络节点的注意力向量,确定所述文本中的表情触发位置。
这两个编码子网络的参数数量相同,但参数的值不同。具体结构和配置与上文中描述的第一编码网络类似,这里不再赘述。因此,对于同一个文本,基于不同的编码子网络,得到的动作触发位置和表情触发位置是不同的。相应的,第一分类网络也进一步包括对应于动作的第三分类子网络和对应于表情的第四分类子网络。这两个分类子网络的参数数量相同,但参数的值不同。具体结构和配置与上文中描述的第一分类网络类似,这里不再赘述。
并且,在所述行为内容既包括动作内容也包括表情内容的情况下,可以预先设置表情映射表和动作映射表,然后基于表情类别和行为类别,查找表情映射表以确定对应的表情内容,并且基于表情类别和行为类别,查找动作映射表以确定对应的动作内容。
此外,除了行为类别之外,还可以进一步基于文本确定所属的情感类别。在这种情况下,根据本公开的方法可以进一步包括以下操作:将所述多个输入向量分别输入至第二编码网络;将从所述第二编码网络输出的、与所述特定符号对应的第二编码向量输入至第二分类网络;以及基于所述第二分类网络的输出,确定情感类别。例如,情感类别可以包括但不限于:生气、开心等。这里,第二编码网络与第一编码网络是类似的,且两个网络的参数数量相同,但参数值根据情况可以相同,也可 以不同。例如,当行为内容仅包括表情内容时,第一编码网络与第二编码网络的参数可以相同。或者,当行为内容仅包括动作内容时,第一编码网络与第二编码网络的参数可以不同。
与上文中所述的第一编码网络和第一分类网络类似地,所述第二编码网络和第二分类网络也是需要训练的,且训练方法与与上文中所述的训练方法类似。可以使用带有Emoji表情的微博数据作为用于训练情绪类别的标记数据。
在这种情况下,至少基于所述行为类别,通过特定的行为映射,确定所述行为内容进一步包括:基于所述行为类别和所述情感类别,通过特定的行为映射,确定所述行为内容。
如果将行为类别看作是自变量,行为内容看作是因变量,那么情感类别可以看作是在行为类别的基础上,进一步增加了一个维度的自变量,用于确定最终的行为内容。
图6示出了根据本公开的一种实施例的虚拟形象行为控制的产品流程图。在图6中,示出了这样的实施例:其中,行为内容可以包括动作内容和表情内容二者,并且基于文本分别提取动作类别、表情类别和情感类别以及相应的动作触发位置和表情触发位置。
首先,将文本经过算法处理得到每一句文本对应的表情、动作和情感。例如,表情和动作可以选择目前应用广泛的Emoji表情和动作。当然,也可以增加更多常见的表情和动作,使得输出的表情和动作更加精细化。情感为文本所包含的情感分类,如生气、开心等。表情和动作的触发精确到字或词,即:文本中的某一个字或词将触发规定的动作和表情。
然后,在基于算法确定出初步的表情和动作后,分别通过动作映射表和表情映射表来确定当前文本应触发的表情和动作内容。由于每一句 文本未必都能得到动作、表情和情绪这三个参数,因此可能会出现只有动作、只有表情、只有情感、有动作和表情、有动作和情感、有表情和情感、三个参数都有这7种情况。图7示出了表情映射表的一种示例。图7所示的示例对应于具有动作、表情和情绪这三个参数的情况。其中,对应已有直播表情ID表示虚拟形象所要呈现的表情,动作ID、表情ID和情感ID分别对应于基于文本确定的表情、动作和情感。
图8示出了根据本公开的一种实施例的行为生成流程的示意图。在图8所示的实施例中,行为包括动作和表情二者,并且,基于文本分别提取动作类别、表情类别和情感类别以及相应的动作触发位置和表情触发位置。然后,基于动作类别、表情类别和情感类别,通过特定的映射规则,确定虚拟形象应该呈现的动作内容和表情内容。图8中的动作模型和表情模型都可以通过上文中所述的第一编码网络和第一分类网络来实现,只不过取决于具体的动作模型、表情模型和情感模型,对应的具体网络参数有所不同。
需要指出的是,这里的映射规则可以结合虚拟形象所处的当前场景进行进一步的筛选。例如,与新闻场景对应的映射规则将不会触发较为夸张的动作和表情。
此外,尽管图8示出了动作模型、表情模型和情感模型,但是如上文中所述,本公开并不限于此。例如,基于文本仅提取动作类别、仅提取表情类别、提取动作类别和情感类别、提取表情类别和情感类别、提取动作类别和表情类别等组合变体也都包括在本公开的范围内。
返回参照图1,最后,在确定出行为内容以及行为触发位置之后,在S104,播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
这里,考虑到真实的人在说话时进行的行为(如,表情)是连续自 然变化的,因此在控制所述虚拟形象呈现所述行为内容时,可以进一步对触发的行为进行细微调节。
具体地,控制所述虚拟形象呈现所述行为内容进一步包括:基于所述行为内容,调整所述虚拟形象的行为变化参数,使得所述虚拟形象从不呈现行为内容连贯地变化到呈现所述行为内容。例如,可以调节每一个行为变化参数,可调节的行为变化参数包括但不限于行为出现时间、行为结束时间、行为变化系数等,从而保证每一个行为的变化都是自然连贯拟人的。下面是用于实现行为变化参数调节的程序代码示例。在该段代码中,以表情为例,示出了具体的调节参数设置,包括在做出表情之前等待预定时段、表情淡入、表情保持时间段、表情淡出等,以保证每一个表情的变化都是自然连贯拟人的。
private static readonly double[]DefaultRandomRanges={
0,0.5  /*等待0秒到0.5秒后开始做表情*/,
0.3,0.5  /*表情淡入(从无到有)跨度在0.3秒到0.5秒之间*/,
0.75,1  /*表情最终的程度占所捏表情系数的比例在0.75到1之间*/,
0.5,1  /*表情保持的时间在0.5秒到1秒之间*/,
0.3,0.5  /*表情淡出(从有到无)跨度在0.3秒到0.5秒之间*/,
0.1,0.25  /*表情恢复的程度占所捏表情系数的比例在0.1到0.25之间*/,
2,4  /*下一段微表情(如果有)之前的保持时间在2秒到4秒之间*/
};
private static readonly double[]BlinkEyesDefaultRandomRanges={
0,0.5  /*等待0秒到0.5秒后开始做表情*/,
0.167,0.167  /*表情淡入(从无到有)为0.167秒*/,
1,1  /*表情淡入程度100%*/,
0,0  /*表情不保持*/,
0.167,0.167  /*表情淡出(从有到无)为0.167秒*/,
0,0  /*表情淡出至完全消失*/,
2,4  /*下一段微表情(如果有)之前的保持时间在2秒到4秒之间*/
};
在上文中,参照图1到图8详细地描述了根据本公开的基于文本的虚拟形象行为控制方法。可以看出,在根据本公开的方法中,通过数据驱动而非真人来驱动虚拟形象呈现相应的行为,因此可不间断运行且做到千人千面。并且基于文本提取不同的类别数据,再映射到虚拟形象的行为上,使得触发的行为是适合当前文本的,且与其他技术相比,该行为是丰富的。此外,由于基于预定的映射规则来确定虚拟形象呈现的行为,因此可拓展性强,可以不断地丰富行为内容,同时只需要更新映射规则就能使得虚拟形象呈现新增的行为。
此外,在本公开中,使用BERT模型来实现第一编码网络,不仅能够基于注意力机制估计行为触发位置,还能够在文本分类的准确率上有所提升。下表一分别示出了基于BERT模型的文本分类模型和基于CNN的文本分类模型在动作、表情和情感分类的准确度。
表一
方法\任务 动作 表情 情感
CNN 82.53% 74.38% 65.69%
BERT 87.23% 85.40% 77.14%
接下来,将参照图9描述根据本公开的实施例的基于文本的虚拟形象行为控制设备。如图9所示,所述设备1000包括:向量化装置1001、行为触发位置确定装置1002、行为内容确定装置1003和行为呈现装置1004。
向量化装置1001用于在文本中插入特定符号,并生成与所述特定符号和文本中的各个元素对应的多个输入向量,所述特定符号为用于表示文本分类的符号。
这里,文本通常为一句话。并且,例如,所述特定符号可以是用于表示文本分类的CLS(Classification)符号。并且,所述特定符号在所述文本中的插入位置可以是任意的。例如,可以将所述特定符号插入在所述文本之前,也可以将所述特定符号插入在所述文本之后,或者也可以将所述特定符号插入在所述文本的中间。
在插入特定符号之后,向量化装置1001分割所述文本中包含的各个元素。例如,所述元素可以是字,也可以是词。也就是说,可以以字为单位,对文本进行分割。或者,也可以以词为单位,对文本进行分割。然后,向量化装置1001将所述特定符号和文本中的各个元素转换为一系列能够表达文本语义的向量,即:将所述特定符号和文本中的各个元素映射或嵌入到另一个数值向量空间,从而生成对应的多个输入向量。
行为触发位置确定装置1002用于将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置,其中,所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到与该网络节点同一层中的每一个网络节点的注意力权重。例如,第一编码网络可以通过BERT模型来实现。
如上文中所述,由于虚拟形象是基于文本来做出对应的表情或动作,因此不仅需要基于文本,确定虚拟形象应该呈现的行为的具体内容,而且还需要确定虚拟形象应该在播放至文本的哪一个元素(字/词)所对应的音频时呈现相应的行为。与虚拟形象呈现相应行为的时刻对应的、文本中的元素位置就是行为触发位置。
在BERT模型中,基于注意力机制,利用上下文的字/词信息增强目标字/词的语义表示。并且,在根据本公开的BERT模型中,还进一步插入了用于表示文本分类的CLS(Classification)符号。与文本中包括的其他字/词相比,插入的CLS符号不具有明显的语义信息。从而,这个无明显语义信息的符号将会更“公平”地融合文本中各个字/词的语义信息。因此,与CLS符号对应的网络节点的注意力向量中各元素的权重值可以体现文本中各个字/词的重要性。如果注意力权重值越大,则表明对应的字/词的重要性越高。
在根据本公开的设备中,认为在文本中重要性最高的字/词位置处,控制虚拟形象呈现相应的行为是合适的。因此,行为触发位置确定装置1002将文本中重要性最高的字/词位置作为行为触发位置。由于与所述特定符号对应的网络节点的注意力向量能够体现文本中各个字/词的重要性,因此行为触发位置确定装置1002可以基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。
具体来说,当第一编码网络仅具有一层网络节点时,所述行为触发位置确定装置1002进一步被配置为:基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置。
当第一编码网络具有多层网络节点时,所述行为触发位置确定装置1002进一步被配置为:针对所述第一编码网络的每一层,计算该层中与所述特定符号对应的节点的注意力向量,确定所有层的注意力向量的平均值,以得到平均注意力向量;以及基于所述平均注意力向量中数值最大的元素的索引位置,确定所述行为触发位置。
行为内容确定装置1003用于基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容。
如上文中所述,所述第一编码网络输出与各输入向量对应的、融合 了上下文各个元素的语义的多个第一编码向量。由于在提供至第一编码网络的输入中插入了无明显语义信息的特定符号CLS,并且这个无明显语义信息的符号会更“公平”地融合文本中各个字/词的语义信息,因此将与该特定符号对应的输出的第一编码向量作为整句文本的语义表示,以便用于文本分类。
所述行为内容确定装置1003进一步被配置为:将从所述第一编码网络输出的、与所述特定符号对应的第一编码向量输入至第一分类网络;基于所述第一分类网络的输出,确定所述文本对应的行为类别;以及至少基于所述行为类别,通过特定的行为映射,确定所述行为内容。
所述第一分类网络可以是单层的神经网络,也可以是多层的神经网络。并且,当需要分类的类别有多种时,可以调整第一分类网络,使其具有更多的输出神经元,然后通过softmax函数归一化为取值范围从0到1的数值。具体地,所述第一分类网络的输出为与行为的类别数目相同维度的行为预测向量,其中每一个元素表示所述文本对应于相应的行为类别的概率值。所述行为内容确定装置1003将行为预测向量中最大概率对应的类别作为文本所属的行为类别。
或者,作为另一种可能的实施方式,所述行为内容确定装置1003进一步被配置为通过执行以下处理来实现基于所述第一分类网络的输出,确定行为类别:确定所述行为预测向量中的最大概率值;以及当所述最大概率值大于预定阈值时,将所述最大概率值对应的行为类别作为与所述文本对应的行为类别,否则,将与所述最大概率值对应的行为类别不同的特定类别确定为与所述文本对应的行为类别。
也就是说,在确定文本所属的行为类别时,所述行为内容确定装置1003进一步判断第一分类网络的行为预测结果的置信度。如果最大概率值小于预定阈值,则所述行为内容确定装置1003认为第一分类网络输 出的行为预测结果的置信度低。在这种情况下,所述行为内容确定装置1003不采用第一分类网络的预测结果,而是将文本所属的行为类别确定为与所述最大概率值对应的行为类别不同的特定类别。例如,所述特定类别可以是中性类别。另一方面,如果最大概率值大于预定阈值,则所述行为内容确定装置1003认为第一分类网络输出的行为预测结果的置信度高。在这种情况下,所述行为内容确定装置1003采用第一分类网络的预测结果。
最后,所述行为内容确定装置1003至少基于所述行为类别,通过特定的行为映射,确定所述行为内容。例如,可以通过查找预先设置的映射表,基于行为类别,确定所述行为内容。
如上文中所述,所述行为内容可以包括动作内容和表情内容中的至少一个。例如,所述行为内容可以仅包括动作内容,也可以仅包括表情内容,或者可以既包括动作内容也包括表情内容。例如,动作内容可以包括但不限于:比心、作揖、撇嘴、打哈欠、挖鼻等。表情内容可以包括但不限于:微笑、皱眉、不屑、大笑等。
在所述行为内容既包括动作内容也包括表情内容的情况下,上文中所述的第一编码网络可以进一步包括对应于动作的第三编码子网络和对应于表情的第四编码子网络。这两个编码子网络的参数数量相同,但参数的值不同。具体结构和配置与上文中描述的编码网络类似,这里不再赘述。因此,对于同一个文本,基于不同的编码子网络,得到的动作触发位置和表情触发位置是不同的。相应的,第一分类网络也进一步包括对应于动作的第三分类子网络和对应于表情的第四分类子网络。这两个分类子网络的参数数量相同,但参数的值不同。具体结构和配置与上文中描述的第一分类网络类似,这里不再赘述。
并且,在所述行为内容既包括动作内容也包括表情内容的情况下, 可以预先设置表情映射表和动作映射表,然后所述行为内容确定装置1003基于表情类别和行为类别,查找表情映射表以确定对应的表情内容,并且基于表情类别和行为类别,查找动作映射表以确定对应的动作内容。
此外,除了行为类别之外,还可以进一步基于文本确定所述文本所属的情感类别。在这种情况下,所述行为内容确定装置1003进一步被配置为:将所述多个输入向量分别输入至第二编码网络;将从所述第二编码网络输出的、与所述特定符号对应的第二编码向量输入至第二分类网络;以及基于所述第二分类网络的输出,确定所述文本所属的情感类别。
其中,所述行为内容确定装置1003进一步被配置为通过执行以下处理来实现至少基于所述行为类别,通过特定的行为映射,确定所述行为内容:基于所述行为类别和所述情感类别,通过特定的行为映射,确定所述行为内容。
如果将行为类别看作是自变量,行为内容看作是因变量,那么情感类别可以看作是在行为类别的基础上,进一步增加了一个维度的自变量,用于确定最终的行为内容。
最后,在所述行为触发位置确定装置1002确定出行为触发位置且所述行为内容确定装置1003确定出行为内容之后,所述行为呈现装置1004用于播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
这里,考虑到真实的人在说话时进行的行为(如,表情)是连续自然变化的,因此在控制所述虚拟形象呈现所述行为内容时,所述行为呈现装置1004可以进一步对触发的行为进行细微调节。
具体地,所述行为呈现装置1004可以进一步被配置为:基于所述行为内容,调整所述虚拟形象的行为变化参数,使得所述虚拟形象从不呈 现行为内容连贯地变化到呈现所述行为内容。例如,可调节的行为变化参数包括但不限于行为出现时间、行为结束时间、行为变化系数等,从而保证每一个行为的变化都是自然连贯拟人的。
可以看出,在根据本公开的设备中,通过数据驱动而非真人来驱动虚拟形象呈现相应的行为,因此可不间断运行且做到千人千面。并且基于文本提取不同的类别数据,再映射到虚拟形象的行为上,使得触发的行为是适合当前文本的,且与其他技术相比,该行为是丰富的。此外,由于基于预定的映射规则来确定虚拟形象呈现的行为,因此可拓展性强,可以不断地丰富行为内容,同时只需要更新映射规则就能使得虚拟形象呈现新增的行为。
此外,在本公开中,使用BERT模型来实现编码网络,不仅能够基于注意力机制估计行为触发位置,还能够在文本分类的准确率上有所提升。
由于根据本公开的实施例的虚拟形象行为控制设备与上文中所述的虚拟形象行为控制方法是完全对应的,因此在关于虚拟形象行为控制设备的描述中,并未对展开很多细节内容。本领域的技术人员可以理解,在上文中所述的虚拟形象行为控制方法的所有细节内容都可以类似地应用于虚拟形象行为控制设备中。
此外,根据本公开实施例的方法或设备也可以借助于图10所示的计算设备1100的架构来实现。如图10所示,计算设备1100可以包括总线1110、一个或多个CPU 1120、只读存储器(ROM)1130、随机存取存储器(RAM)1140、连接到网络的通信端口1150、输入/输出组件1160、硬盘1170等。计算设备1100中的存储设备,例如ROM 1130或硬盘1170可以存储本公开提供的虚拟形象行为控制方法的处理和/或通信使用的各种数据或文件以及CPU所执行的程序指令。当然,图10所示的架构 只是示例性的,在实现不同的设备时,根据实际需要,可以省略图10示出的计算设备中的一个或多个组件。
本公开的实施例也可以被实现为计算机可读存储介质。根据本公开实施例的计算机可读存储介质上存储有计算机可读指令。当所述计算机可读指令由处理器运行时,可以执行参照以上附图描述的根据本公开实施例的虚拟形象行为控制方法。所述计算机可读存储介质包括但不限于例如易失性存储器和/或非易失性存储器。所述易失性存储器例如可以包括随机存取存储器(RAM)和/或高速缓冲存储器(cache)等。所述非易失性存储器例如可以包括只读存储器(ROM)、硬盘、闪存等。
迄今为止,已经参照图1到图10详细描述了根据本公开的各实施例的虚拟形象行为控制方法和设备。在根据本公开的各实施例的虚拟形象行为控制方法设备中,通过数据驱动而非真人来驱动虚拟形象呈现相应的行为,因此可不间断运行且做到千人千面。并且基于文本提取不同的类别数据,再映射到虚拟形象的行为上,使得触发的行为是适合当前文本的,且与其他技术相比,该行为是丰富的。此外,由于基于预定的映射规则来确定虚拟形象呈现的行为,因此可拓展性强,可以不断地丰富行为内容,同时只需要更新映射规则就能使得虚拟形象呈现新增的行为。
需要说明的是,在本说明书中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者设备不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者设备所固有的要素。在没有更多限制的情况下,由语句“包括……”限定的要素,并不排除在包括所述要素的过程、方法、物品或者设备中还存在另外的相同要素。
最后,还需要说明的是,上述一系列处理不仅包括以这里所述的顺序按时间序列执行的处理,而且包括并行或分别地、而不是按时间顺序 执行的处理。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到本申请可借助软件加必需的硬件平台的方式来实现,当然也可以全部通过软件来实施。基于这样的理解,本申请的技术方案对背景技术做出贡献的全部或者部分可以以软件产品的形式体现出来,该计算机软件产品可以存储在存储介质中,如ROM/RAM、磁碟、光盘等,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例或者实施例的某些部分所述的方法。
以上对本申请进行了详细介绍,本文中应用了具体个例对本申请的原理及实施方式进行了阐述,以上实施例的说明只是用于帮助理解本申请的方法及其核心思想;同时,对于本领域的一般技术人员,依据本申请的思想,在具体实施方式及应用范围上均会有改变之处,综上所述,本说明书内容不应理解为对本申请的限制。

Claims (15)

  1. 一种基于文本的虚拟形象行为控制方法,由电子设备执行,包括:
    在文本中插入特定符号,并生成与所述特定符号和文本中的各个元素对应的多个输入向量;所述特定符号为用于表示文本分类的符号;
    将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置,其中,所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到与该网络节点同一层中的每一个网络节点的注意力权重;
    基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容;以及
    播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
  2. 根据权利要求1所述的方法,其中,基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置包括:
    针对所述第一编码网络的每一层,计算该层中与所述特定符号对应的网络节点的注意力向量,确定所有层的注意力向量的平均值,以得到平均注意力向量;以及
    基于所述平均注意力向量中数值最大的元素的索引位置,确定所述行为触发位置。
  3. 根据权利要求1所述的方法,其中所述第一编码网络输出与各输入向量对应的、融合了上下文各个元素的语义的多个第一编码向量,并且
    其中基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容包括:
    将从所述第一编码网络输出的、与所述特定符号对应的第一编码向量输入至第一分类网络;
    基于所述第一分类网络的输出,确定所述文本对应的行为类别;以及
    至少基于所述行为类别,通过特定的行为映射,确定所述行为内容。
  4. 根据权利要求3所述的方法,其中所述特定的行为映射包括行为映射表,并且
    其中至少基于所述行为类别,通过特定的行为映射,确定所述行为内容进一步包括:
    在所述行为映射表中,查找与所述行为类别对应的行为内容,并将其确定为所述行为内容。
  5. 根据权利要求3所述的方法,其中针对所述虚拟形象的不同应用场景,所述特定的行为映射是不同的。
  6. 根据权利要求3所述的方法,其中所述第一分类网络的输出为行为预测向量,所述行为预测向量的维度与行为类别的数目相同,其中所述行为预测向量的每一个元素表示所述文本对应于相应的行为类别的概率值。
  7. 根据权利要求6所述的方法,其中基于所述第一分类网络的输出,确定所述文本对应的行为类别包括:
    确定所述行为预测向量中的最大概率值;以及
    当所述最大概率值大于预定阈值时,将所述最大概率值对应的行为类别作为与所述文本对应的行为类别;否则,将与所述最大概率值对应的行为类别不同的特定类别确定为所述文本对应的行为类别。
  8. 根据权利要求3所述的方法,进一步包括:
    将所述多个输入向量分别输入至第二编码网络;
    将从所述第二编码网络输出的、与所述特定符号对应的第二编码向量输入至第二分类网络;以及
    基于所述第二分类网络的输出,确定所述文本对应的情感类别,
    其中至少基于所述行为类别,通过特定的行为映射,确定所述行为内容进一步包括:
    基于所述行为类别和所述情感类别,通过特定的行为映射,确定所述行为内容。
  9. 根据权利要求1至8任一项所述的方法,其中所述行为内容包括动作内容和表情内容中的至少一个。
  10. 根据权利要求9所述的方法,其中当所述行为内容包括动作内容和表情内容二者时,所述第一编码网络包括第三编码子网络和第四编码子网络,并且
    其中将所述多个输入向量分别输入至第一编码网络,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置进一步包括:
    将所述多个输入向量分别输入至第三编码子网络,其中所述第三编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第三编码子网络中的网络节点的注意力向量,确定所述文本中的动作触发位置;以及
    将所述多个输入向量分别输入至第四编码子网络,其中所述第四编码子网络包括至少一层网络节点,并且基于与所述特定符号对应的、所述第四编码子网络中的网络节点的注意力向量,确定所述文本中的表情触发位置。
  11. 根据权利要求1至10任一项所述的方法,其中控制所述虚拟形象呈现所述行为内容进一步包括:
    基于所述行为内容,调整所述虚拟形象的行为变化参数,使得所述虚拟形象从不呈现行为内容连贯地变化到呈现所述行为内容。
  12. 根据权利要求11所述的方法,其中所述行为变化参数至少包括以下之一:行为出现时间、行为结束时间、行为变化系数。
  13. 一种基于文本的虚拟形象行为控制设备,包括:
    向量化装置,用于在文本中插入特定符号,并生成与所述特定符号和文本中的各个元素对应的多个输入向量;所述特定符号为用于表示文本分类的符号;
    行为触发位置确定装置,用于将所述多个输入向量分别输入至第一编码网络,其中所述第一编码网络包括至少一层网络节点,并且基于与所述特定符号对应的网络节点的注意力向量,确定所述文本中的行为触发位置,其中,所述注意力向量中的每一个元素分别指示从与所述特定符号对应的网络节点到与该网络节点同一层中的每一个网络节点的注意力权重;
    行为内容确定装置,用于基于从所述第一编码网络输出的、与所述特定符号对应的第一编码向量,确定行为内容;以及
    行为呈现装置,用于播放与所述文本对应的音频,并且当播放到所述行为触发位置时,控制所述虚拟形象呈现所述行为内容。
  14. 一种计算机设备,包括:
    处理器;
    与所述处理器相连接的存储器;所述存储器中存储有机器可读指令;所述机器可读指令在被处理器执行时,使得所述处理器执行如权利要求1-12中任一项所述的方法。
  15. 一种计算机可读记录介质,其上存储有指令,所述指令在被处理器执行时,使得所述处理器执行如权利要求1-12中任一项所述的方法。
PCT/CN2020/113147 2019-09-23 2020-09-03 基于文本的虚拟形象行为控制方法、设备和介质 WO2021057424A1 (zh)

Priority Applications (3)

Application Number Priority Date Filing Date Title
EP20867870.6A EP3926525A4 (en) 2019-09-23 2020-09-03 Virtual image behavior control method and device based on text, and medium
JP2021564427A JP7210774B2 (ja) 2019-09-23 2020-09-03 テキストに基づくアバターの行動制御方法、デバイス及びコンピュータプログラム
US17/480,112 US11714879B2 (en) 2019-09-23 2021-09-20 Method and device for behavior control of virtual image based on text, and medium

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910898521.6A CN110598671B (zh) 2019-09-23 2019-09-23 基于文本的虚拟形象行为控制方法、设备和介质
CN201910898521.6 2019-09-23

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/480,112 Continuation US11714879B2 (en) 2019-09-23 2021-09-20 Method and device for behavior control of virtual image based on text, and medium

Publications (1)

Publication Number Publication Date
WO2021057424A1 true WO2021057424A1 (zh) 2021-04-01

Family

ID=68862313

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/113147 WO2021057424A1 (zh) 2019-09-23 2020-09-03 基于文本的虚拟形象行为控制方法、设备和介质

Country Status (5)

Country Link
US (1) US11714879B2 (zh)
EP (1) EP3926525A4 (zh)
JP (1) JP7210774B2 (zh)
CN (1) CN110598671B (zh)
WO (1) WO2021057424A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936283A (zh) * 2022-05-18 2022-08-23 电子科技大学 一种基于Bert的网络舆情分析方法

Families Citing this family (59)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9318108B2 (en) 2010-01-18 2016-04-19 Apple Inc. Intelligent automated assistant
US8977255B2 (en) 2007-04-03 2015-03-10 Apple Inc. Method and system for operating a multi-function portable electronic device using voice-activation
US8676904B2 (en) 2008-10-02 2014-03-18 Apple Inc. Electronic devices with voice command and contextual data processing capabilities
US10706373B2 (en) 2011-06-03 2020-07-07 Apple Inc. Performing actions associated with task items that represent tasks to perform
US10417037B2 (en) 2012-05-15 2019-09-17 Apple Inc. Systems and methods for integrating third party services with a digital assistant
KR20150104615A (ko) 2013-02-07 2015-09-15 애플 인크. 디지털 어시스턴트를 위한 음성 트리거
US10652394B2 (en) 2013-03-14 2020-05-12 Apple Inc. System and method for processing voicemail
US10748529B1 (en) 2013-03-15 2020-08-18 Apple Inc. Voice activated device for use with a voice-based digital assistant
US10176167B2 (en) 2013-06-09 2019-01-08 Apple Inc. System and method for inferring user intent from speech inputs
CN105453026A (zh) 2013-08-06 2016-03-30 苹果公司 基于来自远程设备的活动自动激活智能响应
TWI566107B (zh) 2014-05-30 2017-01-11 蘋果公司 用於處理多部分語音命令之方法、非暫時性電腦可讀儲存媒體及電子裝置
US9715875B2 (en) 2014-05-30 2017-07-25 Apple Inc. Reducing the need for manual start/end-pointing and trigger phrases
US10170123B2 (en) 2014-05-30 2019-01-01 Apple Inc. Intelligent assistant for home automation
US9338493B2 (en) 2014-06-30 2016-05-10 Apple Inc. Intelligent automated assistant for TV user interactions
US9886953B2 (en) 2015-03-08 2018-02-06 Apple Inc. Virtual assistant activation
US10460227B2 (en) 2015-05-15 2019-10-29 Apple Inc. Virtual assistant in a communication session
US10200824B2 (en) 2015-05-27 2019-02-05 Apple Inc. Systems and methods for proactively identifying and surfacing relevant content on a touch-sensitive device
US20160378747A1 (en) 2015-06-29 2016-12-29 Apple Inc. Virtual assistant for media playback
US10671428B2 (en) 2015-09-08 2020-06-02 Apple Inc. Distributed personal assistant
US10747498B2 (en) 2015-09-08 2020-08-18 Apple Inc. Zero latency digital assistant
US10740384B2 (en) 2015-09-08 2020-08-11 Apple Inc. Intelligent automated assistant for media search and playback
US10331312B2 (en) 2015-09-08 2019-06-25 Apple Inc. Intelligent automated assistant in a media environment
US10691473B2 (en) 2015-11-06 2020-06-23 Apple Inc. Intelligent automated assistant in a messaging environment
US10956666B2 (en) 2015-11-09 2021-03-23 Apple Inc. Unconventional virtual assistant interactions
US10223066B2 (en) 2015-12-23 2019-03-05 Apple Inc. Proactive assistance based on dialog communication between devices
US10586535B2 (en) 2016-06-10 2020-03-10 Apple Inc. Intelligent digital assistant in a multi-tasking environment
DK201670540A1 (en) 2016-06-11 2018-01-08 Apple Inc Application integration with a digital assistant
DK179415B1 (en) 2016-06-11 2018-06-14 Apple Inc Intelligent device arbitration and control
US10726832B2 (en) 2017-05-11 2020-07-28 Apple Inc. Maintaining privacy of personal information
DK180048B1 (en) 2017-05-11 2020-02-04 Apple Inc. MAINTAINING THE DATA PROTECTION OF PERSONAL INFORMATION
DK179745B1 (en) 2017-05-12 2019-05-01 Apple Inc. SYNCHRONIZATION AND TASK DELEGATION OF A DIGITAL ASSISTANT
DK179496B1 (en) 2017-05-12 2019-01-15 Apple Inc. USER-SPECIFIC Acoustic Models
DK201770429A1 (en) 2017-05-12 2018-12-14 Apple Inc. LOW-LATENCY INTELLIGENT AUTOMATED ASSISTANT
DK201770411A1 (en) 2017-05-15 2018-12-20 Apple Inc. MULTI-MODAL INTERFACES
US20180336892A1 (en) 2017-05-16 2018-11-22 Apple Inc. Detecting a trigger of a digital assistant
US20180336275A1 (en) 2017-05-16 2018-11-22 Apple Inc. Intelligent automated assistant for media exploration
US10818288B2 (en) 2018-03-26 2020-10-27 Apple Inc. Natural assistant interaction
US11145294B2 (en) 2018-05-07 2021-10-12 Apple Inc. Intelligent automated assistant for delivering content from user experiences
US10928918B2 (en) 2018-05-07 2021-02-23 Apple Inc. Raise to speak
US10892996B2 (en) 2018-06-01 2021-01-12 Apple Inc. Variable latency device coordination
DK180639B1 (en) 2018-06-01 2021-11-04 Apple Inc DISABILITY OF ATTENTION-ATTENTIVE VIRTUAL ASSISTANT
DK179822B1 (da) 2018-06-01 2019-07-12 Apple Inc. Voice interaction at a primary device to access call functionality of a companion device
US11462215B2 (en) 2018-09-28 2022-10-04 Apple Inc. Multi-modal inputs for voice commands
US11348573B2 (en) 2019-03-18 2022-05-31 Apple Inc. Multimodality in digital assistant systems
DK201970509A1 (en) 2019-05-06 2021-01-15 Apple Inc Spoken notifications
US11307752B2 (en) 2019-05-06 2022-04-19 Apple Inc. User configurable task triggers
US11140099B2 (en) 2019-05-21 2021-10-05 Apple Inc. Providing message response suggestions
DK180129B1 (en) 2019-05-31 2020-06-02 Apple Inc. USER ACTIVITY SHORTCUT SUGGESTIONS
DK201970510A1 (en) 2019-05-31 2021-02-11 Apple Inc Voice identification in digital assistant systems
US11227599B2 (en) 2019-06-01 2022-01-18 Apple Inc. Methods and user interfaces for voice-based control of electronic devices
CN110598671B (zh) * 2019-09-23 2022-09-27 腾讯科技(深圳)有限公司 基于文本的虚拟形象行为控制方法、设备和介质
US11593984B2 (en) 2020-02-07 2023-02-28 Apple Inc. Using text for avatar animation
US11043220B1 (en) 2020-05-11 2021-06-22 Apple Inc. Digital assistant hardware abstraction
US11061543B1 (en) 2020-05-11 2021-07-13 Apple Inc. Providing relevant data items based on context
US11755276B2 (en) 2020-05-12 2023-09-12 Apple Inc. Reducing description length based on confidence
US11490204B2 (en) 2020-07-20 2022-11-01 Apple Inc. Multi-device audio adjustment coordination
US11438683B2 (en) 2020-07-21 2022-09-06 Apple Inc. User identification using headphones
CN113194350B (zh) * 2021-04-30 2022-08-19 百度在线网络技术(北京)有限公司 推送待播报数据、播报数据的方法和装置
CN116168134B (zh) * 2022-12-28 2024-01-02 北京百度网讯科技有限公司 数字人的控制方法、装置、电子设备以及存储介质

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737397A (zh) * 2012-05-25 2012-10-17 北京工业大学 基于运动偏移映射的有韵律头部运动合成方法
CN103761963A (zh) * 2014-02-18 2014-04-30 大陆汽车投资(上海)有限公司 包含情感类信息的文本的处理方法
US20140356822A1 (en) * 2013-06-03 2014-12-04 Massachusetts Institute Of Technology Methods and apparatus for conversation coach
CN104866101A (zh) * 2015-05-27 2015-08-26 世优(北京)科技有限公司 虚拟对象的实时互动控制方法及装置
CN106653052A (zh) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 虚拟人脸动画的生成方法及装置
CN107329990A (zh) * 2017-06-06 2017-11-07 北京光年无限科技有限公司 一种用于虚拟机器人的情绪输出方法以及对话交互系统
CN108595601A (zh) * 2018-04-20 2018-09-28 福州大学 一种融入Attention机制的长文本情感分析方法
CN109118562A (zh) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 虚拟形象的讲解视频制作方法、装置以及终端
CN109783641A (zh) * 2019-01-08 2019-05-21 中山大学 一种基于双向-gru和改进的注意力机制的实体关系分类方法
CN110598671A (zh) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 基于文本的虚拟形象行为控制方法、设备和介质

Family Cites Families (37)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4726065A (en) * 1984-01-26 1988-02-16 Horst Froessl Image manipulation by speech signals
US5151998A (en) * 1988-12-30 1992-09-29 Macromedia, Inc. sound editing system using control line for altering specified characteristic of adjacent segment of the stored waveform
CA2115210C (en) * 1993-04-21 1997-09-23 Joseph C. Andreshak Interactive computer system recognizing spoken commands
US5832428A (en) * 1995-10-04 1998-11-03 Apple Computer, Inc. Search engine for phrase recognition based on prefix/body/suffix architecture
GB9602691D0 (en) * 1996-02-09 1996-04-10 Canon Kk Word model generation
GB9602701D0 (en) * 1996-02-09 1996-04-10 Canon Kk Image manipulation
JP2000167244A (ja) * 1998-12-11 2000-06-20 Konami Computer Entertainment Osaka:Kk ビデオゲーム装置、ビデオキャラクタに対する疑似チームへの入部勧誘処理制御方法及びビデオキャラクタに対する疑似チームへの入部勧誘処理制御プログラムを記録した可読記録媒体
JP2006048379A (ja) * 2004-08-04 2006-02-16 Ntt Docomo Hokuriku Inc コンテンツ生成装置
US9613450B2 (en) * 2011-05-03 2017-04-04 Microsoft Technology Licensing, Llc Photo-realistic synthesis of three dimensional animation with facial features synchronized with speech
US8676937B2 (en) * 2011-05-12 2014-03-18 Jeffrey Alan Rapaport Social-topical adaptive networking (STAN) system allowing for group based contextual transaction offers and acceptances and hot topic watchdogging
TWI453628B (zh) * 2012-01-12 2014-09-21 Amtran Technology Co Ltd 適應性調整虛擬按鍵尺寸的方法及其顯示裝置
WO2016070354A1 (en) * 2014-11-05 2016-05-12 Intel Corporation Avatar video apparatus and method
US10546015B2 (en) * 2015-12-01 2020-01-28 Facebook, Inc. Determining and utilizing contextual meaning of digital standardized image characters
WO2018097439A1 (ko) * 2016-11-28 2018-05-31 삼성전자 주식회사 발화의 문맥을 공유하여 번역을 수행하는 전자 장치 및 그 동작 방법
US20180315415A1 (en) * 2017-04-26 2018-11-01 Soundhound, Inc. Virtual assistant with error identification
WO2019011968A1 (en) * 2017-07-11 2019-01-17 Deepmind Technologies Limited LEARNING VISUAL CONCEPTS THROUGH NEURONAL NETWORKS
CN108304388B (zh) * 2017-09-12 2020-07-07 腾讯科技(深圳)有限公司 机器翻译方法及装置
US20190220474A1 (en) * 2018-01-16 2019-07-18 Entigenlogic Llc Utilizing multiple knowledge bases to form a query response
US11003856B2 (en) * 2018-02-22 2021-05-11 Google Llc Processing text using neural networks
US10878817B2 (en) * 2018-02-24 2020-12-29 Twenty Lane Media, LLC Systems and methods for generating comedy
US10642939B2 (en) * 2018-02-24 2020-05-05 Twenty Lane Media, LLC Systems and methods for generating jokes
CN108595590A (zh) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 一种基于融合注意力模型的中文文本分类方法
US20210365643A1 (en) * 2018-09-27 2021-11-25 Oracle International Corporation Natural language outputs for path prescriber model simulation for nodes in a time-series network
CN109377797A (zh) * 2018-11-08 2019-02-22 北京葡萄智学科技有限公司 虚拟人物教学方法及装置
CN109859760A (zh) * 2019-02-19 2019-06-07 成都富王科技有限公司 基于深度学习的电话机器人语音识别结果校正方法
US11790171B2 (en) * 2019-04-16 2023-10-17 Covera Health Computer-implemented natural language understanding of medical reports
CN110013671B (zh) * 2019-05-05 2020-07-28 腾讯科技(深圳)有限公司 动作执行方法和装置、存储介质及电子装置
US11170774B2 (en) * 2019-05-21 2021-11-09 Qualcomm Incorproated Virtual assistant device
US11604981B2 (en) * 2019-07-01 2023-03-14 Adobe Inc. Training digital content classification models utilizing batchwise weighted loss functions and scaled padding based on source density
CN112487182B (zh) * 2019-09-12 2024-04-12 华为技术有限公司 文本处理模型的训练方法、文本处理方法及装置
US20210304736A1 (en) * 2020-03-30 2021-09-30 Nvidia Corporation Media engagement through deep learning
US20210344798A1 (en) * 2020-05-01 2021-11-04 Walla Technologies Llc Insurance information systems
US11023688B1 (en) * 2020-05-27 2021-06-01 Roblox Corporation Generation of text tags from game communication transcripts
US11620829B2 (en) * 2020-09-30 2023-04-04 Snap Inc. Visual matching with a messaging application
US11386625B2 (en) * 2020-09-30 2022-07-12 Snap Inc. 3D graphic interaction based on scan
US11077367B1 (en) * 2020-10-09 2021-08-03 Mythical, Inc. Systems and methods for using natural language processing (NLP) to control automated gameplay
TWI746214B (zh) * 2020-10-19 2021-11-11 財團法人資訊工業策進會 機器閱讀理解方法、機器閱讀理解裝置及非暫態電腦可讀取媒體

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102737397A (zh) * 2012-05-25 2012-10-17 北京工业大学 基于运动偏移映射的有韵律头部运动合成方法
US20140356822A1 (en) * 2013-06-03 2014-12-04 Massachusetts Institute Of Technology Methods and apparatus for conversation coach
CN103761963A (zh) * 2014-02-18 2014-04-30 大陆汽车投资(上海)有限公司 包含情感类信息的文本的处理方法
CN104866101A (zh) * 2015-05-27 2015-08-26 世优(北京)科技有限公司 虚拟对象的实时互动控制方法及装置
CN106653052A (zh) * 2016-12-29 2017-05-10 Tcl集团股份有限公司 虚拟人脸动画的生成方法及装置
CN107329990A (zh) * 2017-06-06 2017-11-07 北京光年无限科技有限公司 一种用于虚拟机器人的情绪输出方法以及对话交互系统
CN108595601A (zh) * 2018-04-20 2018-09-28 福州大学 一种融入Attention机制的长文本情感分析方法
CN109118562A (zh) * 2018-08-31 2019-01-01 百度在线网络技术(北京)有限公司 虚拟形象的讲解视频制作方法、装置以及终端
CN109783641A (zh) * 2019-01-08 2019-05-21 中山大学 一种基于双向-gru和改进的注意力机制的实体关系分类方法
CN110598671A (zh) * 2019-09-23 2019-12-20 腾讯科技(深圳)有限公司 基于文本的虚拟形象行为控制方法、设备和介质

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
CHEN, YIQIANG ET AL.: "Text-Driven Synthesis of Multimodal Behavior for Virtual Human", THE 7TH GRADUATE ACADEMIC CONFERENCE OF COMPUTER SCIENCE AND TECHNOLOGY, HELD BY INSTITUTE OF COMPUTING TECHNOLOGY, CHINESE ACADEMY OF SCIENCES), 16 August 2007 (2007-08-16), pages 1 - 5, XP009527190 *
See also references of EP3926525A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114936283A (zh) * 2022-05-18 2022-08-23 电子科技大学 一种基于Bert的网络舆情分析方法
CN114936283B (zh) * 2022-05-18 2023-12-26 电子科技大学 一种基于Bert的网络舆情分析方法

Also Published As

Publication number Publication date
EP3926525A4 (en) 2022-06-29
US11714879B2 (en) 2023-08-01
JP7210774B2 (ja) 2023-01-23
EP3926525A1 (en) 2021-12-22
US20220004825A1 (en) 2022-01-06
CN110598671B (zh) 2022-09-27
CN110598671A (zh) 2019-12-20
JP2022531855A (ja) 2022-07-12

Similar Documents

Publication Publication Date Title
WO2021057424A1 (zh) 基于文本的虚拟形象行为控制方法、设备和介质
CN110717017B (zh) 一种处理语料的方法
US11934791B2 (en) On-device projection neural networks for natural language understanding
CN111368996B (zh) 可传递自然语言表示的重新训练投影网络
CN109844741B (zh) 在自动聊天中生成响应
CN106845411B (zh) 一种基于深度学习和概率图模型的视频描述生成方法
Nyatsanga et al. A Comprehensive Review of Data‐Driven Co‐Speech Gesture Generation
Liu et al. A multi-modal chinese poetry generation model
WO2023284435A1 (zh) 生成动画的方法及装置
CN110069611B (zh) 一种主题增强的聊天机器人回复生成方法及装置
CN115329779A (zh) 一种多人对话情感识别方法
CN108921032A (zh) 一种新的基于深度学习模型的视频语义提取方法
Huang et al. C-Rnn: a fine-grained language model for image captioning
CN116611496A (zh) 文本到图像的生成模型优化方法、装置、设备及存储介质
Wan et al. Midoriko chatbot: LSTM-based emotional 3D avatar
CN113627550A (zh) 一种基于多模态融合的图文情感分析方法
Farella et al. Question Answering with BERT: designing a 3D virtual avatar for Cultural Heritage exploration
CN116958342A (zh) 虚拟形象的动作生成方法、动作库的构建方法及装置
CN114743056A (zh) 一种基于动态早退的图像描述生成模型及模型训练方法
Zhao et al. Generating diverse gestures from speech using memory networks as dynamic dictionaries
KR20220069403A (ko) 하이라이팅 기능이 포함된 감정 분석 서비스를 위한 방법 및 장치
Yang et al. Film review sentiment classification based on BiGRU and attention
Zhao et al. Improving diversity of speech‐driven gesture generation with memory networks as dynamic dictionaries
WO2024066549A1 (zh) 一种数据处理方法及相关设备
Zeng et al. Research on the Application of Deep Learning Technology in Intelligent Dialogue Robots

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20867870

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2020867870

Country of ref document: EP

Effective date: 20210913

ENP Entry into the national phase

Ref document number: 2021564427

Country of ref document: JP

Kind code of ref document: A

NENP Non-entry into the national phase

Ref country code: DE