CN110688008A - Virtual image interaction method and device


Info

Publication number: CN110688008A
Authority: CN (China)
Prior art keywords: instruction information, information, natural language, avatar, interaction
Legal status: Pending
Application number: CN201910925952.7A
Other languages: Chinese (zh)
Inventor: 周永吉
Current Assignee: Guizhou Little Love Robot Technology Co Ltd
Original Assignee: Guizhou Little Love Robot Technology Co Ltd
Application filed by Guizhou Little Love Robot Technology Co Ltd
Priority to: CN201910925952.7A
Publication of: CN110688008A

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 3/00 - Input arrangements for transferring data to be processed into a form capable of being handled by the computer; output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F 3/01 - Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F 3/011 - Arrangements for interaction with the human body, e.g. for user immersion in virtual reality

Abstract

The embodiments of the present application provide an avatar interaction method, an avatar interaction apparatus, an electronic device, and a computer-readable storage medium, which address the problems of the large amount of information and the large amount of computation involved when interaction is carried out based on an avatar in the prior art. The avatar interaction method comprises the following steps: acquiring interaction instruction information of a user; inputting the interaction instruction information into an action generation model, wherein the action generation model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the avatar according to the interaction instruction information; and driving the plurality of key points of the avatar to respectively generate corresponding actions according to the plurality of action data.

Description

Virtual image interaction method and device
Technical Field
The present application relates to the field of electronic communication technologies, and in particular, to an avatar interaction method, an avatar interaction apparatus, an electronic device, and a computer-readable storage medium.
Background
With the continuous development of computer technology, the ways in which users communicate with each other keep evolving, and interaction through avatars has become one of the hot spots of current Internet communication. In the prior art, although interaction between users can be completed through an avatar, every frame of the image corresponding to the avatar has to be generated when the user's interaction instruction is presented through the avatar. The amount of information needed to generate the complete image is large, and the required amount of computation is correspondingly large, which places an unnecessary burden on computing and storage hardware resources during real-time interaction and degrades the user's real-time experience when interacting through the avatar.
Disclosure of Invention
In view of this, embodiments of the present application provide an avatar interaction method, an avatar interaction apparatus, an electronic device, and a computer-readable storage medium, which address the problems of the large amount of information and the large amount of computation required when interaction is performed based on an avatar in the prior art.
According to an aspect of the present application, an avatar interaction method provided by an embodiment of the present application includes: acquiring interactive instruction information of a user; inputting the interaction instruction information into an action generating model, wherein the action generating model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the virtual image according to the interaction instruction information; and driving the plurality of key points of the virtual image to respectively generate corresponding actions according to the plurality of action data.
According to another aspect of the present application, an avatar interaction apparatus provided by an embodiment of the present application includes: the acquisition module is configured to acquire interactive instruction information of a user; the action generation model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the virtual image according to the interactive instruction information from the acquisition module; and the driving module is configured to drive the plurality of key points of the virtual image to respectively generate corresponding actions according to the plurality of action data.
According to another aspect of the present application, an embodiment of the present application provides an electronic device including: a processor; and a memory having computer program instructions stored therein, which when executed by the processor, cause the processor to perform the avatar interaction method of any of the preceding claims.
According to another aspect of the present application, an embodiment of the present application provides a computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, cause the processor to perform the avatar interaction method as described in any one of the preceding.
According to the avatar interaction method, the avatar interaction apparatus, the electronic device and the computer-readable storage medium provided herein, an action generation model is used to obtain a plurality of action data respectively corresponding to a plurality of key points of the avatar based on the interaction instruction information of the user. Therefore, when the user interacts through the avatar, a complete image corresponding to the avatar's action does not need to be generated; the corresponding action can be produced simply by driving the plurality of key points of the avatar based on the action data. Because the data volume of the key-point action data is small, the demand on device hardware is greatly reduced, which not only improves the user's real-time interaction experience but also allows low-end hardware terminals to run this avatar interaction mode, broadening the range of terminals to which avatar interaction can be applied and lowering the hardware budget requirement.
Drawings
Fig. 1 is a schematic flow chart of an avatar interaction method according to an embodiment of the present application.
Fig. 2 is a schematic flowchart illustrating a training process of an action generation model in an avatar interaction method according to an embodiment of the present application.
Fig. 3 is a flowchart illustrating an avatar interaction method according to another embodiment of the present application.
Fig. 4 is a schematic flow chart illustrating obtaining of interaction instruction information in an avatar interaction method according to an embodiment of the present application.
Fig. 5 is a schematic flow chart illustrating generation of interaction instruction information according to natural language information in an avatar interaction method according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an avatar interaction apparatus according to an embodiment of the present application.
Fig. 7 is a schematic structural diagram of an avatar interaction apparatus according to another embodiment of the present application.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
Fig. 1 is a schematic flow chart of an avatar interaction method according to an embodiment of the present application. As shown in fig. 1, the avatar interaction method includes the steps of:
step 101: and acquiring the interactive instruction information of the user.
The interactive instruction information of the user is reference information used for driving the virtual image to generate corresponding actions, and what corresponding actions the virtual image needs to generate can be determined through the interactive instruction information so as to meet the requirements of the current interactive scene. It should be understood that the interaction instruction information may be implemented in various ways according to different interaction ways with the user in a specific application scenario. In an embodiment of the present application, the interactive instruction information may include one or more of the following information: voice instruction information, character instruction information, action definition instruction information (e.g., raising hand instruction information, lowering head instruction information, etc.), emotion definition instruction information (e.g., happy instruction information, sad instruction information, anger instruction information, etc.), sensor information (e.g., taste information, distance information, ambient temperature information, etc.), and image recognition results (e.g., image-recognized facial features, limb actions, target detection results, etc.).
The interactive instruction information of the user can be explicit instruction information directly input by the user, or can be generated in real time according to some collected user information or information input by the user. For example, the user may directly input the voice instruction information with explicit content of "smile" by voice, or may generate the action definition instruction information with corresponding content of "nod" or smile in real time after recognizing the body or face action (e.g., nod or smile) of the user. The specific source and the specific content of the interactive instruction information for guiding the action of the virtual image are not strictly limited.
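For illustration only, the kinds of interaction instruction information listed above could be carried in a simple tagged structure such as the Python sketch below; the InstructionType and InteractionInstruction names are hypothetical and are not taken from the embodiments.

```python
from dataclasses import dataclass, field
from enum import Enum, auto
from typing import Any, Dict


class InstructionType(Enum):
    """Hypothetical tags for the instruction kinds named in the description."""
    VOICE = auto()
    TEXT = auto()
    ACTION_DEFINITION = auto()   # e.g. "raise hand", "lower head"
    EMOTION_DEFINITION = auto()  # e.g. "happy", "sad", "angry"
    SENSOR = auto()              # e.g. distance, ambient temperature
    IMAGE_RECOGNITION = auto()   # e.g. a recognized facial or limb action


@dataclass
class InteractionInstruction:
    """One piece of interaction instruction information."""
    kind: InstructionType
    content: str                                            # e.g. "smile", "nod"
    payload: Dict[str, Any] = field(default_factory=dict)   # raw sensor values, recognition details, etc.


# An explicit voice instruction, and an instruction derived in real time from a recognized nod.
explicit = InteractionInstruction(InstructionType.VOICE, "smile")
derived = InteractionInstruction(InstructionType.ACTION_DEFINITION, "nod", {"source": "face_tracking"})
```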
Step 102: inputting the interactive instruction information into an action generating model, wherein the action generating model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the avatar according to the interactive instruction information.
The virtual image is a computer-generated image which replaces a user to interact with, can be a two-dimensional image or a three-dimensional image generated by a computer modeling technology, and can be a virtual human, a virtual cartoon image and the like in a specific form. The virtual image is an image of a virtual human being, a cartoon character or other human-like characters which can say, move and have expressions, constructed by adopting the related technology of computer graphics, so that various devices with screens (including mobile phones, large screens and the like) and virtual reality devices can be used as objects for human-computer interaction. The virtual image can be widely applied to scenes such as public places, display centers, personal equipment and the like for man-machine interaction and improving user experience. The specific presentation form of the avatar is not limited in the present application, but it should be understood that the corresponding key points may be different according to the different presentation forms of the avatar. For example, in one embodiment of the present application, the avatar may be a human-like avatar, and the plurality of key points of the avatar may include one or more of the following combinations: body joint feature points, body bone part feature points, facial expression feature points, and mouth feature points.
The action generating model can be established based on a pre-learning training process, and the trained action generating model can directly output a plurality of action data respectively corresponding to a plurality of key points of the virtual image according to the received interactive instruction information. These motion data may be a combination of spatial data, such as displacement, rotation angle and direction, that drive the keypoints to produce corresponding motions. It should be understood that the specific content and form of these motion data are related to the location of the key points and the specific application scenario, for example, the key points of the shoulder joint part may only rotate, so the corresponding motion data may be the rotation angle, and the key points of the lip on the face may rotate and displace, so the corresponding motion data may include both the rotation angle and the displacement. Further, if the avatar is a two-dimensional avatar, the motion data is two-dimensional spatial information; and if the avatar is a three-dimensional avatar, the motion data is three-dimensional spatial information. The specific content and form of the motion data are not strictly limited in this application.
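As a minimal sketch of what such per-key-point action data might look like (the field names and numeric values are assumptions for illustration, not part of the embodiments), a rotation-only shoulder joint and a lip key point with both rotation and displacement could be represented as follows:

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class KeypointAction:
    """Hypothetical per-key-point action data: spatial quantities that drive one key point."""
    name: str
    rotation_deg: Optional[Tuple[float, float, float]] = None  # rotation angles; a 2D avatar would use one angle
    displacement: Optional[Tuple[float, ...]] = None            # (dx, dy) for a 2D avatar, (dx, dy, dz) for 3D


# A shoulder joint that only rotates, and an upper-lip key point that rotates and displaces.
actions = [
    KeypointAction("left_shoulder", rotation_deg=(0.0, 0.0, 35.0)),
    KeypointAction("upper_lip_center", rotation_deg=(5.0, 0.0, 0.0), displacement=(0.0, 0.4, 0.1)),
]
```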
Step 103: and driving a plurality of key points of the virtual image to respectively generate corresponding actions according to a plurality of action data.
When the action data output by the action generating model according to the interactive instruction information of the user is obtained, the key points of the virtual image can be directly driven to generate corresponding actions based on the action data. For example, the user interaction command may be voice command information with a content of "smile", and the motion generation model may output motion data of a series of facial key points according to the voice command information, for example, the motion data of key points in the eyebrow region may make the eyebrows bend downwards, the motion data of key points in the mouth region may make the corners of the mouth rise upwards, and the avatar may make facial expression motions of the eyebrows bend downwards and the corners of the mouth rise according to the motion data of these key points.
Therefore, according to the avatar interaction method provided by this embodiment of the application, the action generation model is used to obtain a plurality of action data respectively corresponding to a plurality of key points of the avatar based on the interaction instruction information of the user. When the user interacts through the avatar, a complete image corresponding to the avatar's action does not need to be generated; the corresponding action can be produced simply by driving the plurality of key points of the avatar based on the action data. The data size of the key-point action data is small (the data dimension of the key-point action data is on the order of 10², whereas the data dimension of a directly generated image is on the order of 10⁵ or more), so the demand on device hardware can be greatly reduced. This not only helps improve the user's real-time interaction experience, but also allows low-end hardware terminals to run this avatar interaction mode, broadening the range of application terminals for avatar interaction and reducing the hardware budget requirement.
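A rough back-of-envelope check of this comparison, assuming about 50 key points with a handful of values each versus a 256 x 256 RGB frame (both figures are assumptions for illustration, not values stated in the embodiments):

```python
# Rough per-frame data-volume comparison (assumed figures, not from the patent).
num_keypoints = 50
values_per_keypoint = 6                                  # e.g. 3 rotation angles + 3 displacement components
keypoint_values = num_keypoints * values_per_keypoint    # 300 values -> order of 10^2

frame_width, frame_height, channels = 256, 256, 3
pixel_values = frame_width * frame_height * channels     # 196,608 values -> order of 10^5

print(keypoint_values, pixel_values, pixel_values / keypoint_values)  # roughly 650x fewer values per frame
```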
Fig. 2 is a schematic flowchart illustrating a training process of an action generation model in an avatar interaction method according to an embodiment of the present application. As shown in FIG. 2, the action generation model may be built based on the following training process:
step 201: a plurality of motion data samples corresponding to the plurality of key points are obtained.
In an embodiment of the present application, the action data samples may be preset, and at this time, a plurality of preset action data samples respectively corresponding to a plurality of key points of the avatar may be directly obtained. For example, the user may customize the motion data samples corresponding to the call-out motion, including motion data samples of key points of the mouth region to open the mouth and motion data samples of key points of the arm region to raise and swing the arm.
In another embodiment of the present application, the motion data samples may also be generated by a motion recognition model. For example, image data including motion content may be input to the motion recognition model, which is configured to output positions and motion trajectories of a plurality of key points from the image data, at which time the positions and motion trajectories of the plurality of key points may be recognized to generate a plurality of motion data samples corresponding to the plurality of key points, respectively.
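A minimal sketch of this sample-generation path, in which estimate_keypoints stands in for whatever motion recognition model is actually used (it is a placeholder, not a real API):

```python
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float]


def estimate_keypoints(frame) -> Dict[str, Point]:
    """Placeholder for the motion recognition model: returns key-point positions for one frame."""
    raise NotImplementedError  # e.g. an off-the-shelf pose or face-landmark detector


def frames_to_motion_samples(frames: Sequence) -> Dict[str, List[Point]]:
    """Turn per-frame key-point positions into per-key-point motion trajectories (training samples)."""
    trajectories: Dict[str, List[Point]] = {}
    for frame in frames:
        for name, position in estimate_keypoints(frame).items():
            trajectories.setdefault(name, []).append(position)
    return trajectories  # each value is one motion data sample for one key point
```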
Step 202: a plurality of interaction instruction samples corresponding to the plurality of action data samples are obtained.
In order to establish the correspondence between the action data samples of the key points and the specific interactive meanings, interactive instruction samples corresponding to the action data samples should be obtained. For example, when an action data sample corresponding to a call-in action is obtained, a voice content or a text content with a content of "hello" may be obtained as a corresponding interaction instruction sample.
Step 203: and taking a plurality of action data samples and a plurality of interactive instruction samples as a training set, and training through a deep learning process to generate an action generation model.
By establishing a large number of correspondences between interaction instruction samples and key-point action data samples as a training set, the action generation model can acquire, through a subsequent deep learning process, the capability of outputting key-point action data according to the received interaction instruction information. In an embodiment of the present application, the deep learning process is implemented based on generative models (such as GAN, Generative Adversarial Networks) and sequence models (such as LSTM, Long Short-Term Memory, or sequence-to-sequence models). The present application does not strictly limit which deep learning process is used to build the action generation model.
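As one possible, much-simplified realization of such a sequence model (a PyTorch-style sketch; the vocabulary size, key-point count and regression loss are assumptions, and a GAN-based variant would replace the regression loss with a discriminator):

```python
import torch
import torch.nn as nn

NUM_KEYPOINTS, VALUES_PER_KEYPOINT, VOCAB_SIZE = 50, 6, 5000  # assumed sizes


class ActionGenerationModel(nn.Module):
    """Maps a tokenized interaction instruction to one action vector per key point."""

    def __init__(self, embed_dim: int = 128, hidden_dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(VOCAB_SIZE, embed_dim)
        self.encoder = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, NUM_KEYPOINTS * VALUES_PER_KEYPOINT)

    def forward(self, instruction_tokens: torch.Tensor) -> torch.Tensor:
        embedded = self.embed(instruction_tokens)    # (batch, seq_len, embed_dim)
        _, (hidden, _) = self.encoder(embedded)      # hidden: (1, batch, hidden_dim)
        out = self.head(hidden[-1])                  # (batch, NUM_KEYPOINTS * VALUES_PER_KEYPOINT)
        return out.view(-1, NUM_KEYPOINTS, VALUES_PER_KEYPOINT)


model = ActionGenerationModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

# One training step on a batch of (interaction instruction sample, action data sample) pairs.
tokens = torch.randint(0, VOCAB_SIZE, (8, 12))               # tokenized instruction samples
target = torch.randn(8, NUM_KEYPOINTS, VALUES_PER_KEYPOINT)  # matching action data samples
loss = loss_fn(model(tokens), target)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```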
Fig. 3 is a flowchart illustrating an avatar interaction method according to another embodiment of the present application. As shown in fig. 3, the avatar interaction method may include the steps of:
step 301: receiving natural language information of a user, wherein the natural language information comprises voice information and/or text information based on natural language.
The natural language information is information based on natural language expression habits input by a user, and can be voice information or character information. The purpose of subsequently generating the interactive instruction information according to the natural language information is to understand the intention of the user by analyzing the natural language information and thereby generate the interactive instruction information corresponding to the intention of the user.
Step 302: and generating interactive instruction information according to the natural language information.
Here, the interactive instruction information of the user is acquired based on the natural language information input by the user. Specifically, as shown in fig. 4, the interactive instruction information may be acquired by:
step 401: and carrying out similarity calculation on the natural language information and a plurality of pre-stored standard semantic templates.
The standard semantic template may be composed of semantic component words and semantic rule words, which relate to the parts of speech of the words in the template and the grammatical relations between them. The similarity calculation process may therefore specifically be: first identifying the words, their parts of speech and the grammatical relations between them in the natural language information; then identifying the semantic component words and semantic rule words according to those parts of speech and grammatical relations; and then introducing the identified semantic component words and semantic rule words into a vector space model to calculate a plurality of similarities between the text content of the natural language information and the plurality of preset standard semantic templates. In an embodiment of the present application, the words, their parts of speech and the grammatical relations between them in the natural language information may be identified by one or more of the following word segmentation methods: the hidden Markov model method, the forward maximum matching method, the reverse maximum matching method and the named entity recognition method.
In an embodiment of the present application, as described above, the standard semantic template may be a set of multiple semantic expressions representing a certain semantic content. In that case, sentences expressing the corresponding semantic content in multiple different ways can be described by one standard semantic template, corresponding to multiple extension questions of the same standard question. Therefore, when calculating the similarity between the natural language information and the pre-stored standard semantic templates, the similarity between the text content of the natural language information and at least one extension question expanded from each of the plurality of standard semantic templates is calculated, and the standard semantic template corresponding to the extension question with the highest similarity is then used as the matched semantic template. These extension questions may be obtained from the semantic component words and/or semantic rule words and/or semantic symbols included in the standard semantic template.
Step 402: and acquiring corresponding interactive instruction information according to the standard semantic template with the highest similarity, wherein the mapping relation between the standard semantic template and the interactive instruction information is pre-established.
After the standard semantic template is determined, the user's intention is known, and the corresponding interaction instruction information can be determined directly according to the mapping relationship between the standard semantic template and the interaction instruction information. For example, if the content expressed by the user in natural language is "So it really is not to be missed!", the natural language information is matched to the standard semantic template whose content is "So good!", and the interaction instruction information corresponding to that standard semantic template is action definition instruction information whose content is nodding.
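A much-simplified sketch of steps 401 and 402, using a bag-of-words cosine similarity in place of the full word-segmentation, part-of-speech and grammar analysis described above; the templates and the template-to-instruction mapping below are hypothetical examples:

```python
import math
from collections import Counter
from typing import List


def segment(text: str) -> List[str]:
    """Placeholder for the word segmentation / parsing step described above."""
    return text.lower().split()


def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in set(a) & set(b))
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


# Hypothetical standard semantic templates (or their extension questions) and the
# pre-established mapping from each template to interaction instruction information.
templates = {"so good": "nod", "hello there": "wave"}


def match_instruction(natural_language: str) -> str:
    query = Counter(segment(natural_language))
    best = max(templates, key=lambda t: cosine_similarity(query, Counter(segment(t))))
    return templates[best]   # interaction instruction information, e.g. "nod"


print(match_instruction("So it really is not to be missed!"))  # -> "nod"
```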
Step 303: inputting the interactive instruction information into an action generating model, wherein the action generating model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the avatar according to the interactive instruction information.
Similar to step 102 of the avatar interaction method shown in fig. 1, the action generation model may output corresponding action data of key points of the head region of the avatar according to the action definition instruction information whose contents are nods.
Step 304: and driving a plurality of key points of the virtual image to respectively generate corresponding actions according to a plurality of action data.
Similar to step 103 of the avatar interaction method shown in FIG. 1, the avatar can be directly driven to perform the corresponding actions according to the action data of these key points, so that the avatar nods its head at the same time as the user expresses "So it really is not to be missed!" in natural language. Therefore, with the avatar interaction method shown in fig. 3, the user can complete avatar-based interaction simply through natural language expression. The user neither needs to deliberately input an explicit interaction instruction nor to remember a specific interaction instruction in advance; the avatar makes the corresponding action along with the user's natural language expression, which further and significantly improves the user's interaction experience.
In an embodiment of the present application, as shown in fig. 5, the natural language information input by the user includes voice information based on natural language, and the generating of the interaction instruction information according to the natural language information may specifically include the following steps:
step 501: and extracting the audio characteristic vector or the character characteristic vector of the natural language information.
The audio feature vector or the text feature vector may include at least one audio or text feature. All of the audio features or text features are represented as a vector in a vector space of at least one dimension, where each dimension corresponds to one computational characterization of an audio or text feature in that vector space. The direction and magnitude of the audio feature vector or text feature vector can be regarded as the summation of the different computational characterizations of the individual audio or text features in the vector space, and each computational characterization of each audio or text feature can be regarded as one component of the audio feature vector or text feature vector. Natural language information carrying different emotions necessarily has different audio features or text features, and the correspondence between different emotions and different audio or text features is used to recognize the emotion of the natural language information.
In an embodiment of the present application, the audio feature vector may include one or more of the following audio features: energy features, frame number of utterance features, pitch frequency features, formant features, harmonic-to-noise ratio features, and mel-frequency cepstral coefficient features. In one embodiment of the present invention, the audio features may be characterized by one or more of the following computational characterization methods: scale, mean, maximum, median, and standard deviation.
In an embodiment of the present application, the text feature vector may include one or more of the following text features: a mood word feature, a verb feature, an adjective feature, a status word feature, and the like. These word features can be obtained by means of text recognition.
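A minimal sketch of assembling such an audio feature vector, using a few of the features and statistical characterizations listed above (the librosa calls and the 16 kHz sampling rate are illustrative choices, not requirements of the embodiments):

```python
import numpy as np
import librosa


def audio_feature_vector(wav_path: str) -> np.ndarray:
    """Summarize a handful of the audio features listed above into one fixed-length vector."""
    y, sr = librosa.load(wav_path, sr=16000)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)   # Mel-frequency cepstral coefficient features
    energy = librosa.feature.rms(y=y)                     # frame energy features
    features = [
        mfcc.mean(axis=1), mfcc.std(axis=1),              # mean / standard deviation characterizations
        np.array([energy.mean(), energy.max()]),          # mean / maximum characterizations
    ]
    return np.concatenate(features)                       # e.g. 13 + 13 + 2 = 28 dimensions
```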
Step 502: and matching the audio characteristic vector with a plurality of emotion characteristic models, wherein the emotion characteristic models respectively correspond to one of the emotion classifications.
The emotion feature models can be established by pre-learning the audio feature vectors of the preset natural language information including emotion classification labels corresponding to the emotion classifications, so that the corresponding relationship between the emotion feature models and the emotion classifications is established, and each emotion feature model can correspond to one emotion classification. In an embodiment of the present invention, the plurality of emotion classifications may include: a satisfaction category, a calm category, and a fidgety category to correspond to emotional states that may occur to a user in a customer service interaction scenario. In another embodiment, the plurality of emotion classifications may include: satisfaction classification, calmness classification, fidgetiness classification, and anger classification to correspond to emotional states that may occur to customer service personnel in a customer service interaction scenario.
Step 503: and taking the emotion classification corresponding to the emotion characteristic model with the matched matching result as the emotion classification of the natural language information.
As mentioned above, since there is a corresponding relationship between the emotion feature model and the emotion classification, after the matching emotion feature model is determined according to the matching process of step 502, the emotion classification corresponding to the matching emotion feature model is the recognized emotion classification. For example, when the emotion feature models are gaussian mixture models, the matching process can be implemented by measuring likelihood probabilities between the audio feature vectors of the current natural language information and the emotion feature models, and then, the emotion classification corresponding to the emotion feature model whose likelihood probability is greater than a preset threshold and is the maximum is used as the emotion classification of the natural language information.
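One possible realization of this matching step when the emotion feature models are Gaussian mixture models, sketched with scikit-learn; the emotion labels, feature dimension, component count, training data and likelihood threshold are all assumptions for illustration:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# One Gaussian mixture model per emotion classification, fitted on labelled audio feature vectors.
# Random arrays stand in for real labelled training features here.
train = {
    "satisfied": np.random.randn(200, 28),
    "calm": np.random.randn(200, 28) + 1.0,
    "fidgety": np.random.randn(200, 28) - 1.0,
}
models = {label: GaussianMixture(n_components=4, random_state=0).fit(X) for label, X in train.items()}


def classify_emotion(feature_vector: np.ndarray, threshold: float = -60.0) -> str:
    """Pick the emotion whose model gives the highest log-likelihood above a preset threshold."""
    scores = {label: gmm.score(feature_vector.reshape(1, -1)) for label, gmm in models.items()}
    label, best = max(scores.items(), key=lambda kv: kv[1])
    return label if best > threshold else "unknown"


print(classify_emotion(np.random.randn(28)))
```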
Step 504: and acquiring corresponding interactive instruction information according to the emotion classification of the natural language information, wherein the mapping relation between the emotion classification and the interactive instruction information is pre-established.
For example, when the natural language information input by the user is a speech segment whose content is "I am so unlucky today", the emotion feature model corresponding to the fidgety classification can be matched by extracting the audio feature information of that speech segment, and the emotion feature model of the fidgety classification corresponds to action definition instruction information whose content is lowering the head. The action generation model can then output corresponding action data for key points of the avatar's head region according to this action definition instruction information, so that the avatar lowers its head at the same time as the user expresses "I am so unlucky today" in natural language.
Therefore, by the way of acquiring the interactive instruction information shown in the embodiment of fig. 5, the audio feature vector or the character feature vector of the natural language information is extracted, and the extracted audio feature vector or character feature vector is matched by using the pre-established emotion feature model, so that the real-time emotion recognition of the natural language information is realized. Therefore, the virtual image can make corresponding action according to the real-time emotion of the user, the interaction synchronism between the virtual image and the user is further increased, the user does not need to input a clear interaction instruction intentionally, the user does not need to remember a specific interaction instruction in advance, the virtual image can make the action corresponding to the emotion of the user along with the natural language expression of the user, and the user experience is further improved remarkably.
In another embodiment of the present application, feature vocabulary in the natural language information may also be recognized, and the corresponding interaction instruction information may then be obtained according to the feature vocabulary, where the mapping relationship between the feature vocabulary and the interaction instruction information is pre-established. For example, when the user inputs a speech segment whose content is "I am so unlucky today", the feature word "unlucky" can be recognized, and the action definition instruction information whose content is lowering the head can be obtained directly from this feature word.
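A minimal sketch of this vocabulary-based path, with a hypothetical pre-established mapping from feature words to interaction instruction information:

```python
from typing import Optional

# Hypothetical pre-established mapping from feature vocabulary to interaction instruction information.
vocabulary_to_instruction = {"unlucky": "lower_head", "great": "nod", "hello": "wave"}


def instruction_from_vocabulary(natural_language: str) -> Optional[str]:
    """Return the pre-established interaction instruction for the first feature word found."""
    for word, instruction in vocabulary_to_instruction.items():
        if word in natural_language.lower():
            return instruction
    return None


print(instruction_from_vocabulary("I am so unlucky today"))  # -> "lower_head"
```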
It should be understood that, although various ways of obtaining the interactive instruction information based on the natural language information are given above, in other embodiments of the present application, these ways of obtaining the interactive instruction information based on the natural language information can be freely combined, and through cooperation of various ways, the avatar can better reflect the intention and emotion of the user in real time, thereby further improving the user experience based on avatar interaction.
Fig. 6 is a schematic structural diagram of an avatar interaction apparatus according to an embodiment of the present application. As shown in fig. 6, the avatar interacting apparatus 60 includes: an acquisition module 601, an action generation model 602 and a driving module 603. The obtaining module 601 is configured to obtain interaction instruction information of a user, the action generating model 602 is configured to output a plurality of action data corresponding to a plurality of key points of the avatar according to the interaction instruction information from the obtaining module 601, and the driving module 603 is configured to drive the plurality of key points of the avatar to generate corresponding actions according to the plurality of action data.
In an embodiment of the present application, as shown in fig. 7, the avatar interacting device 60 further includes an action-generating model training module 604, including: a first acquisition unit 6041, a second acquisition unit 6042, and a training unit 6043. The first acquisition unit 6041 is configured to acquire a plurality of motion data samples corresponding to the plurality of key points, respectively; the second acquiring unit 6042 is configured to acquire a plurality of interaction instruction samples corresponding to a plurality of motion data samples; the training unit 6043 is configured to generate the motion generation model 602 by deep learning process training with a plurality of motion data samples and a plurality of interaction instruction samples as a training set.
In an embodiment of the application, the deep learning process is implemented based on a generative model and a sequence model.
In an embodiment of the present application, the first obtaining unit 6041 is further configured to: inputting the image data into a motion recognition model, wherein the motion recognition model is configured to output positions and motion trajectories of a plurality of key points according to the image data; identifying positions and motion tracks of the plurality of key points to generate a plurality of motion data samples respectively corresponding to the plurality of key points;
or, the first acquiring unit 6041 is further configured to: and acquiring a plurality of preset action data samples respectively corresponding to a plurality of key points of the virtual image.
In an embodiment of the present application, the interactive instruction information includes one or more of the following information: voice instruction information, text instruction information, action definition instruction information, and emotion definition instruction information.
In an embodiment of the present application, as shown in fig. 7, the avatar interacting device 60 further includes: a receiving module 605 configured to receive natural language information of a user, wherein the natural language information includes voice information and/or text information based on a natural language; wherein the obtaining module 601 is further configured to: and generating interactive instruction information according to the natural language information.
In an embodiment of the present application, as shown in fig. 7, the obtaining module 601 includes: a calculating unit 6011 configured to perform similarity calculation between the natural language information and a plurality of pre-stored standard semantic templates; and a third obtaining unit 6012, configured to obtain the corresponding interactive instruction information according to the standard semantic template with the highest similarity, where a mapping relationship between the standard semantic template and the interactive instruction information is pre-established.
In one embodiment of the present application, the natural language information includes natural language based speech information; as shown in fig. 7, the obtaining module 601 includes: an extraction unit 6013 configured to extract an audio feature vector of the natural language information; a matching unit 6014 configured to match the audio feature vector with a plurality of emotion feature models, where the plurality of emotion feature models respectively correspond to one of the plurality of emotion classifications; the first determining unit 6015 is configured to use the emotion classification corresponding to the emotion feature model with the matching result as the emotion classification of the natural language information; and a fourth obtaining unit 6016, configured to obtain the corresponding interactive instruction information according to the emotion classification of the natural language information, where a mapping relationship between the emotion classification and the interactive instruction information is pre-established.
In an embodiment of the present application, as shown in fig. 7, the obtaining module 601 includes: a recognition unit 6017 configured to recognize a feature vocabulary in the natural language information; and a fifth obtaining unit 6018, configured to obtain corresponding interactive instruction information according to the feature vocabulary, where a mapping relationship between the feature vocabulary and the interactive instruction information is pre-established.
In an embodiment of the application, the avatar is a human-like avatar, wherein the plurality of key points include one or more of the following combinations: body joint feature points, body bone part feature points, facial expression feature points, and mouth feature points.
Therefore, according to the avatar interaction apparatus provided by the embodiment of the application, the action generation model is adopted to obtain the plurality of action data respectively corresponding to the plurality of key points of the avatar based on the interaction instruction information of the user, so that when the user interacts with the avatar, the user does not need to generate a complete image corresponding to the action of the avatar, and the corresponding action can be generated by only driving the plurality of key points of the avatar based on the action data. The data volume of the action data of the key points is small, so that the requirement on equipment hardware can be greatly reduced, the real-time interaction experience of a user is improved, some low-configuration hardware terminals can still run the virtual image interaction mode, the application terminal range of the virtual image interaction mode is expanded, and the hardware budget requirement is reduced.
The detailed functions and operations of the respective modules in the above-described avatar interaction apparatus 60 have been described in detail in the avatar interaction method described above with reference to fig. 1 to 5, and thus, a repetitive description thereof will be omitted herein.
It should be noted that the avatar interacting device 60 according to the embodiment of the present application may be integrated into the electronic apparatus 80 as a software module and/or a hardware module, in other words, the electronic apparatus 80 may include the avatar interacting device 60. For example, the avatar interaction device 60 may be a software module in the operating system of the electronic device 80, or may be an application developed therefor; of course, the avatar interaction device 60 may also be one of many hardware modules of the electronic device 80.
In another embodiment of the present application, the avatar interacting device 60 and the electronic equipment 80 may also be separate devices (e.g., servers), and the avatar interacting device 60 may be connected to the electronic equipment 80 through a wired and/or wireless network and transmit interaction information according to an agreed data format.
Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic apparatus 80 includes: one or more processors 801 and memory 802; and computer program instructions stored in the memory 802, which when executed by the processor 801, cause the processor 801 to perform the avatar interaction method as in any of the embodiments described above.
The processor 801 may be a Central Processing Unit (CPU) or other form of processing unit having data processing capabilities and/or instruction execution capabilities, and may control other components in the electronic device to perform desired functions.
Memory 802 may include one or more computer program products, which may include various forms of computer-readable storage media, such as volatile memory and/or non-volatile memory. Volatile memory may include, for example, Random Access Memory (RAM), cache memory, and the like. Non-volatile memory may include, for example, Read Only Memory (ROM), a hard disk, flash memory, and the like. One or more computer program instructions may be stored on the computer-readable storage medium and executed by the processor 801 to implement the steps of the avatar interaction methods of the various embodiments of the present application described above and/or other desired functions. Information such as light intensity, compensation light intensity, the position of a filter, and the like may also be stored in the computer-readable storage medium.
In one example, the electronic device 80 may further include: an input device 803 and an output device 804, which are interconnected by a bus system and/or other form of connection mechanism (not shown in fig. 8).
For example, when the electronic device is a stand-alone device, the input means 803 may be a communication network connector for receiving the collected input signal from an external removable device. The input device 803 may also include, for example, a keyboard, a mouse, a microphone, and so forth.
The output device 804 may output various information to the outside, and may include, for example, a display, speakers, a printer, and a communication network and its connected remote output devices, among others.
Of course, for simplicity, only some of the components of the electronic device 80 relevant to the present application are shown in fig. 8, and components such as buses, input devices/output interfaces, and the like are omitted. In addition, the electronic device 80 may include any other suitable components depending on the particular application.
In addition to the above-described methods and apparatus, embodiments of the present application may also be a computer program product comprising computer program instructions which, when executed by a processor, cause the processor to perform the steps in the avatar interaction method of any of the above-described embodiments.
The computer program product may include program code for carrying out operations for embodiments of the present application in any combination of one or more programming languages, including an object-oriented programming language such as Java, C++ or the like and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server.
Furthermore, embodiments of the present application may also be a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the steps in the avatar interaction method of the various embodiments of the present application.
A computer-readable storage medium may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may include, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is for the purpose of illustration and description and is not intended to be limiting, since the foregoing disclosure is not intended to be exhaustive or to limit the disclosure to the precise details disclosed.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by those skilled in the art. Words such as "including", "comprising", "having", and the like are open-ended words that mean "including, but not limited to", and are used interchangeably therewith. The word "or" as used herein means, and is used interchangeably with, the term "and/or", unless the context clearly dictates otherwise. The phrase "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
The above description is only exemplary of the present application and should not be taken as limiting the present application, as any modifications, equivalents and the like that are within the spirit and principle of the present application should be included in the scope of the present application.

Claims (14)

1. An avatar interaction method, comprising:
acquiring interactive instruction information of a user;
inputting the interaction instruction information into an action generating model, wherein the action generating model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the virtual image according to the interaction instruction information; and
and driving the plurality of key points of the virtual image to respectively generate corresponding actions according to the plurality of action data.
2. The method of claim 1, wherein the action generating model is built based on a training process that:
obtaining a plurality of action data samples respectively corresponding to the plurality of key points;
obtaining a plurality of interaction instruction samples corresponding to the plurality of action data samples; and
and taking the plurality of action data samples and the plurality of interaction instruction samples as a training set, and training through a deep learning process to generate the action generation model.
3. The method of claim 1, wherein the deep learning process is implemented based on a generative model and a sequence model.
4. The method of claim 2, wherein the obtaining a plurality of motion data samples corresponding to the plurality of keypoints, respectively, comprises:
inputting image data into a motion recognition model, wherein the motion recognition model is configured to output positions and motion trajectories of the plurality of key points according to the image data; identifying positions and motion tracks of the plurality of key points to generate a plurality of motion data samples respectively corresponding to the plurality of key points; or
and acquiring a plurality of preset action data samples respectively corresponding to the plurality of key points of the virtual image.
5. The method of claim 1, wherein the interactive instruction information comprises one or more of the following: voice instruction information, character instruction information, action definition instruction information, emotion definition instruction information, sensor information, and an image recognition result.
6. The method of claim 1, further comprising:
receiving natural language information of a user, wherein the natural language information comprises voice information and/or text information based on natural language;
wherein, the acquiring of the interactive instruction information of the user comprises:
and generating the interactive instruction information according to the natural language information.
7. The method of claim 6, wherein generating the interactivity instruction information from the natural language information comprises:
carrying out similarity calculation on the natural language information and a plurality of pre-stored standard semantic templates; and
and acquiring corresponding interactive instruction information according to the standard semantic template with the highest similarity, wherein a mapping relation between the standard semantic template and the interactive instruction information is pre-established.
8. The method of claim 6, wherein the natural language information comprises natural language based speech information;
wherein the generating of the interaction instruction information according to the natural language information includes:
extracting audio characteristic vectors or character characteristic vectors of the natural language information;
matching the audio characteristic vector or the character characteristic vector with a plurality of emotion characteristic models, wherein the emotion characteristic models respectively correspond to one of a plurality of emotion classifications; and
taking the emotion classification corresponding to the matched emotion characteristic model as the emotion classification of the natural language information; and
and acquiring corresponding interactive instruction information according to the emotion classification of the natural language information, wherein the mapping relation between the emotion classification and the interactive instruction information is pre-established.
9. The method of claim 6, wherein generating the interactivity instruction information from the natural language information comprises:
identifying characteristic words in the natural language information; and
and acquiring the corresponding interactive instruction information according to the characteristic vocabulary, wherein the mapping relation between the characteristic vocabulary and the interactive instruction information is pre-established.
10. The method of claim 1, wherein the avatar is a humanoid avatar, and wherein the plurality of key points comprise one or more of the following in combination: body joint feature points, body bone part feature points, facial expression feature points, and mouth feature points.
11. An avatar interaction apparatus, comprising:
the acquisition module is configured to acquire interactive instruction information of a user;
the action generation model is configured to output a plurality of action data respectively corresponding to a plurality of key points of the virtual image according to the interactive instruction information from the acquisition module; and
and the driving module is configured to drive the plurality of key points of the virtual image to respectively generate corresponding actions according to the plurality of action data.
12. An electronic device, comprising:
a processor; and
a memory having stored therein computer program instructions which, when executed by the processor, cause the processor to perform the avatar interaction method of any of claims 1-10.
13. The electronic device of claim 12, further comprising:
a display screen for displaying the avatar.
14. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, cause the processor to perform the avatar interaction method of any of claims 1-10.
CN201910925952.7A 2019-09-27 2019-09-27 Virtual image interaction method and device Pending CN110688008A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910925952.7A CN110688008A (en) 2019-09-27 2019-09-27 Virtual image interaction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910925952.7A CN110688008A (en) 2019-09-27 2019-09-27 Virtual image interaction method and device

Publications (1)

Publication Number Publication Date
CN110688008A true CN110688008A (en) 2020-01-14

Family

ID=69110666

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910925952.7A Pending CN110688008A (en) 2019-09-27 2019-09-27 Virtual image interaction method and device

Country Status (1)

Country Link
CN (1) CN110688008A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170300204A1 (en) * 2016-04-19 2017-10-19 Beijing Pico Technology Co., Ltd. Method and apparatus for manufacturing interactive electronic manual
CN106557165A (en) * 2016-11-14 2017-04-05 北京智能管家科技有限公司 The action simulation exchange method of smart machine and device and smart machine
CN107765852A (en) * 2017-10-11 2018-03-06 北京光年无限科技有限公司 Multi-modal interaction processing method and system based on visual human
CN109936774A (en) * 2019-03-29 2019-06-25 广州虎牙信息科技有限公司 Virtual image control method, device and electronic equipment
CN110033505A (en) * 2019-04-16 2019-07-19 西安电子科技大学 A kind of human action capture based on deep learning and virtual animation producing method

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2020147794A1 (en) * 2019-01-18 2020-07-23 北京市商汤科技开发有限公司 Image processing method and apparatus, image device and storage medium
US11741629B2 (en) 2019-01-18 2023-08-29 Beijing Sensetime Technology Development Co., Ltd. Controlling display of model derived from captured image
US11538207B2 (en) 2019-01-18 2022-12-27 Beijing Sensetime Technology Development Co., Ltd. Image processing method and apparatus, image device, and storage medium
US11468612B2 (en) 2019-01-18 2022-10-11 Beijing Sensetime Technology Development Co., Ltd. Controlling display of a model based on captured images and determined information
CN111292743B (en) * 2020-01-22 2023-09-26 北京小米松果电子有限公司 Voice interaction method and device and electronic equipment
CN111292743A (en) * 2020-01-22 2020-06-16 北京松果电子有限公司 Voice interaction method and device and electronic equipment
CN111443854A (en) * 2020-03-25 2020-07-24 北京百度网讯科技有限公司 Action processing method, device and equipment based on digital person and storage medium
CN111443854B (en) * 2020-03-25 2022-01-18 北京百度网讯科技有限公司 Action processing method, device and equipment based on digital person and storage medium
WO2021196645A1 (en) * 2020-03-31 2021-10-07 北京市商汤科技开发有限公司 Method, apparatus and device for driving interactive object, and storage medium
WO2021232875A1 (en) * 2020-05-18 2021-11-25 北京搜狗科技发展有限公司 Method and apparatus for driving digital person, and electronic device
CN113689530A (en) * 2020-05-18 2021-11-23 北京搜狗科技发展有限公司 Method and device for driving digital person and electronic equipment
CN113689530B (en) * 2020-05-18 2023-10-20 北京搜狗科技发展有限公司 Method and device for driving digital person and electronic equipment
CN111638783A (en) * 2020-05-18 2020-09-08 广东小天才科技有限公司 Man-machine interaction method and electronic equipment
CN113742460B (en) * 2020-05-28 2024-03-29 华为技术有限公司 Method and device for generating virtual roles
CN113742460A (en) * 2020-05-28 2021-12-03 华为技术有限公司 Method and device for generating virtual role
CN111625098A (en) * 2020-06-01 2020-09-04 广州市大湾区虚拟现实研究院 Intelligent virtual avatar interaction method and device based on multi-channel information fusion
CN111862279A (en) * 2020-07-23 2020-10-30 中国工商银行股份有限公司 Interaction processing method and device
CN112115231A (en) * 2020-09-17 2020-12-22 中国传媒大学 Data processing method and device
CN114338573A (en) * 2020-09-30 2022-04-12 腾讯科技(深圳)有限公司 Interactive data processing method and device and computer readable storage medium
CN114338573B (en) * 2020-09-30 2023-10-20 腾讯科技(深圳)有限公司 Interactive data processing method and device and computer readable storage medium
CN112364144A (en) * 2020-11-26 2021-02-12 北京沃东天骏信息技术有限公司 Interaction method, device, equipment and computer readable medium
CN112364144B (en) * 2020-11-26 2024-03-01 北京汇钧科技有限公司 Interaction method, device, equipment and computer readable medium
CN114630135A (en) * 2020-12-11 2022-06-14 北京字跳网络技术有限公司 Live broadcast interaction method and device
CN113066497A (en) * 2021-03-18 2021-07-02 Oppo广东移动通信有限公司 Data processing method, device, system, electronic equipment and readable storage medium
WO2022227288A1 (en) * 2021-04-27 2022-11-03 广景视睿科技(深圳)有限公司 Augmented reality-based environment experience method and apparatus, electronic device, and storage medium
CN113538645A (en) * 2021-07-19 2021-10-22 北京顺天立安科技有限公司 Method and device for matching body movement and language factor of virtual image
CN113923462A (en) * 2021-09-10 2022-01-11 阿里巴巴达摩院(杭州)科技有限公司 Video generation method, live broadcast processing method, video generation device, live broadcast processing device and readable medium

Similar Documents

Publication Title
CN110688008A (en) Virtual image interaction method and device
CN110688911B (en) Video processing method, device, system, terminal equipment and storage medium
US20230316643A1 (en) Virtual role-based multimodal interaction method, apparatus and system, storage medium, and terminal
US8224652B2 (en) Speech and text driven HMM-based body animation synthesis
Lin et al. Error weighted semi-coupled hidden Markov model for audio-visual emotion recognition
JP7432556B2 (en) Methods, devices, equipment and media for man-machine interaction
CN111145322B (en) Method, apparatus, and computer-readable storage medium for driving avatar
KR101558202B1 (en) Apparatus and method for generating animation using avatar
CN112650831A (en) Virtual image generation method and device, storage medium and electronic equipment
CN114401438B (en) Video generation method and device for virtual digital person, storage medium and terminal
US20120130717A1 (en) Real-time Animation for an Expressive Avatar
KR102167760B1 (en) Sign language analysis Algorithm System using Recognition of Sign Language Motion process and motion tracking pre-trained model
KR102174922B1 (en) Interactive sign language-voice translation apparatus and voice-sign language translation apparatus reflecting user emotion and intention
CN114895817B (en) Interactive information processing method, network model training method and device
CN114357135A (en) Interaction method, interaction device, electronic equipment and storage medium
WO2023284435A1 (en) Method and apparatus for generating animation
US20230082830A1 (en) Method and apparatus for driving digital human, and electronic device
JP2015069231A (en) Character generation device and program
Wang et al. Wavenet with cross-attention for audiovisual speech recognition
CN114429767A (en) Video generation method and device, electronic equipment and storage medium
Ye et al. From audio to animated signs
JP2017182261A (en) Information processing apparatus, information processing method, and program
CN114173188B (en) Video generation method, electronic device, storage medium and digital person server
CN114898018A (en) Animation generation method and device for digital object, electronic equipment and storage medium
CN114359446A (en) Animation picture book generation method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination