CN113763532A - Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object - Google Patents

Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object

Info

Publication number: CN113763532A
Authority: CN (China)
Prior art keywords: data, joint, audio, dimensional, virtual object
Legal status: Granted; Active
Application number: CN202110416949.XA
Other languages: Chinese (zh)
Other versions: CN113763532B
Inventors: 李晶, 康頔, 暴林超
Current Assignee: Tencent Technology Shenzhen Co Ltd
Original Assignee: Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN202110416949.XA
Publication of CN113763532A; application granted; publication of CN113763532B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/005 General purpose rendering architectures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Computer Graphics (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Geometry (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The application discloses a human-computer interaction method, device, equipment, and medium based on a three-dimensional virtual object, belonging to the technical field of artificial intelligence. The method includes the following steps: acquiring voice data; performing feature encoding on the voice data with an audio encoder of a deep learning network to obtain a first audio feature; performing motion decoding on the first audio feature with a motion decoder of the deep learning network to obtain pose data of each joint of the three-dimensional virtual object, the pose data indicating the rotation angle of each joint in three-dimensional space, where the number of convolution kernels of the last convolutional layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data; and driving the three-dimensional virtual object to execute corresponding actions based on the pose data of the joints. The actions synthesized by the application are smooth, natural, and more realistic.

Description

Human-computer interaction method, device, equipment and medium based on three-dimensional virtual object
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a human-computer interaction method, apparatus, device, and medium based on a three-dimensional virtual object.
Background
In the era of Artificial Intelligence (AI), AI has been applied in many fields, one of which is the three-dimensional (3D) virtual object. By using AI technologies such as voice interaction and avatar generation, a 3D virtual object gives an entertainment character the capability of multi-modal interaction, helping industries such as media, education, exhibition, and customer service upgrade toward intelligent entertainment.
A user can interact with a 3D virtual object to meet information, emotional, or entertainment needs. In the related art, audio-driven 3D virtual objects are generally implemented based on a splicing scheme. The scheme first constructs an action library that uses audio features as keys and actions as values. When the 3D virtual object is driven, the action segments whose keys are most similar to the audio features of the input audio are first retrieved from the action library, and a splicing algorithm then stitches these segments together to form a synthesized action.
However, the actions synthesized by this scheme are stiff and rigid, and their fidelity is low. For example, when a long action is generated, some action segments in the action library may be retrieved repeatedly, and the synthesized action is clearly not realistic enough. Moreover, this type of scheme can only synthesize actions that already exist in the action library and cannot synthesize actions that do not.
Disclosure of Invention
The embodiments of the application provide a human-computer interaction method, device, equipment, and medium based on a three-dimensional virtual object. The actions synthesized by the method are smooth, natural, and more realistic. In addition, a wide variety of actions can be synthesized to drive the three-dimensional virtual object, without restriction, making the method intelligent. The technical scheme is as follows:
in one aspect, a human-computer interaction method based on a three-dimensional virtual object is provided, and the method includes:
acquiring voice data;
performing feature encoding on the voice data by using an audio encoder of a deep learning network to obtain a first audio feature; and performing motion decoding on the first audio feature by using a motion decoder of the deep learning network to obtain pose data of each joint of the three-dimensional virtual object;
wherein the pose data is indicative of a rotation angle of the respective joint in three-dimensional space; the number of convolution kernels of the last convolution layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data;
and driving the three-dimensional virtual object to execute corresponding actions based on the pose data of the joints.
In some embodiments, the three-dimensional virtual object is a three-dimensional virtual human, the method further comprising:
acquiring a two-dimensional face image, wherein the two-dimensional face image comprises a target face;
three-dimensional reconstruction is carried out on the two-dimensional face image based on the depth information of the two-dimensional face image to obtain a three-dimensional character model;
and performing image rendering on the three-dimensional character model based on the texture information of the two-dimensional face image to obtain a three-dimensional virtual human corresponding to the target face.
In another aspect, a model training method is provided, the method comprising:
acquiring training data, wherein the training data includes sample voice data and standard pose data corresponding to the sample voice data;
performing feature encoding on the sample voice data based on an audio encoder in an initial network to obtain a first sample audio feature; performing motion decoding on the first sample audio feature based on a motion decoder in the initial network to obtain predicted pose data of each joint;
acquiring predicted position coordinates of each joint based on the predicted pose data of each joint;
constructing a first loss function based on the standard pose data and the predicted pose data;
constructing a second loss function based on the standard position coordinates and the predicted position coordinates of each joint;
and performing model training based on the first loss function and the second loss function to obtain a deep learning network.
In another aspect, a human-computer interaction device based on a three-dimensional virtual object is provided, the device including:
a first acquisition module configured to acquire voice data;
the first processing module is configured to perform feature coding on the voice data based on an audio coder of a deep learning network to obtain a first audio feature;
the second processing module is configured to perform motion decoding on the first audio feature based on a motion decoder of the deep learning network to obtain pose data of each joint of the three-dimensional virtual object; wherein the pose data indicates the rotation angle of each joint in three-dimensional space; the number of convolution kernels of the last convolutional layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data;
a driving module configured to drive the three-dimensional virtual object to execute corresponding actions based on the pose data of the joints.
In some embodiments, the audio encoder includes M coding blocks connected in series, a pooling layer follows every two coding blocks, a convolutional layer follows the Mth coding block of the M coding blocks, and M is an odd number.
In some embodiments, the first processing module is configured to:
sequentially performing feature extraction and down-sampling on the voice data through the sequentially connected M coding blocks and pooling layers;
inputting the feature data output by the Mth coding block into the connected convolutional layer for dimension reduction to obtain one-dimensional feature data;
and performing linear interpolation on the one-dimensional feature data to obtain the first audio feature.
In some embodiments, the M coding blocks have the same structure, each coding block includes an N-dimensional convolution layer, a batch normalization layer, and an activation function, where N is a positive integer.
In some embodiments, the number of convolution kernels of the last convolution layer is a product of a number of joints of the three-dimensional virtual object and a dimension of the pose data.
In some embodiments, the pose data is in the form of a six-dimensional rotational representation;
the drive module configured to:
transforming the six-dimensional rotation representation data of each joint into a rotation matrix form;
and driving the three-dimensional virtual object to execute the action indicated by the first rotation matrix of each joint.
In some embodiments, the drive module is configured to:
for the ith joint, taking the six-dimensional rotation representation data of the ith joint as the first two columns of data of the first rotation matrix of the ith joint, wherein i is a positive integer;
respectively carrying out normalization processing on the first two columns of data;
performing orthogonalization processing on the first two columns of normalized data to enable the first two columns of data subjected to the orthogonalization processing to be orthogonal to each other;
and performing cross multiplication on the first two columns of data which are orthogonal to each other to obtain the third column of data of the first rotation matrix of the ith joint.
In some embodiments, the three-dimensional virtual object is a three-dimensional virtual human, and the apparatus further comprises:
the third processing module is configured to acquire a two-dimensional face image, wherein the two-dimensional face image comprises a target face; three-dimensional reconstruction is carried out on the two-dimensional face image based on the depth information of the two-dimensional face image to obtain a three-dimensional character model; and performing image rendering on the three-dimensional character model based on the texture information of the two-dimensional face image to obtain a three-dimensional virtual human corresponding to the target face.
In some embodiments, the first acquisition module is configured to take original audio as the voice data, where the feature length of the first audio feature is the same as the number of frames of the original audio; or to perform audio feature extraction on the original audio to obtain a second audio feature and take the second audio feature as the voice data, where the feature length of the first audio feature is the same as the number of frames of the second audio feature.
In another aspect, a model training apparatus is provided, the apparatus comprising:
the second acquisition module is configured to acquire training data, wherein the training data includes sample voice data and standard pose data corresponding to the sample voice data;
the third processing module is configured to perform feature encoding on the sample voice data based on an audio encoder in an initial network to obtain a first sample audio feature, and to perform motion decoding on the first sample audio feature based on a motion decoder in the initial network to obtain predicted pose data of each joint;
a third obtaining module configured to obtain predicted position coordinates of the joints based on the predicted pose data of the joints;
a training module configured to construct a first loss function based on the standard pose data and the predicted pose data, construct a second loss function based on the standard position coordinates and the predicted position coordinates of each joint, and perform model training based on the first loss function and the second loss function to obtain the deep learning network.
In some embodiments, the third obtaining module is configured to:
in response to the predicted pose data being in a six-dimensional rotational representation, transforming the predicted pose data for each joint into a rotational matrix form;
and acquiring the predicted position coordinates of each joint based on the second rotation matrix of each joint.
In some embodiments, the third obtaining module is configured to:
for the ith joint, acquiring the predicted position coordinates of the parent joint of the ith joint;
acquiring the displacement of the ith joint relative to the parent joint;
and acquiring the predicted position coordinates of the ith joint based on the second rotation matrix of the ith joint, the predicted position coordinates of the parent joint, and the displacement.
In another aspect, a computer device is provided, the device includes a processor and a memory, the memory stores at least one program code, and the at least one program code is loaded and executed by the processor to implement the above-mentioned three-dimensional virtual object-based human-computer interaction method; or, the model training method described above.
In another aspect, a computer-readable storage medium is provided, in which at least one program code is stored, and the at least one program code is loaded and executed by a processor to implement the above-mentioned three-dimensional virtual object-based human-computer interaction method; or, the model training method described above.
In another aspect, a computer program product or a computer program is provided, the computer program product or the computer program including computer program code, the computer program code being stored in a computer-readable storage medium, the computer program code being read by a processor of a computer device from the computer-readable storage medium, the computer program code being executed by the processor to cause the computer device to perform the three-dimensional virtual object based human-computer interaction method described above; or, the model training method described above.
After the voice data is obtained, the pose data of each joint can be output directly by the deep learning network, where the pose data indicates the rotation angle of each joint in three-dimensional space. In other words, the deep learning network can directly predict 3D rotation angles from the voice data, so the three-dimensional virtual object can be driven directly to execute the corresponding actions. The method accumulates no error, has high accuracy and a simple process, and the synthesized actions are smooth and natural, without blur or deformation, giving better realism and synthesis quality. In addition, the embodiments of the application can synthesize a wide variety of actions to drive the three-dimensional virtual object, without restriction, and are intelligent.
Drawings
To illustrate the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from these drawings without creative effort.
Fig. 1 is a schematic diagram of an implementation environment related to a three-dimensional virtual object-based human-computer interaction method provided by an embodiment of the application;
FIG. 2 is a schematic diagram of an implementation environment related to another human-computer interaction method based on a three-dimensional virtual object provided by an embodiment of the application;
FIG. 3 is a schematic diagram of an implementation environment related to another human-computer interaction method based on a three-dimensional virtual object provided by an embodiment of the application;
FIG. 4 is a flowchart of a human-computer interaction method based on a three-dimensional virtual object according to an embodiment of the present application;
fig. 5 is a schematic structural diagram of a deep learning network provided in an embodiment of the present application;
fig. 6 is a schematic structural diagram of an audio encoder according to an embodiment of the present application;
FIG. 7 is a flow chart of a model training method provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a model training process provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a three-dimensional virtual object provided by an embodiment of the present application;
FIG. 10 is a schematic structural diagram of a human-computer interaction device based on a three-dimensional virtual object according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a model training apparatus according to an embodiment of the present disclosure;
FIG. 12 is a schematic structural diagram of a computer device according to an embodiment of the present disclosure;
fig. 13 is a schematic structural diagram of another computer device provided in an embodiment of the present application.
Detailed Description
To make the objects, technical solutions and advantages of the present application more clear, embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
The terms "first," "second," and the like, in this application, are used for distinguishing between similar items and items that have substantially the same function or similar functionality, and it should be understood that "first," "second," and "nth" do not have any logical or temporal dependency, nor do they define a quantity or order of execution. It will be further understood that, although the following description uses the terms first, second, etc. to describe various elements, these elements should not be limited by these terms.
These terms are only used to distinguish one element from another. For example, a first element can be termed a second element, and, similarly, a second element can also be termed a first element, without departing from the scope of various examples. The first element and the second element may both be elements, and in some cases, may be separate and distinct elements.
In this application, "at least one" means one or more; for example, at least one element may be any integer number of elements greater than or equal to one, such as one element, two elements, or three elements. "At least two" means two or more; for example, at least two elements may be any integer number of elements greater than or equal to two, such as two elements or three elements.
The embodiment of the application provides a human-computer interaction scheme based on a three-dimensional virtual object, and relates to an AI technology.
Artificial intelligence is the theory, method, technology, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use that knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning, and decision making.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, including both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision technology, speech processing technology, natural language processing technology, and machine learning/deep learning.
Computer Vision (CV) technology is a science that studies how to make machines "see": it uses cameras and computers instead of human eyes to identify, track, and measure targets, and further performs image processing so that the processed image is more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision technology generally includes image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D technology, virtual reality, augmented reality, simultaneous localization and mapping, and other technologies, as well as common biometric technologies such as face recognition and fingerprint recognition.
Machine Learning (ML) is a multi-disciplinary field involving probability theory, statistics, approximation theory, convex analysis, algorithmic complexity theory, and other disciplines. It specializes in studying how computers simulate or implement human learning behaviors in order to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve performance. Machine learning is the core of artificial intelligence, the fundamental way to make computers intelligent, and it is applied in all fields of artificial intelligence. Machine learning and deep learning generally include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from demonstration.
With the research and progress of artificial intelligence technology, the artificial intelligence technology is developed and applied in a plurality of fields, such as common smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care, smart customer service, and the like.
The human-computer interaction scheme based on the three-dimensional virtual object provided by the embodiment of the application may relate to artificial intelligence technologies such as a computer vision technology, a voice technology, natural language processing, machine learning and the like, and is specifically described by the following embodiment.
Some key terms or abbreviations that may be involved in embodiments of the present application are described below.
Descendant joints: if the rotation of one joint affects another joint, the affected joint is called a descendant joint of that joint. For example, the joints of all fingers are descendant joints of the wrist.
Parent joint: if joint a is a descendant joint of joint b and joint a is directly connected to joint b, then joint b is the parent joint of joint a. For example, the elbow is the parent joint of the wrist, and the shoulder is the parent joint of the elbow.
Forward Kinematics (FK): also known as forward dynamics. In forward kinematics, influence propagates level by level from parent joints to descendant joints; "forward" means that a parent joint can affect its descendant joints, while descendant joints cannot affect the parent. A forward kinematics algorithm can calculate the position coordinates of each joint from the rotation angles of the joints.
Inverse Kinematics (IK): the inverse process of forward kinematics. An inverse kinematics algorithm can calculate the possible rotation angle of each joint from the position coordinates of the joints.
Mel spectrum: an audio feature designed around how sensitive the human ear is to sounds of different frequencies. To obtain an audio feature of suitable size, a spectrogram is often transformed into a Mel-scale spectrum through Mel-scale filter banks.
log-mel: refers to the base-10 logarithm of the value of the mel-frequency spectral feature.
End-to-end: the model maps the inputs directly to the required outputs without other intermediate processes.
The following describes an implementation environment related to a human-computer interaction method based on a three-dimensional virtual object provided by the embodiment of the present application.
Referring to fig. 1, the implementation environment includes: a training device 110 and an application device 120.
In the training phase, the training apparatus 110 is used to train a deep learning network. In the application phase, the application device 120 can use the trained deep learning network to drive the three-dimensional virtual object from audio based on deep learning. In other words, the embodiment of the application designs a deep learning network that can drive the three-dimensional virtual object from audio end to end.
Optionally, the training device 110 and the application device 120 are computer devices, for example, the computer devices may be terminals or servers. In some embodiments, the server may be an independent physical server, may also be a server cluster or a distributed system formed by a plurality of physical servers, and may also be a cloud server that provides basic cloud computing services such as a cloud service, a cloud database, cloud computing, a cloud function, cloud storage, a Network service, cloud communication, a middleware service, a domain name service, a security service, a CDN (Content Delivery Network), a big data and artificial intelligence platform, and the like. The terminal may be, but is not limited to, a smart phone, a tablet computer, a laptop computer, a desktop computer, a smart speaker, a smart watch, and the like. The terminal and the server may be directly or indirectly connected through wired or wireless communication, and the application is not limited herein.
In another embodiment, the training device 110 and the application device 120 may be the same device, or the training device 110 and the application device 120 may be different devices. Also, when the training device 110 and the application device 120 are different devices, the training device 110 and the application device 120 may be the same type of device, for example, the training device 110 and the application device 120 may both be terminals; alternatively, the training device 110 and the application device 120 may be different types of devices, for example, the training device 110 may be a server, and the application device 120 may be a terminal, etc. The application is not limited thereto.
An application scenario of the human-computer interaction method based on the three-dimensional virtual object provided by the embodiment of the present application is introduced below.
In the real world, people accompany their speech with gestures; these gestures reflect the speaker's emotional state and play a key role in conveying information. Therefore, a three-dimensional virtual object presented through the display screen of a computer device also needs to gesture while speaking, so as to appear realistic and help the user perceive the emotion of the three-dimensional virtual object. In the embodiments of the application, the three-dimensional virtual object is driven by audio to execute actions, immersing the user in a virtual world and providing a conversation experience close to an offline, face-to-face conversation.
Optionally, the scheme provided by the embodiment of the application is applicable to any scenario that needs to synthesize the motion of a three-dimensional virtual object, such as a virtual anchor, virtual commentary, virtual greeter, or virtual shopping guide.
Optionally, the three-dimensional virtual object is a 3D virtual human presented through the display screen of a computer device; it supports free face customization ("face pinching") and can also freely change the outfit of the 3D virtual human.
Example one, virtual Anchor
For media scenarios such as news broadcasting, game commentary, and TV program guidance, the 3D virtual human can act as a virtual anchor and provide corresponding services to users. Using a virtual anchor can reduce labor and production costs while creating a differentiated brand with more topicality and attention. Taking a virtual anchor presented in a live broadcast room as an example, as shown in fig. 2, the same 3D virtual human is presented on the live broadcast room interfaces of the anchor terminal and at least one viewer terminal. Optionally, the anchor user of the live broadcast room can interact with the 3D virtual human. In addition, the 3D virtual human can speak and perform corresponding actions driven by voice.
Example two, virtual teacher
For education scenarios such as online teaching and online problem solving, the 3D virtual human acts as a virtual teacher, embedded in small and medium hardware devices such as tablets or smart teaching screens, to provide students with one-to-one dedicated teaching services. The virtual teacher can reduce the labor cost of producing teaching content, effectively improve the reliability of teaching, and arouse students' interest in learning.
Example three, virtual customer service
For customer service scenarios, the 3D virtual human acts as a virtual customer service agent, embedded in a large-screen all-in-one machine or a web page, to provide question-and-answer services for users. Virtual customer service adds a three-dimensional avatar on top of intelligent voice customer service, provides timely responses, and creates a more amiable and natural customer service experience.
Example four, virtual Assistant
For intelligent assistant scenarios such as music playing, weather queries, and casual conversation, the 3D virtual human acts as a virtual assistant, embedded in devices such as Internet of Things (IoT) hardware, mobile terminal applications (APPs), or in-vehicle systems, to provide users with convenient life services. Empowered with multi-modal interaction, the voice assistant can become an all-round intelligent assistant that can speak.
Example five, virtual tour guide
For tourism scenarios such as scenic spot navigation and scenic spot inquiry, the 3D virtual human acts as a virtual tour guide, embedded in a mobile phone App or mini program, and can provide tourists with services such as scenic spot navigation and commentary. This helps a travel brand extend its influence, provide differentiated services, and create sticky ecological content.
Optionally, for scenarios such as virtual customer service, virtual assistant, and virtual tour guide, the 3D virtual human can be presented through IoT hardware, a mobile terminal APP, or an in-vehicle system. Taking a virtual tour guide as an example, as shown in fig. 3, a 3D virtual human can be displayed through a related APP installed on a mobile terminal; the 3D virtual human can give tour-guide explanations to the user and perform corresponding actions driven by voice.
Example six, Brand marketing
For brand marketing scenarios, the 3D virtual human can become a brand-new marketing tool. Bringing the virtual human to life turns consumers who used to passively receive marketing into active participants in the interaction, so that they fully experience the brand's appeal. Deep conversation and realistic, interesting interaction are memorable, leave a deeper impression on users, and can ignite topic buzz.
It should be noted that the application scenarios described above are only used to illustrate the embodiments of the present application and are not limiting. In practice, the technical scheme provided by the embodiments of the application can be applied flexibly according to actual needs.
The following description will be made of a human-computer interaction scheme based on a three-dimensional virtual object according to an embodiment of the present application.
The embodiment of the application designs a deep learning network that takes original audio or audio features as input and outputs the pose data of each joint of the three-dimensional virtual object. Optionally, the audio features may be log-mel features, Mel-Frequency Cepstral Coefficients (MFCC), chroma features, or the like, which is not limited by the application.
Optionally, the pose data may be a 6D rotation representation, a rotation matrix, a rotation vector, a quaternion, or the like, which is not limited herein.
A rotation matrix is a matrix used to represent rotation. In three-dimensional space, a rotation matrix is of size 3 × 3 and satisfies the following properties: a. the rows of the matrix are mutually orthogonal, as are the columns; b. each row and each column is a unit vector. A rotation matrix has 9 elements but only three degrees of freedom; because its rows and columns are mutually orthogonal unit vectors and it contains no scale transformation, the rotation matrix is an orthogonal matrix.
Because a rotation matrix must satisfy certain constraints, it is not well suited to optimization by deep learning algorithms, whereas the 6D rotation representation has no discontinuities and no constraints on its values, which makes it more suitable for deep learning optimization. That is, the 6D rotation representation is a continuous three-dimensional rotation representation. Here, 6D refers to 6 degrees of freedom, representing displacement (translation) with 3 degrees of freedom and spatial rotation with 3 degrees of freedom. The rotation matrix is also a continuous three-dimensional rotation representation, while the rotation vector and the quaternion are both discontinuous three-dimensional rotation representations.
A rotation vector represents rotation in the form of a three-dimensional vector. Any rotation in three-dimensional space can be represented as a rotation by a certain angle around a certain axis in three-dimensional space; that is, a rotation vector represents a rotation by a rotation axis and a rotation angle.
In addition, the rotation matrix representation is redundant (a rotation with 3 degrees of freedom is represented using 9 quantities), and Euler angles and rotation vectors are compact but have singularities. Therefore, there is another common way to represent rotation, namely the quaternion, which represents rotation in the form of complex numbers.
In some embodiments, the embodiment of the application takes the log-mel features of the original audio as the model input and the 6D rotation representation as the model output, where the 6D rotation representation data indicates the rotation angles of the joints of the three-dimensional virtual object in three-dimensional space.
In the training phase, taking log-mel features and the 6D rotation representation as an example, the embodiment of the application calculates the position coordinates of each joint of the three-dimensional virtual object using forward kinematics, and then optimizes the rotation representation and the position coordinates of each joint simultaneously with an optimization algorithm, so that the model can finally output 3D rotation angles directly. The actions synthesized by this scheme are smooth and natural, the deep learning network accumulates no error across networks, and the output of the model can be used directly to drive the three-dimensional virtual object.
In addition, the corresponding rotation matrix can be calculated very easily from the 6D rotation representation. The 6D rotation representation output by the model is taken as the first two columns of the rotation matrix; the two columns are normalized separately and made mutually orthogonal through Gram-Schmidt orthogonalization, and finally the third column of the rotation matrix is calculated through a cross product.
Forward kinematics is an algorithm that calculates the position of each joint based on the rotation representations of the joints, expressed by the following formula: P_n = P_parent(n) + R_n · S_n, where P_n is the position coordinate of the nth joint; P_parent(n) is the position coordinate of the parent joint of the nth joint; R_n is the rotation of the nth joint relative to its parent joint; and S_n is the displacement of the nth joint relative to its parent joint, which is fixed for each three-dimensional virtual object.
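Read literally, this recursion can be sketched as follows (illustrative NumPy code, not taken from the patent; the joint ordering, array layouts, and rotation-composition convention are assumptions):

```python
# Sketch only, not the patent's code: forward kinematics over a joint tree.
# Assumptions: parents[n] is the index of joint n's parent (-1 for the root),
# joints are ordered so that every parent precedes its children, offsets[n]
# is the fixed displacement S_n, and local_rot[n] is the 3x3 rotation R_n of
# joint n relative to its parent.
import numpy as np

def forward_kinematics(local_rot, offsets, parents):
    num_joints = len(parents)
    global_rot = np.zeros((num_joints, 3, 3))
    positions = np.zeros((num_joints, 3))
    for n in range(num_joints):
        p = parents[n]
        if p < 0:                                   # root joint
            global_rot[n] = local_rot[n]
            positions[n] = offsets[n]
        else:
            # Accumulate the rotation along the chain, then apply
            # P_n = P_parent(n) + R_n * S_n from the description
            # (R_n is taken here as the accumulated rotation -- an assumption,
            # since the formula does not spell out the composition order).
            global_rot[n] = global_rot[p] @ local_rot[n]
            positions[n] = positions[p] + global_rot[n] @ offsets[n]
    return positions
```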
Fig. 4 is a flowchart of a human-computer interaction method based on a three-dimensional virtual object according to an embodiment of the present application. Referring to fig. 4, in the application stage, the method flow provided by the embodiment of the present application is described by taking the input of the deep learning network as an example of the audio feature.
401. Voice data is acquired.
In some embodiments, obtaining the voice data includes, but is not limited to, the following two ways: acquiring original audio and taking the original audio as the voice data; or performing audio feature extraction on the original audio to obtain a second audio feature and taking the second audio feature as the voice data.
In the embodiment of the application, the three-dimensional virtual object may be a 3D virtual human or a 3D virtual animal, which is not limited herein. Optionally, the embodiment of the application supports face customization of the three-dimensional virtual object, so that the user's face image is fused with the three-dimensional model to obtain an avatar that closely matches the user's actual appearance. In some embodiments, in response to the three-dimensional virtual object being a three-dimensional virtual human, the method provided in the embodiments of the application further includes: acquiring a two-dimensional face image, where the two-dimensional face image contains a target face; performing three-dimensional reconstruction on the two-dimensional face image based on its depth information to obtain a three-dimensional character model; and performing image rendering on the three-dimensional character model based on the texture information of the two-dimensional face image to obtain a three-dimensional virtual human corresponding to the target face.
Optionally, the original audio is audio input by the user; the second audio feature may be a log-mel feature, an MFCC feature, a chroma feature, or the like, which is not limited by the application. In addition, before feature extraction, the original audio is typically pre-processed; pre-processing illustratively includes, but is not limited to, framing, pre-emphasis, windowing, and noise reduction. Framing divides the original audio into a plurality of audio frames, where an audio frame usually refers to a small segment of audio of fixed length. Optionally, the frame length is usually set to 10 to 30 ms (milliseconds), i.e., the playing time of an audio frame is 10 to 30 ms, so that a frame contains enough periods and the signal does not vary too drastically within it.
In some embodiments, the application performs log-mel feature extraction on the original audio to obtain log-mel features. Illustratively, a spectrogram is obtained by performing a short-time Fourier transform on the preprocessed original audio; the spectrogram is then transformed into a Mel spectrum through a Mel-scale filter bank; finally, the base-10 logarithm of the Mel spectrum values is taken to obtain the log-mel feature.
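A rough sketch of this pipeline using librosa is shown below; the sampling rate, FFT size, hop length, and the choice of 64 mel bands are illustrative assumptions (64 bands would match the 64 × framecount input mentioned later), not parameters specified in the patent.

```python
# Sketch of log-mel extraction, assuming librosa; all parameter values are
# illustrative choices, not values from the patent.
import numpy as np
import librosa

def extract_log_mel(wav_path, sr=16000, n_mels=64, n_fft=1024, hop_length=160):
    audio, _ = librosa.load(wav_path, sr=sr)
    mel = librosa.feature.melspectrogram(
        y=audio, sr=sr, n_fft=n_fft, hop_length=hop_length, n_mels=n_mels)
    log_mel = np.log10(mel + 1e-6)       # base-10 log, small epsilon for stability
    return log_mel                        # shape: (n_mels, frame_count)
```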
402. And carrying out feature coding on the voice data by using an audio coder based on the deep learning network to obtain a first audio feature.
In some embodiments, referring to fig. 5, the deep learning network includes an Audio Encoder (Audio Encoder)501 and a Motion Decoder (Motion Decoder) 502.
Optionally, the input to the audio encoder 501 is the log-mel feature of 64 × framecount.
Illustratively, in the training phase, the value of framecount is set to 64; the framecount may be of any length during the testing phase and the application phase, and the application is not limited herein.
In other embodiments, referring to fig. 6, the audio encoder 501 includes M coding blocks connected in series, where a pooling layer follows every two coding blocks and a convolutional layer follows the Mth coding block of the M coding blocks. Optionally, the M coding blocks have the same structure, and each coding block includes an N-dimensional convolutional layer, a batch normalization layer (BatchNorm), and an activation function.
Here, M is an odd number and N is a positive integer. The role of the BatchNorm layer is to transform the distribution of the input data into a standard normal distribution with mean 0 and variance 1 through normalization. Optionally, the N-dimensional convolutional layer may be a one-dimensional convolutional layer or a two-dimensional convolutional layer; the pooling layer may also be omitted; and the activation function may be a Sigmoid function, a tanh function, a ReLU function, or the like, which is not limited by the application.
Taking fig. 6 as an example, the most basic module in the audio encoder 501 is called a coding block (block). Illustratively, a block consists of a two-dimensional convolutional layer, a BatchNorm layer, and a ReLU activation function. In addition, a pooling layer is connected after every two blocks to down-sample the features. As shown in fig. 6, 7 blocks are connected in series, and the features are down-sampled after the 2nd, 4th, and 6th blocks.
In other embodiments, based on the structure of the audio encoder shown in fig. 6, performing feature encoding on the voice data with the audio encoder to obtain the first audio feature includes, but is not limited to: sequentially performing feature extraction and down-sampling on the voice data through the sequentially connected M coding blocks and pooling layers; inputting the feature data output by the Mth coding block into the connected convolutional layer for dimension reduction to obtain one-dimensional feature data; and performing linear interpolation on the one-dimensional feature data to obtain the first audio feature, where the feature length of the first audio feature is the same as the number of frames of the second audio feature. In addition, if the original audio is taken as the input, the feature length of the first audio feature is the same as the number of frames of the original audio. Optionally, the embodiment of the application uses a convolutional layer with a kernel size of 8 × 3 to turn the feature data output by the last coding block into a one-dimensional feature, and then uses linear interpolation to make the feature length the same as the number of frames of the input audio feature.
Illustratively, the linear interpolation method adopted in the embodiment of the application is bilinear interpolation. Linear interpolation is a method of determining the value of an unknown quantity between two known quantities using the straight line connecting the two known quantities; in other words, linear interpolation is an interpolation mode whose interpolation function is a first-order polynomial. Mathematically, bilinear interpolation is an extension of linear interpolation to functions of two variables on a rectangular grid: linear interpolation is first performed in one direction and then in the other direction.
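A minimal PyTorch sketch of an encoder of this shape follows: seven (2D convolution + BatchNorm + ReLU) coding blocks, a pooling layer after the 2nd, 4th, and 6th blocks, an 8 × 3 convolution that collapses the features to one dimension, and linear interpolation back to the input frame count. Channel widths, kernel sizes, and the output dimension are assumptions, not values from the patent.

```python
# Sketch only (PyTorch); channel widths and the output dimension are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_block(c_in, c_out):
    # one coding block: 2D convolution + BatchNorm + ReLU
    return nn.Sequential(
        nn.Conv2d(c_in, c_out, kernel_size=3, padding=1),
        nn.BatchNorm2d(c_out),
        nn.ReLU(inplace=True),
    )

class AudioEncoder(nn.Module):
    def __init__(self, channels=(32, 32, 64, 64, 128, 128, 256), out_dim=256):
        super().__init__()
        blocks, c_prev = [], 1
        for i, c in enumerate(channels):          # 7 coding blocks in series
            blocks.append(conv_block(c_prev, c))
            c_prev = c
            if i in (1, 3, 5):                    # pool after the 2nd, 4th, 6th block
                blocks.append(nn.MaxPool2d(2))
        self.blocks = nn.Sequential(*blocks)
        # 8 x 3 convolution collapses the frequency axis (64 -> 8 after pooling) to one dimension
        self.to_1d = nn.Conv2d(c_prev, out_dim, kernel_size=(8, 3), padding=(0, 1))

    def forward(self, log_mel):                   # log_mel: (B, 1, 64, T)
        frames = log_mel.shape[-1]
        x = self.blocks(log_mel)                  # (B, C, 8, T/8)
        x = self.to_1d(x).squeeze(2)              # (B, out_dim, T/8)
        # linear interpolation back to the input frame count
        return F.interpolate(x, size=frames, mode="linear", align_corners=False)
```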
403. The motion decoder of the deep learning network performs motion decoding on the first audio feature to obtain pose data of each joint of the three-dimensional virtual object, where the pose data indicates the rotation angle of each joint in three-dimensional space, and the number of convolution kernels of the last convolutional layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data.
The pose data may be a 6D rotation representation, a rotation matrix, a rotation vector, a quaternion, or the like, which is not limited herein. In some embodiments, the motion decoder 502 is a U-Net network composed of one-dimensional convolutions, and the motion decoder 502 decodes the input audio features into the 3D rotation angles of the action. The U-Net network is divided into a down-sampling stage and an up-sampling stage; the network structure contains only convolutional layers and pooling layers, with no fully connected layers, and the up-sampling stage and the down-sampling stage use convolution operations with the same number of layers.
In some embodiments, the number of convolution kernels of the last convolutional layer in the motion decoder 502 is the product of the number of joints of the three-dimensional virtual object and the dimensionality of the pose data. In other words, it must be ensured that the number of convolution kernels of the last convolutional layer in the motion decoder 502 equals the number of joints × the dimensionality of the pose data. If the driven three-dimensional virtual object has 55 joints in total and the pose data is in the 6D rotation representation format, the number of convolution kernels required for the last convolutional layer in the motion decoder 502 is 55 × 6 = 330.
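A sketch of a small one-dimensional-convolution U-Net-style motion decoder in this vein is given below; the depth and channel widths are assumptions, and only the last layer's kernel count (55 joints × 6 = 330) follows the example above.

```python
# Sketch only (PyTorch); depth and widths are illustrative assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_JOINTS, POSE_DIM = 55, 6   # figures from the example above

class MotionDecoder(nn.Module):
    def __init__(self, in_dim=256, width=256):
        super().__init__()
        def conv(c_in, c_out):
            return nn.Sequential(nn.Conv1d(c_in, c_out, 3, padding=1),
                                 nn.BatchNorm1d(c_out), nn.ReLU(inplace=True))
        self.down1 = conv(in_dim, width)        # down-sampling stage
        self.down2 = conv(width, width)
        self.up1 = conv(width * 2, width)       # up-sampling stage (with skip connection)
        # last convolutional layer: number of kernels = joints x pose dimension = 330
        self.out = nn.Conv1d(width, NUM_JOINTS * POSE_DIM, kernel_size=1)

    def forward(self, x):                       # x: (B, in_dim, T) audio features
        d1 = self.down1(x)
        d2 = self.down2(F.max_pool1d(d1, 2))
        u = F.interpolate(d2, size=d1.shape[-1], mode="linear", align_corners=False)
        u = self.up1(torch.cat([u, d1], dim=1)) # skip connection, as in U-Net
        return self.out(u)                      # (B, 330, T): 6D pose per joint per frame
```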
404. The three-dimensional virtual object is driven to execute corresponding actions based on the pose data of its joints.
In some embodiments, based on the pose data of the respective joints, the three-dimensional virtual object is driven to perform corresponding actions, including but not limited to: the six-dimensional rotation representation data of each joint is converted into a rotation matrix form, and the three-dimensional virtual object is driven to execute the motion indicated by the rotation matrix of each joint. For convenience of distinction, the rotation matrices in the application phase are collectively referred to as a first rotation matrix in the embodiments of the present application.
Optionally, for the ith joint, transforming the six-dimensional rotation representation data into rotation matrix form includes, but is not limited to: taking the six-dimensional rotation representation data of the ith joint as the first two columns of data of the first rotation matrix of the ith joint, where i is a positive integer and the ith joint is any joint of the three-dimensional virtual object; normalizing the first two columns of data separately; orthogonalizing the normalized first two columns of data so that they are mutually orthogonal; and computing the cross product of the two mutually orthogonal columns to obtain the third column of data of the first rotation matrix of the ith joint.
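A sketch of this conversion for a single joint is given below (NumPy; the exact ordering of the normalization and orthogonalization steps follows the common 6D-rotation recipe, and the column layout of the six values is an assumption):

```python
# Sketch only: convert 6D rotation representation data into a 3x3 rotation matrix.
import numpy as np

def rot6d_to_matrix(rot6d):
    """rot6d: 6 values, read here as the first two columns a1, a2 (an assumed layout)."""
    a1, a2 = np.asarray(rot6d[:3], float), np.asarray(rot6d[3:], float)
    b1 = a1 / np.linalg.norm(a1)            # normalize the first column
    a2 = a2 - np.dot(b1, a2) * b1           # Gram-Schmidt: make the second column orthogonal to b1
    b2 = a2 / np.linalg.norm(a2)            # normalize the second column
    b3 = np.cross(b1, b2)                   # third column via cross product
    return np.stack([b1, b2, b3], axis=1)   # columns are b1, b2, b3
```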
In other embodiments, in the training stage, the sample speech data used for training the deep learning network may be an original sample audio, or may be a sample audio feature obtained by performing audio feature extraction on the original sample audio, which is not limited herein. Taking sample voice data as sample audio as an example, referring to fig. 7, the model training process includes, but is not limited to, the following steps.
701. Training data is acquired, where the training data includes sample audio and standard pose data corresponding to the sample audio.
The training data includes a plurality of sample audios, and each sample audio corresponds to standard pose data. That is, the training data includes sample audio and the standard action corresponding to the sample audio; optionally, the standard action corresponding to the sample audio is stored in the form of rotation matrices. In other words, the standard pose data is in the form of rotation matrices.
702. And carrying out feature extraction on the sample audio to obtain a first sample audio feature.
The detailed implementation of step 702 can refer to step 401.
703. Performing feature coding on the first sample audio features based on an audio coder in the initial network to obtain second sample audio features; the feature length of the second sample audio feature is the same as the number of frames of the first sample audio feature.
The detailed implementation of step 703 may refer to step 402.
704. Motion decoding is performed on the second sample audio feature based on a motion decoder in the initial network to obtain the predicted pose data of each joint.
It should be noted that this embodiment assumes that different virtual objects contain the same total number of joints; for example, different 3D virtual humans each include 55 joints.
The detailed implementation of step 704 can refer to step 403.
705. Based on the predicted pose data of each joint, the predicted position coordinates of each joint are acquired.
In some embodiments, the predicted position coordinates of each joint are obtained based on the predicted pose data of each joint as follows, including but not limited to:
7051. in response to the predicted pose data being in the six-dimensional rotation representation form, transforming the predicted pose data of each joint into rotation matrix form; in the embodiment of the present application, the rotation matrices in the training phase are collectively referred to as second rotation matrices.
7052. And acquiring the predicted position coordinates of each joint based on the second rotation matrix of each joint.
Optionally, in the embodiment of the application, the predicted position coordinates of each joint are obtained from the second rotation matrix of each joint according to a forward kinematics algorithm, as follows:
for the ith joint, acquiring the predicted position coordinates of the parent joint of the ith joint; acquiring the displacement of the ith joint relative to the parent joint; and acquiring the predicted position coordinates of the ith joint based on the second rotation matrix of the ith joint, the predicted position coordinates of its parent joint, and the displacement.
706. A first loss function is constructed based on the standard pose data and the predicted pose data for each joint.
In an embodiment of the application, the first loss function is a loss function based on joint rotation. Illustratively, the first loss function may be expressed as:
L_rot = ||M - Dec(Enc(A))||_1
where Enc denotes the Audio Encoder, Dec denotes the Motion Decoder, M denotes the standard pose data representing the standard action, Dec(Enc(A)) denotes the predicted pose data, and A denotes the sample audio features input to the model. In addition, besides the L1 loss function used in the training process, the loss function may also be replaced by an L2 loss function or a cross-entropy loss function, etc., which is not limited herein.
707. Acquiring standard position coordinates and predicted position coordinates of each joint; and constructing a second loss function based on the standard position coordinates and the predicted position coordinates of each joint.
Optionally, in response to the standard pose data being in the form of rotation matrices, the embodiment of the application obtains the standard position coordinates of each joint from the standard pose data of each joint according to a forward kinematics algorithm: for the ith joint, the standard position coordinates of the parent joint of the ith joint are acquired; the displacement of the ith joint relative to the parent joint is acquired; and the standard position coordinates of the ith joint are acquired based on the standard rotation matrix of the ith joint, the standard position coordinates of its parent joint, and the displacement.
In an embodiment of the application, the second loss function is a loss function based on the joint position. Illustratively, the second loss function may be expressed as:
L_pos = ||FK(M) - FK(Dec(Enc(A)))||_1
where FK(M) denotes the standard position coordinates and FK(Dec(Enc(A))) denotes the predicted position coordinates.
708. Performing model training based on the first loss function and the second loss function to obtain the deep learning network, wherein the deep learning network comprises the trained audio encoder and the trained motion decoder.
Optionally, an end-to-end training mode may be adopted in the model training, and the present application is not limited herein.
Optionally, this embodiment of the present application optimizes the first loss function and the second loss function using a gradient-descent back propagation algorithm implemented with an automatic differentiation engine.
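Purely as an illustrative sketch of how steps 706 to 708 could be wired together with an automatic differentiation engine such as PyTorch (and not the patent's actual implementation): `model` is assumed to map audio features to per-joint pose data, the standard pose data `std_pose` is assumed to be in the same representation as the model output, `std_pos` holds the standard position coordinates FK(M), `fk` is a differentiable forward kinematics function, and equal weighting of the two losses is also an assumption.

```python
import torch

def training_step(model, optimizer, audio_feat, std_pose, std_pos, fk):
    """One end-to-end update combining the rotation loss L_rot and the position loss L_pos."""
    pred_pose = model(audio_feat)                       # Dec(Enc(A)): predicted pose data of each joint
    loss_rot = (std_pose - pred_pose).abs().mean()      # L_rot = ||M - Dec(Enc(A))||_1
    pred_pos = fk(pred_pose)                            # FK(Dec(Enc(A))): predicted position coordinates
    loss_pos = (std_pos - pred_pos).abs().mean()        # L_pos = ||FK(M) - FK(Dec(Enc(A)))||_1
    loss = loss_rot + loss_pos                          # equal weighting of the two losses is assumed
    optimizer.zero_grad()
    loss.backward()                                     # gradient-descent back propagation via autograd
    optimizer.step()
    return loss.item()
```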
The overall flow of the training phase is described below with reference to the schematic training process shown in fig. 8.
Taking log-mel features as the model input and the 6D rotation representation as the model output as an example, and referring to fig. 8, audio feature extraction is first performed on the original audio to obtain log-mel features; then, the log-mel features are input into the deep learning network for audio coding and motion decoding; next, the 6D rotation representation data of each joint output by the deep learning network is post-processed, that is, converted into a rotation matrix for each joint. After the rotation matrix of each joint is obtained, the position coordinates of each joint are calculated from the rotation matrices using the forward kinematics algorithm.
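As a hedged illustration of the audio feature extraction step only, log-mel features could be computed with librosa roughly as follows; the number of mel bands, the hop length, and the small constant added before taking the logarithm are assumptions, not values taken from the patent.

```python
import numpy as np
import librosa

def log_mel_features(wav_path, n_mels=80, hop_length=512):
    """Extract log-mel features from the original audio (all parameter values are assumptions)."""
    y, sr = librosa.load(wav_path, sr=None)      # load the original audio at its native sampling rate
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_mels, hop_length=hop_length)
    return np.log(mel + 1e-6).T                  # shape (frames, n_mels)
```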
The first point to be noted is that, in the above description, the model adopts a supervised training manner; beyond this, the model may take other structures, as long as the input is the audio feature and the output is the 6D rotation representation.
The second point to be explained is that, in the application process, log-mel features are still extracted from the original audio, the 6D rotation representation data output by the trained deep learning network is converted into rotation matrices, and the three-dimensional virtual image is driven based on these rotation matrices; therefore, the position coordinates of each joint do not need to be calculated using forward kinematics.
After the voice data are obtained, the posture data of each joint can be directly output by the deep learning network, where the posture data indicate the rotation angle of each joint in three-dimensional space; that is, the deep learning network can directly predict the 3D rotation angles from the voice data, and the three-dimensional virtual image can accordingly be driven directly to execute the corresponding actions. The method has no accumulative error, achieves high accuracy with a simple process, and, as shown in fig. 9, the synthesized action is smooth and natural, without blur or deformation, realistic, and with a good synthesis effect. In addition, the embodiment of the application can synthesize various actions to drive the three-dimensional virtual object, is not limited by any particular condition, and is intelligent.
Fig. 10 is a schematic structural diagram of a human-computer interaction device based on a three-dimensional virtual object according to an embodiment of the present application. Referring to fig. 10, the apparatus includes:
a first obtaining module 1001 configured to obtain voice data;
a first processing module 1002, configured to perform feature coding on the speech data based on an audio encoder of a deep learning network, so as to obtain a first audio feature;
a second processing module 1003 configured to perform motion decoding on the first audio feature based on a motion decoder of the deep learning network to obtain pose data of each joint of the three-dimensional virtual object; wherein the pose data is indicative of a rotation angle of the respective joint in three-dimensional space; the number of convolution kernels of the last convolution layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data;
a driving module 1004 configured to drive the three-dimensional virtual object to perform a corresponding action based on the pose data of the respective joints.
After the voice data are obtained, the posture data of each joint can be directly output by the deep learning network, where the posture data indicate the rotation angle of each joint in three-dimensional space; that is, the deep learning network can directly predict the 3D rotation angles from the voice data, and the three-dimensional virtual image can accordingly be driven directly to execute the corresponding actions. The method has no accumulative error, achieves high accuracy with a simple process, and, as shown in fig. 9, the synthesized action is smooth and natural, without blur or deformation, realistic, and with a good synthesis effect. In addition, the embodiment of the application can synthesize various actions to drive the three-dimensional virtual object, is not subject to any particular limitation, and is intelligent.
In some embodiments, the audio encoder includes M coding blocks connected in series, every two coding blocks are followed by a pooling layer, the Mth coding block of the M coding blocks is followed by a convolutional layer, and M is an odd number.
In some embodiments, the first processing module is configured to:
sequentially performing feature extraction and down-sampling on the voice data through the M coding blocks and the pooling layer which are sequentially connected;
inputting the characteristic data output by the Mth coding block into the connected convolution layer for dimension reduction processing to obtain one-dimensional characteristic data;
and performing linear interpolation on the one-dimensional characteristic data to obtain the first audio characteristic.
In some embodiments, the M coding blocks have the same structure, each coding block includes an N-dimensional convolution layer, a batch normalization layer, and an activation function, where N is a positive integer.
In some embodiments, the number of convolution kernels of the last convolution layer is a product of a number of joints of the three-dimensional virtual object and a dimension of the pose data.
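To make this structure concrete, a minimal PyTorch sketch of such an audio encoder and motion decoder might look as follows. The 55 joints and the 6-D rotation representation follow the examples given above; the channel widths, kernel sizes, M = 5, and the choice of max pooling and ReLU are assumptions for illustration only, not the patent's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def coding_block(in_ch, out_ch):
    """One coding block: 1-D convolution + batch normalization + activation (N = 1 assumed)."""
    return nn.Sequential(
        nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm1d(out_ch),
        nn.ReLU(),
    )

class AudioEncoder(nn.Module):
    """M serial coding blocks, a pooling layer after every two blocks, a final
    dimension-reducing convolution, then linear interpolation to the frame count."""
    def __init__(self, in_ch=80, hidden=128, m=5):
        super().__init__()
        layers, ch = [], in_ch
        for i in range(m):
            layers.append(coding_block(ch, hidden))
            ch = hidden
            if (i + 1) % 2 == 0:
                layers.append(nn.MaxPool1d(2))             # pooling layer after every two coding blocks
        self.blocks = nn.Sequential(*layers)
        self.reduce = nn.Conv1d(hidden, 1, kernel_size=1)  # dimension reduction to 1-D feature data

    def forward(self, x, num_frames):
        # x: (batch, in_ch, time) audio features, e.g. log-mel
        h = self.reduce(self.blocks(x))                    # (batch, 1, time')
        return F.interpolate(h, size=num_frames, mode="linear", align_corners=False)

class MotionDecoder(nn.Module):
    """The last convolution has joints * pose_dim kernels, e.g. 55 joints * 6 = 330."""
    def __init__(self, joints=55, pose_dim=6, hidden=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, hidden, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv1d(hidden, joints * pose_dim, kernel_size=3, padding=1),
        )

    def forward(self, feat):
        return self.net(feat)                              # (batch, joints * pose_dim, frames)
```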
In some embodiments, the pose data is in the form of a six-dimensional rotational representation;
the drive module configured to:
transforming the six-dimensional rotation representation data of each joint into a rotation matrix form;
and driving the three-dimensional virtual object to execute the action indicated by the first rotation matrix of each joint.
In some embodiments, the drive module is configured to:
for an ith joint, taking the six-dimensional rotation representation data of the ith joint as the first two columns of data of a first rotation matrix of the ith joint, wherein i is a positive integer;
respectively carrying out normalization processing on the first two columns of data;
performing orthogonalization processing on the first two columns of normalized data to enable the first two columns of data subjected to the orthogonalization processing to be orthogonal to each other;
and performing cross multiplication on the first two columns of data which are orthogonal to each other to obtain the third column of data of the first rotation matrix of the ith joint.
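Putting the four steps above together, one common way to write this transformation (a sketch only; the exact order of the normalization and orthogonalization operations in the patent may differ slightly) is:

```python
import numpy as np

def six_d_to_rotation_matrix(six_d):
    """Convert the 6-D rotation representation of one joint into its 3x3 first rotation matrix.

    six_d: array of shape (6,), interpreted as the first two columns of the matrix.
    """
    a1, a2 = six_d[:3], six_d[3:]
    b1 = a1 / np.linalg.norm(a1)            # normalize the first column
    a2 = a2 - np.dot(b1, a2) * b1           # orthogonalize the second column against the first
    b2 = a2 / np.linalg.norm(a2)            # normalize the second column
    b3 = np.cross(b1, b2)                   # cross product of the first two columns gives the third
    return np.stack([b1, b2, b3], axis=1)   # columns b1, b2, b3 form the first rotation matrix
```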
In some embodiments, the three-dimensional virtual object is a three-dimensional virtual human, and the apparatus further comprises:
the third processing module is configured to acquire a two-dimensional face image, wherein the two-dimensional face image comprises a target face; three-dimensional reconstruction is carried out on the two-dimensional face image based on the depth information of the two-dimensional face image to obtain a three-dimensional character model; and performing image rendering on the three-dimensional character model based on the texture information of the two-dimensional face image to obtain a three-dimensional virtual human corresponding to the target face.
In some embodiments, the first obtaining module is configured to: take original audio as the voice data, where the feature length of the first audio feature is the same as the number of frames of the original audio; or, perform audio feature extraction on the original audio to obtain a second audio feature and take the second audio feature as the voice data, where the feature length of the first audio feature is the same as the number of frames of the second audio feature.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
Fig. 11 is a schematic structural diagram of a model training apparatus according to an embodiment of the present application. Referring to fig. 11, the apparatus includes:
a second obtaining module 1101 configured to obtain training data, where the training data includes sample voice data and standard posture data corresponding to the sample voice data;
a third processing module 1102, configured to perform feature coding on the sample speech data based on an audio encoder in an initial network, so as to obtain a first sample audio feature; based on a motion decoder in the initial network, performing motion decoding on the first sample audio features to obtain predicted attitude data of each joint;
a third obtaining module 1103 configured to obtain predicted position coordinates of the respective joints based on the predicted attitude data of the respective joints;
a training module 1104 configured to construct a first loss function based on the standard pose data and the predicted pose data; constructing a second loss function based on the standard position coordinates and the predicted position coordinates of each joint; and carrying out model training based on the first loss function and the second loss function to obtain the deep learning network.
According to the method and the device, the deep learning network is generated in the training stage; accordingly, in the application stage, after the voice data are obtained, the posture data of each joint can be directly output by the deep learning network, where the posture data indicate the rotation angle of each joint in three-dimensional space; that is, the deep learning network can directly predict the 3D rotation angles from the voice data, and the three-dimensional virtual image can accordingly be driven directly to execute the corresponding actions. The method has no accumulative error, achieves high accuracy with a simple process, and, as shown in fig. 9, the synthesized action is smooth and natural, without blur or deformation, realistic, and with a good synthesis effect. In addition, the embodiment of the application can synthesize various actions to drive the three-dimensional virtual object, is not subject to any particular limitation, and is intelligent.
In some embodiments, the third obtaining module is configured to:
in response to the predicted pose data being in a six-dimensional rotational representation, transforming the predicted pose data for each joint into a rotational matrix form;
and acquiring the predicted position coordinates of each joint based on the second rotation matrix of each joint.
In some embodiments, the third obtaining module is configured to:
for an ith joint, acquiring the predicted position coordinates of a parent joint of the ith joint;
acquiring the displacement of the ith joint relative to the parent joint;
and acquiring the predicted position coordinates of the ith joint based on the second rotation matrix of the ith joint, the predicted position coordinates of the parent joint, and the displacement.
All the above optional technical solutions may be combined arbitrarily to form the optional embodiments of the present disclosure, and are not described herein again.
It should be noted that: in the above embodiment, when the human-computer interaction device based on the three-dimensional virtual object performs human-computer interaction and the model training device trains the model, only the division of the functional modules is used as an example, and in practical applications, the function distribution may be completed by different functional modules according to needs, that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the embodiment of the human-computer interaction device based on the three-dimensional virtual object and the embodiment of the human-computer interaction method based on the three-dimensional virtual object, and the embodiment of the model training device and the embodiment of the model training method provided by the above embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments and are not described herein again.
Fig. 12 shows a block diagram of a computer device 1200 according to an exemplary embodiment of the present application. Taking the computer device being a terminal as an example, the computer device 1200 generally includes: a processor 1201 and a memory 1202.
The processor 1201 includes one or more processing cores, such as a 4-core processor, an 8-core processor, or the like. The processor 1201 may be implemented in at least one hardware form of a DSP (Digital Signal Processing), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). The processor 1201 may also include a main processor and a coprocessor, where the main processor is a processor for Processing data in an awake state, and is also called a Central Processing Unit (CPU); a coprocessor is a low power processor for processing data in a standby state. In some embodiments, the processor 1201 may be integrated with a GPU (Graphics Processing Unit) for rendering and drawing content required to be displayed by the display screen. In some embodiments, the processor 1201 may further include an AI (Artificial Intelligence) processor for processing a computing operation related to machine learning.
Memory 1202 may include one or more computer-readable storage media, which may be non-transitory. Memory 1202 may also include high-speed random access memory, as well as non-volatile memory, such as one or more magnetic disk storage devices, flash memory storage devices. In some embodiments, a non-transitory computer readable storage medium in the memory 1202 is used for storing at least one program code for execution by the processor 1201 to implement the three-dimensional virtual object based human-machine interaction method or model training method provided by the method embodiments herein.
In some embodiments, the computer device 1200 may further optionally include: a peripheral interface 1203 and at least one peripheral. The processor 1201, memory 1202, and peripheral interface 1203 may be connected by a bus or signal line. Various peripheral devices may be connected to peripheral interface 1203 via a bus, signal line, or circuit board. Specifically, the peripheral device includes: at least one of radio frequency circuitry 1204, display 1205, camera assembly 1206, audio circuitry 1207, positioning assembly 1208, and power supply 1209.
The peripheral interface 1203 may be used to connect at least one peripheral associated with I/O (Input/Output) to the processor 1201 and the memory 1202. In some embodiments, the processor 1201, memory 1202, and peripheral interface 1203 are integrated on the same chip or circuit board; in some other embodiments, any one or two of the processor 1201, the memory 1202 and the peripheral device interface 1203 may be implemented on a separate chip or circuit board, which is not limited in this embodiment.
The Radio Frequency circuit 1204 is used for receiving and transmitting RF (Radio Frequency) signals, also called electromagnetic signals. The radio frequency circuit 1204 communicates with a communication network and other communication devices by electromagnetic signals. The radio frequency circuit 1204 converts an electric signal into an electromagnetic signal to transmit, or converts a received electromagnetic signal into an electric signal. Optionally, the radio frequency circuit 1204 comprises: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and so forth. The radio frequency circuit 1204 may communicate with other terminals through at least one wireless communication protocol. The wireless communication protocols include, but are not limited to: the world wide web, metropolitan area networks, intranets, generations of mobile communication networks (2G, 3G, 4G, and 5G), Wireless local area networks, and/or WiFi (Wireless Fidelity) networks. In some embodiments, the rf circuit 1204 may further include NFC (Near Field Communication) related circuits, which are not limited in this application.
The display screen 1205 is used to display a UI (User Interface). The UI may include graphics, text, icons, video, and any combination thereof. When the display screen 1205 is a touch display screen, the display screen 1205 also has the ability to acquire touch signals on or over the surface of the display screen 1205. The touch signal may be input to the processor 1201 as a control signal for processing. At this point, the display 1205 may also be used to provide virtual buttons and/or a virtual keyboard, also referred to as soft buttons and/or a soft keyboard. In some embodiments, the display 1205 may be one, disposed on the front panel of the computer device 1200; in other embodiments, the display 1205 may be at least two, respectively disposed on different surfaces of the computer device 1200 or in a folded design; in other embodiments, the display 1205 may be a flexible display disposed on a curved surface or on a folded surface of the computer device 1200. Even further, the display screen 1205 may be arranged in a non-rectangular irregular figure, i.e., a shaped screen. The Display panel 1205 can be made of LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other materials.
Camera assembly 1206 is used to capture images or video. Optionally, camera assembly 1206 includes a front camera and a rear camera. Generally, a front camera is disposed at a front panel of the terminal, and a rear camera is disposed at a rear surface of the terminal. In some embodiments, the number of the rear cameras is at least two, and each rear camera is any one of a main camera, a depth-of-field camera, a wide-angle camera and a telephoto camera, so that the main camera and the depth-of-field camera are fused to realize a background blurring function, and the main camera and the wide-angle camera are fused to realize panoramic shooting and VR (Virtual Reality) shooting functions or other fusion shooting functions. In some embodiments, camera assembly 1206 may also include a flash. The flash lamp can be a monochrome temperature flash lamp or a bicolor temperature flash lamp. The double-color-temperature flash lamp is a combination of a warm-light flash lamp and a cold-light flash lamp, and can be used for light compensation at different color temperatures.
The audio circuitry 1207 may include a microphone and a speaker. The microphone is used for collecting sound waves of a user and the environment, converting the sound waves into electric signals, and inputting the electric signals into the processor 1201 for processing or inputting the electric signals into the radio frequency circuit 1204 to achieve voice communication. For stereo capture or noise reduction purposes, the microphones may be multiple and located at different locations on the computer device 1200. The microphone may also be an array microphone or an omni-directional pick-up microphone. The speaker is used to convert electrical signals from the processor 1201 or the radio frequency circuit 1204 into sound waves. The loudspeaker can be a traditional film loudspeaker or a piezoelectric ceramic loudspeaker. When the speaker is a piezoelectric ceramic speaker, the speaker can be used for purposes such as converting an electric signal into a sound wave audible to a human being, or converting an electric signal into a sound wave inaudible to a human being to measure a distance. In some embodiments, the audio circuitry 1207 may also include a headphone jack.
The positioning component 1208 is used to locate the current geographic location of the computer device 1200 for navigation or LBS (Location Based Service). The positioning component 1208 may be a positioning component based on the Global Positioning System (GPS) of the United States, the BeiDou System of China, the GLONASS System of Russia, or the Galileo System of the European Union.
The power supply 1209 is used to power the various components in the computer device 1200. The power source 1209 may be alternating current, direct current, disposable or rechargeable. When the power source 1209 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. The wired rechargeable battery is a battery charged through a wired line, and the wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charge technology.
In some embodiments, the computer device 1200 also includes one or more sensors 1210. The one or more sensors 1210 include, but are not limited to: acceleration sensor 1211, gyro sensor 1212, pressure sensor 1213, fingerprint sensor 1214, optical sensor 1215, and proximity sensor 1216.
The acceleration sensor 1211 may detect magnitudes of accelerations on three coordinate axes of a coordinate system established with the computer apparatus 1200. For example, the acceleration sensor 1211 may be used to detect components of the gravitational acceleration in three coordinate axes. The processor 1201 may control the display screen 1205 to display the user interface in a landscape view or a portrait view according to the gravitational acceleration signal collected by the acceleration sensor 1211. The acceleration sensor 1211 may also be used for acquisition of motion data of a game or a user.
The gyro sensor 1212 may detect a body direction and a rotation angle of the computer device 1200, and the gyro sensor 1212 may collect a 3D motion of the user on the computer device 1200 in cooperation with the acceleration sensor 1211. The processor 1201 can implement the following functions according to the data collected by the gyro sensor 1212: motion sensing (such as changing the UI according to a user's tilting operation), image stabilization at the time of photographing, game control, and inertial navigation.
Pressure sensors 1213 may be disposed on the side bezel of computer device 1200 and/or underlying display 1205. When the pressure sensor 1213 is disposed on the side frame of the computer device 1200, the holding signal of the user to the computer device 1200 can be detected, and the processor 1201 performs left-right hand recognition or quick operation according to the holding signal acquired by the pressure sensor 1213. When the pressure sensor 1213 is disposed at a lower layer of the display screen 1205, the processor 1201 controls the operability control on the UI interface according to the pressure operation of the user on the display screen 1205. The operability control comprises at least one of a button control, a scroll bar control, an icon control and a menu control.
The fingerprint sensor 1214 is used for collecting a fingerprint of the user, and the processor 1201 identifies the user according to the fingerprint collected by the fingerprint sensor 1214, or the fingerprint sensor 1214 identifies the user according to the collected fingerprint. When the user identity is identified as a trusted identity, the processor 1201 authorizes the user to perform relevant sensitive operations, including unlocking a screen, viewing encrypted information, downloading software, paying, changing settings, and the like. The fingerprint sensor 1214 may be disposed on the front, back, or side of the computer device 1200. When a physical key or vendor Logo is provided on the computer device 1200, the fingerprint sensor 1214 may be integrated with the physical key or vendor Logo.
The optical sensor 1215 is used to collect the ambient light intensity. In one embodiment, the processor 1201 may control the display brightness of the display screen 1205 according to the ambient light intensity collected by the optical sensor 1215. Specifically, when the ambient light intensity is high, the display brightness of the display screen 1205 is increased; when the ambient light intensity is low, the display brightness of the display screen 1205 is decreased. In another embodiment, the processor 1201 may also dynamically adjust the shooting parameters of the camera assembly 1206 based on the ambient light intensity collected by the optical sensor 1215.
A proximity sensor 1216, also called a distance sensor, is generally provided on a front panel of the computer device 1200. The proximity sensor 1216 is used to collect the distance between the user and the front of the computer device 1200. In one embodiment, when the proximity sensor 1216 detects that the distance between the user and the front of the computer device 1200 gradually decreases, the processor 1201 controls the display screen 1205 to switch from the bright screen state to the dark screen state; when the proximity sensor 1216 detects that the distance between the user and the front of the computer device 1200 gradually increases, the processor 1201 controls the display screen 1205 to switch from the dark screen state to the bright screen state.
Those skilled in the art will appreciate that the configuration shown in FIG. 12 is not intended to be limiting of the computer device 1200 and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components may be used.
Fig. 13 is a schematic structural diagram of a computer device according to an embodiment of the present application. Taking the computer device being a server as an example, the server 1300 may vary greatly in configuration or performance, and may include one or more processors (CPUs) 1301 and one or more memories 1302, where the memory 1302 stores at least one program code, and the at least one program code is loaded and executed by the processor 1301 to implement the three-dimensional virtual object-based human-computer interaction method or the model training method provided by the above method embodiments. Of course, the server may also have components such as a wired or wireless network interface, a keyboard, and an input/output interface for input and output, and the server may further include other components for implementing device functions, which are not described herein again.
In an exemplary embodiment, a computer-readable storage medium, such as a memory including a program code, which is executable by a processor in a terminal to perform the three-dimensional virtual object based human-machine interaction method or the model training method in the above embodiments, is also provided. For example, the computer-readable storage medium may be a Read-Only Memory (ROM), a Random Access Memory (RAM), a Compact Disc Read-Only Memory (CD-ROM), a magnetic tape, a floppy disk, an optical data storage device, and the like.
In an exemplary embodiment, a computer program product or a computer program is also provided, the computer program product or the computer program comprising computer program code, the computer program code being stored in a computer-readable storage medium, the computer program code being read by a processor of a computer device from the computer-readable storage medium, the computer program code being executed by the processor to cause the computer device to perform the above-mentioned three-dimensional virtual object based human-machine interaction method.
In some embodiments, the computer program according to the embodiments of the present application may be deployed to be executed on one computer device or on multiple computer devices located at one site, or may be executed on multiple computer devices distributed at multiple sites and interconnected by a communication network, and the multiple computer devices distributed at the multiple sites and interconnected by the communication network may constitute a block chain system.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, where the program may be stored in a computer-readable storage medium, and the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The above description is only exemplary of the present application and should not be taken as limiting, as any modification, equivalent replacement, or improvement made within the spirit and principle of the present application should be included in the protection scope of the present application.

Claims (15)

1. A human-computer interaction method based on a three-dimensional virtual object is characterized by comprising the following steps:
acquiring voice data;
performing feature coding on the voice data based on an audio encoder of a deep learning network to obtain a first audio feature; and performing motion decoding on the first audio feature based on a motion decoder of the deep learning network to obtain pose data of each joint of the three-dimensional virtual object;
wherein the pose data is indicative of a rotation angle of the respective joint in three-dimensional space; the number of convolution kernels of the last convolution layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data;
and driving the three-dimensional virtual object to execute corresponding actions based on the pose data of each joint.
2. The method of claim 1, wherein the audio encoder comprises M coding blocks connected in series, wherein every two coding blocks are followed by a pooling layer, wherein the Mth coding block of the M coding blocks is followed by a convolutional layer, and wherein M is an odd number.
3. The method of claim 2, wherein the performing feature coding on the voice data based on the audio encoder of the deep learning network to obtain the first audio feature comprises:
sequentially performing feature extraction and down-sampling on the voice data through the M coding blocks and the pooling layer which are sequentially connected;
inputting the characteristic data output by the Mth coding block into the connected convolution layer for dimension reduction processing to obtain one-dimensional characteristic data;
and performing linear interpolation on the one-dimensional characteristic data to obtain the first audio characteristic.
4. The method of claim 2, wherein the M coding blocks have the same structure, each coding block includes an N-dimensional convolutional layer, a batch normalization layer, and an activation function, and N is a positive integer.
5. The method of claim 1, wherein the number of convolution kernels for the last convolution layer is a product of the number of joints of the three-dimensional virtual object and a dimension of the pose data.
6. The method of claim 1, wherein the pose data is in a six-dimensional rotational representation; the driving the three-dimensional virtual object to execute corresponding actions based on the posture data of each joint comprises the following steps:
transforming the six-dimensional rotation representation data of each joint into a rotation matrix form;
and driving the three-dimensional virtual object to execute the action indicated by the first rotation matrix of each joint.
7. The method of claim 6, wherein transforming the six-dimensional rotation representation data of each joint into a rotation matrix form comprises:
for an ith joint, taking the six-dimensional rotation representation data of the ith joint as the first two columns of data of a first rotation matrix of the ith joint, wherein i is a positive integer;
respectively carrying out normalization processing on the first two columns of data;
performing orthogonalization processing on the first two columns of normalized data to enable the first two columns of data subjected to the orthogonalization processing to be orthogonal to each other;
and performing cross multiplication on the first two columns of data which are orthogonal to each other to obtain the third column of data of the first rotation matrix of the ith joint.
8. The method of claim 1, wherein the obtaining voice data comprises:
taking original audio as the voice data, wherein the feature length of the first audio feature is the same as the number of frames of the original audio; or,
performing audio feature extraction on the original audio to obtain a second audio feature; using the second audio feature as the voice data; the feature length of the first audio feature is the same as the number of frames of the second audio feature.
9. A method of model training, the method comprising:
acquiring training data, wherein the training data comprises sample voice data and standard posture data corresponding to the sample voice data;
based on an audio encoder in an initial network, carrying out feature encoding on the sample voice data to obtain a first sample audio feature; based on a motion decoder in the initial network, performing motion decoding on the first sample audio features to obtain predicted attitude data of each joint;
acquiring the predicted position coordinates of each joint based on the predicted attitude data of each joint;
constructing a first loss function based on the standard attitude data and the predicted attitude data;
constructing a second loss function based on the standard position coordinates and the predicted position coordinates of each joint;
and carrying out model training based on the first loss function and the second loss function to obtain a deep learning network.
10. The method of claim 9, wherein the obtaining predicted position coordinates for the respective joints based on the predicted pose data for the respective joints comprises:
in response to the predicted pose data being in a six-dimensional rotational representation, transforming the predicted pose data for each joint into a rotational matrix form;
and acquiring the predicted position coordinates of each joint based on the second rotation matrix of each joint.
11. The method of claim 10, wherein obtaining the predicted position coordinates of each joint based on the second rotation matrix of each joint comprises:
for an ith joint, acquiring the predicted position coordinates of a parent joint of the ith joint;
acquiring the displacement of the ith joint relative to the parent joint;
and acquiring the predicted position coordinates of the ith joint based on the second rotation matrix of the ith joint, the predicted position coordinates of the parent joint, and the displacement.
12. A human-computer interaction device based on a three-dimensional virtual object, which is characterized by comprising:
a first acquisition module configured to acquire voice data;
the first processing module is configured to perform feature coding on the voice data based on an audio coder of a deep learning network to obtain a first audio feature;
the second processing module is configured to perform motion decoding on the first audio feature based on a motion decoder of the deep learning network to obtain posture data of each joint of the three-dimensional virtual object; wherein the pose data is indicative of a rotation angle of the respective joint in three-dimensional space; the number of convolution kernels of the last convolution layer in the motion decoder is related to the number of joints of the three-dimensional virtual object and the dimensionality of the pose data;
a driving module configured to drive the three-dimensional virtual object to execute a corresponding action based on the posture data of the joints.
13. A model training apparatus, the apparatus comprising:
the second acquisition module is configured to acquire training data, wherein the training data comprises sample voice data and standard posture data corresponding to the sample voice data;
the third processing module is configured to perform feature coding on the sample voice data based on an audio coder in an initial network to obtain a first sample audio feature; based on a motion decoder in the initial network, performing motion decoding on the first sample audio features to obtain predicted attitude data of each joint;
a third obtaining module configured to obtain predicted position coordinates of the respective joints based on the predicted posture data of the respective joints;
a training module configured to construct a first loss function based on the standard pose data and the predicted pose data; constructing a second loss function based on the standard position coordinates and the predicted position coordinates of each joint; and carrying out model training based on the first loss function and the second loss function to obtain a deep learning network.
14. A computer device, characterized in that it comprises a processor and a memory, in which at least one program code is stored, which is loaded and executed by the processor to implement the method of human-computer interaction based on three-dimensional virtual objects according to any one of claims 1 to 8; or, the model training method of any one of claims 9 to 11.
15. A computer-readable storage medium, wherein at least one program code is stored in the storage medium, and the at least one program code is loaded and executed by a processor to implement the three-dimensional virtual object-based human-computer interaction method according to any one of claims 1 to 8; or, the model training method of any one of claims 9 to 11.
CN202110416949.XA 2021-04-19 2021-04-19 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object Active CN113763532B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110416949.XA CN113763532B (en) 2021-04-19 2021-04-19 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110416949.XA CN113763532B (en) 2021-04-19 2021-04-19 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object

Publications (2)

Publication Number Publication Date
CN113763532A true CN113763532A (en) 2021-12-07
CN113763532B CN113763532B (en) 2024-01-19

Family

ID=78787016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110416949.XA Active CN113763532B (en) 2021-04-19 2021-04-19 Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object

Country Status (1)

Country Link
CN (1) CN113763532B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114333069A (en) * 2022-03-03 2022-04-12 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium
CN114741561A (en) * 2022-02-28 2022-07-12 商汤国际私人有限公司 Action generating method, device, electronic equipment and storage medium
CN116570921A (en) * 2023-07-13 2023-08-11 腾讯科技(深圳)有限公司 Gesture control method and device for virtual object, computer equipment and storage medium
WO2024046473A1 (en) * 2022-09-02 2024-03-07 华为技术有限公司 Data processing method and apparatus

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180063538A1 (en) * 2016-08-26 2018-03-01 Goodrich Corporation Systems and methods for compressing data
CN107910060A (en) * 2017-11-30 2018-04-13 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
CN111047548A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Attitude transformation data processing method and device, computer equipment and storage medium
CN111080752A (en) * 2019-12-13 2020-04-28 北京达佳互联信息技术有限公司 Action sequence generation method and device based on audio and electronic equipment
US20200160535A1 (en) * 2018-11-15 2020-05-21 Qualcomm Incorporated Predicting subject body poses and subject movement intent using probabilistic generative models
WO2020140832A1 (en) * 2019-01-04 2020-07-09 北京达佳互联信息技术有限公司 Three-dimensional facial reconstruction method and apparatus, and electronic device and storage medium
CN111598998A (en) * 2020-05-13 2020-08-28 腾讯科技(深圳)有限公司 Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
CN111724458A (en) * 2020-05-09 2020-09-29 天津大学 Voice-driven three-dimensional human face animation generation method and network structure
CN111738220A (en) * 2020-07-27 2020-10-02 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN111939558A (en) * 2020-08-19 2020-11-17 北京中科深智科技有限公司 Method and system for driving virtual character action by real-time voice
CN111986295A (en) * 2020-08-14 2020-11-24 腾讯科技(深圳)有限公司 Dance synthesis method and device and electronic equipment
CN112037312A (en) * 2020-11-04 2020-12-04 成都市谛视科技有限公司 Real-time human body posture inverse kinematics solving method and device
US10911775B1 (en) * 2020-03-11 2021-02-02 Fuji Xerox Co., Ltd. System and method for vision-based joint action and pose motion forecasting
CN112330779A (en) * 2020-11-04 2021-02-05 北京慧夜科技有限公司 Method and system for generating dance animation of character model

Patent Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180063538A1 (en) * 2016-08-26 2018-03-01 Goodrich Corporation Systems and methods for compressing data
CN107910060A (en) * 2017-11-30 2018-04-13 百度在线网络技术(北京)有限公司 Method and apparatus for generating information
US20200160535A1 (en) * 2018-11-15 2020-05-21 Qualcomm Incorporated Predicting subject body poses and subject movement intent using probabilistic generative models
WO2020140832A1 (en) * 2019-01-04 2020-07-09 北京达佳互联信息技术有限公司 Three-dimensional facial reconstruction method and apparatus, and electronic device and storage medium
CN111080752A (en) * 2019-12-13 2020-04-28 北京达佳互联信息技术有限公司 Action sequence generation method and device based on audio and electronic equipment
US10911775B1 (en) * 2020-03-11 2021-02-02 Fuji Xerox Co., Ltd. System and method for vision-based joint action and pose motion forecasting
CN111047548A (en) * 2020-03-12 2020-04-21 腾讯科技(深圳)有限公司 Attitude transformation data processing method and device, computer equipment and storage medium
CN111724458A (en) * 2020-05-09 2020-09-29 天津大学 Voice-driven three-dimensional human face animation generation method and network structure
CN111598998A (en) * 2020-05-13 2020-08-28 腾讯科技(深圳)有限公司 Three-dimensional virtual model reconstruction method and device, computer equipment and storage medium
CN111738220A (en) * 2020-07-27 2020-10-02 腾讯科技(深圳)有限公司 Three-dimensional human body posture estimation method, device, equipment and medium
CN111986295A (en) * 2020-08-14 2020-11-24 腾讯科技(深圳)有限公司 Dance synthesis method and device and electronic equipment
CN111939558A (en) * 2020-08-19 2020-11-17 北京中科深智科技有限公司 Method and system for driving virtual character action by real-time voice
CN112037312A (en) * 2020-11-04 2020-12-04 成都市谛视科技有限公司 Real-time human body posture inverse kinematics solving method and device
CN112330779A (en) * 2020-11-04 2021-02-05 北京慧夜科技有限公司 Method and system for generating dance animation of character model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
KATIA LUPINETTI et al.: "Exploring the Benefits of the Virtual Reality Technologies for Assembly Retrieval Applications", 《AUGMENTED REALITY, VIRTUAL REALITY, AND COMPUTER GRAPHICS》 *
沈栎 et al.: "End-to-end markerless human pose estimation network with high-dimensional information encoding-decoding and feature supervision", 《电子学报》 (Acta Electronica Sinica), no. 08 *
郭境熙 et al.: "Exploration of film and television sound post-production techniques based on image recognition and neural network technology", 《现代电影技术》, no. 08 *
魏玮; 赵露; 刘依: "Face pose classification method based on transfer learning", 测控技术, no. 02 *

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114741561A (en) * 2022-02-28 2022-07-12 商汤国际私人有限公司 Action generating method, device, electronic equipment and storage medium
CN114333069A (en) * 2022-03-03 2022-04-12 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium
CN114333069B (en) * 2022-03-03 2022-05-17 腾讯科技(深圳)有限公司 Object posture processing method, device, equipment and storage medium
WO2024046473A1 (en) * 2022-09-02 2024-03-07 华为技术有限公司 Data processing method and apparatus
CN116570921A (en) * 2023-07-13 2023-08-11 腾讯科技(深圳)有限公司 Gesture control method and device for virtual object, computer equipment and storage medium
CN116570921B (en) * 2023-07-13 2023-09-22 腾讯科技(深圳)有限公司 Gesture control method and device for virtual object, computer equipment and storage medium

Also Published As

Publication number Publication date
CN113763532B (en) 2024-01-19

Similar Documents

Publication Publication Date Title
WO2020233464A1 (en) Model training method and apparatus, storage medium, and device
WO2020224479A1 (en) Method and apparatus for acquiring positions of target, and computer device and storage medium
CN113763532B (en) Man-machine interaction method, device, equipment and medium based on three-dimensional virtual object
CN113750523A (en) Motion generation method, device, equipment and storage medium for three-dimensional virtual object
KR20150126938A (en) System and method for augmented and virtual reality
CN111680123B (en) Training method and device for dialogue model, computer equipment and storage medium
CN111327772B (en) Method, device, equipment and storage medium for automatic voice response processing
CN111091166A (en) Image processing model training method, image processing device, and storage medium
CN111243668A (en) Method and device for detecting molecule binding site, electronic device and storage medium
CN112990053B (en) Image processing method, device, equipment and storage medium
CN111860485A (en) Training method of image recognition model, and image recognition method, device and equipment
CN113705302A (en) Training method and device for image generation model, computer equipment and storage medium
CN112711335B (en) Virtual environment picture display method, device, equipment and storage medium
CN114283050A (en) Image processing method, device, equipment and storage medium
CN112115900B (en) Image processing method, device, equipment and storage medium
CN111581958A (en) Conversation state determining method and device, computer equipment and storage medium
CN110555102A (en) media title recognition method, device and storage medium
CN113763931B (en) Waveform feature extraction method, waveform feature extraction device, computer equipment and storage medium
CN113821658A (en) Method, device and equipment for training encoder and storage medium
CN115578494B (en) Method, device and equipment for generating intermediate frame and storage medium
CN112527104A (en) Method, device and equipment for determining parameters and storage medium
CN115168643B (en) Audio processing method, device, equipment and computer readable storage medium
CN110990549A (en) Method and device for obtaining answers, electronic equipment and storage medium
CN116863042A (en) Motion generation method of virtual object and training method of motion generation model
CN114328815A (en) Text mapping model processing method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant