CN116993868A - Animation generation method, device, electronic equipment and storage medium - Google Patents

Animation generation method, device, electronic equipment and storage medium

Info

Publication number
CN116993868A
Authority
CN
China
Prior art keywords
target
animation
voice
voice file
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210592412.3A
Other languages
Chinese (zh)
Inventor
施展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202210592412.3A priority Critical patent/CN116993868A/en
Publication of CN116993868A publication Critical patent/CN116993868A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/205 3D [Three Dimensional] animation driven by audio data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0481 Interaction techniques based on graphical user interfaces [GUI] based on specific properties of the displayed interaction object or a metaphor-based environment, e.g. interaction with desktop elements like windows or icons, or assisted by a cursor's changing behaviour or appearance
    • G06F3/0482 Interaction with lists of selectable items, e.g. menus
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F3/00 Input arrangements for transferring data to be processed into a form capable of being handled by the computer; Output arrangements for transferring data from processing unit to output unit, e.g. interface arrangements
    • G06F3/01 Input arrangements or combined input and output arrangements for interaction between user and computer
    • G06F3/048 Interaction techniques based on graphical user interfaces [GUI]
    • G06F3/0484 Interaction techniques based on graphical user interfaces [GUI] for the control of specific functions or operations, e.g. selecting or manipulating an object, an image or a displayed text element, setting a parameter value or selecting a range
    • G06F3/04842 Selection of displayed objects or displayed text elements
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T13/00 Animation
    • G06T13/20 3D [Three Dimensional] animation
    • G06T13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Abstract

The application provides an animation generation method, an animation generation device, electronic equipment and a storage medium, which are applied to a game engine, wherein the method comprises the following steps: displaying a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of a game engine; acquiring a target voice file based on the voice acquisition function item; in response to a bone model selection operation triggered based on the bone selection function item, determining the selected bone model as a target bone model for generating an animation; and generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model, wherein the target animation is used for displaying the process of outputting voice contents in the target voice file by the virtual object corresponding to the target skeleton model. The animation generation efficiency can be improved through the application.

Description

Animation generation method, device, electronic equipment and storage medium
Technical Field
The present application relates to the field of internet technologies, and in particular, to an animation generating method, an animation generating device, an electronic device, and a computer readable storage medium.
Background
With the rapid development of computer technology, more and more attention is being paid to the production quality of electronic games, videos, and the like, and the synthesis of highly realistic three-dimensional anthropomorphic expression animation is one of the important technical goals in the related fields. However, in related technical solutions for producing animated expressions, an art producer is required to first make the mouth-shape animation corresponding to the voice in DCC (Digital Content Creation) software and export the related data, and then import the related data into a real-time rendering engine (such as Unreal Engine), so that the three-dimensional anthropomorphic expression animation is generated by the corresponding engine. This process is long and reduces animation production efficiency.
Disclosure of Invention
The embodiment of the application provides an animation generation method, an animation generation device, electronic equipment, a computer readable storage medium and a computer program product, which can improve the animation generation efficiency.
The technical scheme of the embodiment of the application is realized as follows:
the embodiment of the application provides an animation generation method, which is applied to a game engine and comprises the following steps:
displaying a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of a game engine;
Acquiring a target voice file based on the voice acquisition function item;
in response to a bone model selection operation triggered based on the bone selection function item, determining the selected bone model as a target bone model for generating an animation;
and generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model, wherein the target animation is used for displaying a process of outputting voice contents in the target voice file by a virtual object corresponding to the target skeleton model.
An embodiment of the present application provides an animation generating apparatus, including:
The display module is used for displaying a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of the game engine;
the first selection module is used for acquiring a target voice file based on the voice acquisition function item;
a second selection module for determining the selected bone model as a target bone model for generating an animation in response to a bone model selection operation triggered based on the bone selection function item;
and the generating module is used for responding to an animation generating instruction triggered based on the target voice file and the target skeleton model to generate a target animation, wherein the target animation is used for displaying the process of outputting voice contents in the target voice file by a virtual object corresponding to the target skeleton model.
In the above scheme, when the voice acquisition function item is a selection control for selecting a voice file, the first selection module is further configured to respond to a triggering operation for the selection control to display at least one candidate voice file for selection; and responding to the selection operation of a target candidate voice file in the at least one candidate voice file, and taking the acquired target candidate voice file as the target voice file.
In the above scheme, the first selection module is further configured to display a playing progress bar of the target candidate voice file and a corresponding content interception control in response to a selection operation for the target candidate voice file in the at least one candidate voice file; based on the playing progress bar of the target candidate voice file and the corresponding content interception control, responding to voice interception operation for the target candidate voice file, and determining an intercepted audio fragment; and responding to a determining instruction triggered based on the audio fragment, and acquiring the audio fragment as the target voice file.
In the above scheme, the first selection module is further configured to, when the voice acquisition function item is a recording control for recording a voice file, respond to a triggering operation for the recording control, and display a voice recording interface; and receiving a voice file recorded based on the voice recording interface, and taking the recorded voice file as the target voice file.
In the above scheme, the device further comprises an operation module and a play module, wherein the operation module is used for displaying an animation generation function item for making the target animation and a picture of a virtual scene in an operation interface of a game engine; responsive to a trigger operation for the animation generation function item, presenting the animation generation interface; and the playing module is used for playing the target animation in the picture of the virtual scene.
In the above aspect, the second selection module is further configured to present a bone selection interface in response to a triggering operation for the bone selection function item, and display at least one bone model and model introduction information of each bone model in the bone selection interface; based on the bone model and the model introduction information presented in the bone selection interface, the selected bone model is determined as a target bone model in response to a bone model selection operation.
In the above aspect, the animation generation interface further displays a frame rate selection function item for animation frame rate selection, and the apparatus further includes a third selection module for determining the selected animation frame rate as a target animation frame rate for generating an animation in response to an animation frame rate selection operation triggered based on the frame rate selection function item; the generating module is further used for responding to an animation generating instruction triggered based on the target voice file, the target skeleton model and the target animation frame rate to generate a target animation.
In the above scheme, the generating module is further configured to send a control data generating request carrying the target voice file to a server in response to an animation generating instruction triggered based on the target voice file and the target skeleton model; the control data generation request is used for generating control data corresponding to voice data of the target voice file based on the target voice file by the server; wherein the control data is data for controlling each part in the bone model; and receiving the control data returned by the server, and generating a target animation corresponding to the target skeleton model based on the control data.
In the above scheme, the generating module is further configured to respond to an animation generating instruction triggered based on the target voice file and the target skeleton model, and perform data mapping on voice data of the target voice file through a voice driving model to obtain control data corresponding to the target voice file, where the control data is data for controlling each part in the skeleton model; and generating a target animation corresponding to the target skeleton model based on the control data.
In the above scheme, the generating module is further configured to perform feature extraction on the voice data of the target voice file through the feature extraction layer of the voice driving model, so as to obtain audio features of the target voice file at each time point; and carrying out data mapping on the audio characteristics of the target voice file at each time point through a data mapping layer of the voice driving model to obtain control data of the target voice file at each time point.
In the above scheme, the device further comprises a training module, wherein the training module is used for acquiring an initial voice driving model and a training voice file carrying a label, and the label is used for indicating real control data corresponding to the training voice file; performing data mapping on the voice data of the training voice file through the initial voice driving model to obtain prediction control data corresponding to the training voice file; and acquiring the difference between the real control data and the predicted control data, and updating the model parameters of the initial voice driving model based on the difference.
In the above scheme, the training module is further configured to obtain target text data, and record audio data and expression data when the target object reads the target text data; performing data conversion on the expression data to obtain control data corresponding to the expression data; and taking the file corresponding to the audio data as the training voice file, and taking the control data corresponding to the expression data as the label of the file corresponding to the audio data.
An embodiment of the present application provides an electronic device, including:
a memory for storing executable instructions;
and the processor is used for realizing the animation generation method provided by the embodiment of the application when executing the executable instructions stored in the memory.
The embodiment of the application provides a computer readable storage medium which stores executable instructions for causing a processor to execute the animation generation method provided by the embodiment of the application.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the animation generation method provided by the embodiment of the application.
The embodiment of the application has the following beneficial effects:
A target voice file and a target skeleton model are determined through the voice acquisition function item and the skeleton selection function item presented in the animation generation interface of the game engine, so as to generate a target animation for showing the process in which the virtual object corresponding to the target skeleton model outputs the voice content in the target voice file. In this way, the target animation is generated directly in the game engine from the target voice file, which improves the efficiency of generating animations in the game engine.
Drawings
FIG. 1 is a schematic diagram of an architecture of an animation generation system 100 provided in an embodiment of the present application;
FIG. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application;
FIG. 3 is a flow chart of an animation generation method according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an animation generation interface of a game engine provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of an operational interface of a game engine provided by an embodiment of the present application;
FIG. 6 is a schematic illustration of a bone selection interface provided by an embodiment of the present application;
FIG. 7 is a schematic diagram of an animation generation interface provided by an embodiment of the present application;
FIG. 8 is a schematic diagram of a frame rate selection interface provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of rendering a target animation provided by an embodiment of the present application;
FIG. 10 is a flow chart of a process for generating a target animation according to an embodiment of the present application;
FIG. 11 is a flowchart of a process for generating a target animation according to an embodiment of the present application;
FIG. 12 is a flow chart of a training process of a speech driven model according to an embodiment of the present application;
FIG. 13 is a flow chart of an animation generation method according to an embodiment of the present application;
FIG. 14 is a schematic diagram of a generated animation file provided by an embodiment of the present application;
FIG. 15 is a schematic diagram of an animation generation method provided by an embodiment of the present application;
FIG. 16 is a schematic diagram of a part of a controller list provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions, and advantages of the present application more apparent, the present application will be further described in detail below with reference to the accompanying drawings. The described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort fall within the scope of protection of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", "third" and the like are merely used to distinguish similar objects and do not represent a specific ordering of the objects, it being understood that the "first", "second", "third" may be interchanged with a specific order or sequence, as permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Before describing embodiments of the present application in further detail, the terms and terminology involved in the embodiments of the present application will be described, and the terms and terminology involved in the embodiments of the present application will be used in the following explanation.
1) Client (Client): also called a user terminal, a program that corresponds to a server and provides local services for a user. Apart from applications that can only run locally, such a program is generally installed on an ordinary client device and needs to cooperate with a server to run; that is, a corresponding server and service programs in the network are required to provide the corresponding services, so a specific communication connection needs to be established between the client and the server to ensure normal operation of the application program, for example, a virtual scene client (such as a game client).
2) Game engine: the core component of some compiled, editable computer game systems or interactive real-time graphics applications. These systems provide game designers with the various tools required to write games, so that a game designer can easily and quickly produce a game program without starting from scratch. Most game engines support a variety of operating platforms, such as Linux, Mac OS X, and Microsoft Windows. A game engine typically comprises the following systems: a rendering engine (i.e., renderer, including two-dimensional and three-dimensional image engines), a physics engine, a collision detection system, sound effects, a script engine, computer animation, artificial intelligence, a network engine, and scene management.
3) MFCC (Mel-Frequency Cepstral Coefficients): a spectral feature used to extract characteristics of voice data and reduce the computation dimension. The Mel frequency is proposed based on the auditory characteristics of the human ear and has a nonlinear correspondence with the Hertz frequency; Mel-frequency cepstral coefficients are the Hertz spectral features calculated by using this correspondence between Mel frequency and Hertz frequency (an illustrative extraction sketch is given after this terminology list).
4) Unreal Engine (UE): a game engine tool for real-time technology development, currently widely used in 3D modeling, rendering, and game development; it can be used to develop mobile, PC, and console games.
5) Anim Sequence, animation data format in UE.
6) Controller: the element used to operate the animation in a facial rig (face binding); an animator can manipulate controllers to produce facial animation.
7) DCC (Digital Content Creation): a generic term for software used to produce digital content. Typical software today includes Maya, Blender, Houdini, etc.
8) Virtual scene: a scene output by a device that is different from the real world. Visual perception of the virtual scene can be formed with the naked eye or with the assistance of a device, for example, a two-dimensional image output by a display screen, or a three-dimensional image output by three-dimensional display technologies such as stereoscopic projection, virtual reality, and augmented reality; in addition, various perceptions simulating the real world, such as auditory, tactile, olfactory, and motion perceptions, can also be formed by various possible hardware.
9) "In response to": used to indicate the condition or state upon which a performed operation depends. When the condition or state is satisfied, the operation or operations may be performed in real time or with a set delay; unless specifically stated, there is no limitation on the execution order of the multiple operations performed.
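As an illustration of term 3) above, the following Python sketch extracts MFCC features from a voice file before they are fed to a voice driving model; librosa is an assumed third-party library, and the parameter values are illustrative rather than prescribed by this application.

    import librosa

    def extract_mfcc(wav_path, sample_rate=16000, n_mfcc=13):
        # Load the speech waveform and resample it to a fixed rate.
        waveform, sr = librosa.load(wav_path, sr=sample_rate)
        # Compute Mel-frequency cepstral coefficients per frame; shape is (n_mfcc, frames).
        mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)
        return mfcc.T  # one feature vector per time point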
Referring to fig. 1, fig. 1 is a schematic architecture diagram of an animation generation system 100 according to an embodiment of the present application, in order to support an exemplary application, an animation generation client 401 (i.e. a game engine) is disposed on a terminal 400, and the terminal 400 is connected to a server 200 through a network 300, where the network 300 may be a wide area network or a local area network, or a combination of the two, and uses a wireless or wired link to implement data transmission.
Wherein, the terminal 400 is used for displaying a voice acquisition function item for voice file selection and a bone selection function item for bone model selection in an animation generation interface of the game engine; acquiring a target voice file based on the voice acquisition function item; in response to a bone model selection operation triggered based on the bone selection function item, determining the selected bone model as a target bone model for generating an animation; and generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model, wherein the target animation is used for displaying the process of outputting voice contents in the target voice file by the virtual object corresponding to the target skeleton model.
In some embodiments, the server 200 may be a stand-alone physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN, Content Delivery Network), big data, and artificial intelligence platforms. The terminal 400 may be, but is not limited to, a smart phone, a tablet computer, a notebook computer, a desktop computer, a set-top box, a smart voice interaction device, a smart home appliance, a vehicle-mounted terminal, an aircraft, or a mobile device (e.g., a mobile phone, a portable music player, a personal digital assistant, a dedicated messaging device, a portable game device, a smart speaker, or a smart watch), etc. The terminal device and the server may be directly or indirectly connected through wired or wireless communication, which is not limited in the embodiment of the present application.
Next, an electronic device implementing the animation generation method provided by the embodiment of the present application will be described. Referring to fig. 2, fig. 2 is a schematic structural diagram of an electronic device according to an embodiment of the present application, where the electronic device is taken as an example of a terminal shown in fig. 1, and the electronic device shown in fig. 2 includes: at least one processor 410, a memory 450, at least one network interface 420, and a user interface 430. The various components in terminal 400 are coupled together by a bus system 440. It is understood that the bus system 440 is used to enable connected communication between these components. The bus system 440 includes a power bus, a control bus, and a status signal bus in addition to the data bus. But for clarity of illustration the various buses are labeled in fig. 2 as bus system 440.
The processor 410 may be an integrated circuit chip having signal processing capabilities, such as a general-purpose processor (e.g., a microprocessor or any conventional processor), a digital signal processor (DSP, Digital Signal Processor), another programmable logic device, discrete gate or transistor logic, or discrete hardware components.
The user interface 430 includes one or more output devices 431, including one or more speakers and/or one or more visual displays, that enable presentation of the media content. The user interface 430 also includes one or more input devices 432, including user interface components that facilitate user input, such as a keyboard, mouse, microphone, touch screen display, camera, other input buttons and controls.
Memory 450 may be removable, non-removable, or a combination thereof. Exemplary hardware devices include solid state memory, hard drives, optical drives, and the like. Memory 450 optionally includes one or more storage devices physically remote from processor 410.
Memory 450 includes volatile memory or nonvolatile memory, and may also include both volatile and nonvolatile memory. The nonvolatile memory may be a Read Only Memory (ROM), and the volatile memory may be a Random Access Memory (RAM). The memory 450 described in embodiments of the present application is intended to comprise any suitable type of memory.
In some embodiments, memory 450 is capable of storing data to support various operations, examples of which include programs, modules and data structures, or subsets or supersets thereof, as exemplified below.
An operating system 451 including system programs, e.g., framework layer, core library layer, driver layer, etc., for handling various basic system services and performing hardware-related tasks, for implementing various basic services and handling hardware-based tasks;
a network communication module 452 for accessing other electronic devices via one or more (wired or wireless) network interfaces 420, the exemplary network interface 420 comprising: Bluetooth, Wireless Fidelity (WiFi), Universal Serial Bus (USB), etc.;
A presentation module 453 for enabling presentation of information (e.g., a user interface for operating peripheral devices and displaying content and information) via one or more output devices 431 (e.g., a display screen, speakers, etc.) associated with the user interface 430;
an input processing module 454 for detecting one or more user inputs or interactions from one of the one or more input devices 432 and translating the detected inputs or interactions.
In some embodiments, the apparatus provided in the embodiments of the present application may be implemented in software, and fig. 2 shows an animation generating apparatus 455 stored in a memory 450, which may be software in the form of a program, a plug-in, or the like, including the following software modules: the display module 4551, the first selection module 4552, the second selection module 4553 and the generation module 4554 are logical, and thus may be arbitrarily combined or further split according to the functions implemented. The functions of the respective modules will be described hereinafter.
In other embodiments, the apparatus provided by the embodiments of the present application may be implemented in hardware, and the animation generating apparatus provided by the embodiments of the present application may be a processor in the form of a hardware decoding processor that is programmed to perform the animation generating method provided by the embodiments of the present application, for example, the processor in the form of a hardware decoding processor may employ one or more application specific integrated circuits (ASIC, application Specific Integrated Circuit), DSPs, programmable logic devices (PLD, programmable Logic Device), complex programmable logic devices (CPLD, complex Programmable Logic Device), field programmable gate arrays (FPGA, field-Programmable Gate Array), or other electronic components.
In some embodiments, the terminal or the server may implement the animation generation method provided by the embodiment of the present application by running a computer program. For example, the computer program may be a native program or a software module in an operating system; the Application program can be a local (Native) Application program (APP), namely a program which can be installed in an operating system to run, such as an instant messaging APP and a web browser APP; the method can also be an applet, namely a program which can be run only by being downloaded into a browser environment; but also an applet that can be embedded in any APP. In general, the computer programs described above may be any form of application, module or plug-in.
Based on the above description of the animation generation system and the electronic device provided by the embodiment of the present application, the animation generation method provided by the embodiment of the present application is described below. In practical implementation, the animation generation method provided by the embodiment of the present application may be implemented by a terminal or a server alone, or implemented by the terminal and the server cooperatively, and the animation generation method provided by the embodiment of the present application is illustrated by the terminal 400 in fig. 1 alone. Referring to fig. 3, fig. 3 is a flowchart of an animation generation method according to an embodiment of the present application, and will be described with reference to the steps shown in fig. 3.
Step 101, the terminal displays a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of a game engine.
In actual implementation, the terminal is provided with a game engine that may be used by a user (such as a game planner) to generate target animations, e.g., Unreal Engine, Unity, CryEngine, etc.
Referring to fig. 4, fig. 4 is a schematic diagram of an animation generation interface of a game engine according to an embodiment of the present application, based on fig. 4, a speech acquisition function item for speech file acquisition is shown in a dashed box 401 in the animation generation interface of the game engine, and a bone selection function item for bone model selection is shown in the dashed box 402.
In actual implementation, the animation generation interface of the game engine is triggered by the animation generation function item in the game operation interface, specifically, the terminal responds to click operation triggered by the client corresponding to the game engine, and presents the operation interface of the game engine; displaying animation generation function items for producing target animations and pictures of virtual scenes in an operation interface of a game engine; responding to the triggering operation for the animation generation function item, and presenting an animation generation interface; meanwhile, after the target animation is generated, the target animation is played in the picture of the virtual scene.
For example, referring to fig. 5, fig. 5 is a schematic diagram of an operation interface of a game engine provided in an embodiment of the present application, based on fig. 5, in which an animation generation function item for producing a target animation is shown in a dashed box 501, the animation generation interface shown in fig. 4 is presented in response to a trigger operation for the animation generation function item, and at the same time, after the target animation is generated, the target animation is played in a screen of a virtual scene.
The trigger operation for the animation generation function item may be a click operation, such as a single click or a double click operation, for the animation generation function item, or may be a voice input for animation generation, such as "animation generation", so that the animation generation function item is triggered based on the voice.
Step 102, acquiring a target voice file based on the voice acquisition function item.
In practical implementation, depending on how the target voice file is obtained, there are at least two ways of obtaining the target voice file based on the voice file acquisition operation triggered by the voice acquisition function item; these ways are exemplified next.
In some embodiments, when the obtained target voice file is a pre-stored voice file, the process of obtaining the target voice file based on the voice acquisition function item specifically includes: when the voice acquisition function item is a selection control for voice file selection, responding to a triggering operation for the selection control, and displaying at least one candidate voice file for selection; and responding to the selection operation for a target candidate voice file in the at least one candidate voice file, and taking the acquired target candidate voice file as the target voice file. That is, when the voice acquisition function item is a selection control for selecting a voice file, after receiving a triggering operation, such as a clicking operation, of the user for the selection control, the terminal presents a voice selection interface including a plurality of candidate voice files in response to the triggering operation, and then takes a candidate voice file as the target voice file in response to a selection operation, such as a clicking operation, of the user for that candidate voice file among the plurality of candidate voice files.
It should be noted that, in the manner of selecting the target voice file from candidate voice files, the target candidate voice file may be selected directly from the at least one candidate voice file and taken as the target voice file, or an intercepted audio fragment may be taken as the target voice file. Specifically, in response to the selection operation for the target candidate voice file in the at least one candidate voice file, a playing progress bar of the target candidate voice file and a corresponding content interception control are displayed; based on the playing progress bar of the target candidate voice file and the corresponding content interception control, the intercepted audio fragment is determined in response to a voice interception operation for the target candidate voice file; and in response to a determining instruction triggered based on the audio fragment, the acquired audio fragment is taken as the target voice file. For example, when the playing progress bar of the target candidate voice file and the corresponding content interception control are displayed, after receiving a triggering operation, such as a zooming operation or a dragging operation, of the user on the content interception control, the terminal intercepts the corresponding audio clip based on the playing progress bar of the target candidate voice file in response to that triggering operation. When the triggering operation is a zooming operation, the playing progress bar is scaled with a two-finger gesture, so that the corresponding audio clip is intercepted; when the triggering operation is a dragging operation, the starting point of the content interception control is dragged to the starting point of the audio fragment, and the ending point of the content interception control is dragged to the ending point of the audio fragment, so that the corresponding audio clip is intercepted. In this way, animation synthesis does not need to be performed on the whole voice file; the corresponding animation is produced only from the audio clip, which improves animation production efficiency and reduces resource consumption, as sketched below.
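A minimal sketch of producing the intercepted audio fragment from the start and end points chosen on the playing progress bar, assuming the librosa and soundfile libraries (which are not named in this application); times are in seconds.

    import librosa
    import soundfile as sf

    def clip_voice_file(src_path, dst_path, start_seconds, end_seconds, sample_rate=16000):
        # Load the candidate voice file and cut out the selected audio fragment.
        waveform, sr = librosa.load(src_path, sr=sample_rate)
        start = int(start_seconds * sr)
        end = int(end_seconds * sr)
        sf.write(dst_path, waveform[start:end], sr)
        return dst_path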
In other embodiments, when the obtained target voice file is a voice file recorded in real time, the process of obtaining the target voice file based on the voice obtaining function item specifically includes, when the voice obtaining function item is a recording control for recording the voice file, responding to a triggering operation for the recording control, and displaying a voice recording interface; and receiving the voice file recorded based on the voice recording interface, and taking the recorded voice file as a target voice file. Thus, the target voice file is obtained through real-time recording of the user voice, so that corresponding animation expressions are generated, and the animation interactivity and the animation effect are enhanced.
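A simplified sketch of recording a voice file, assuming the sounddevice and soundfile libraries and a fixed recording duration; a real voice recording interface would instead start and stop recording on user operations.

    import sounddevice as sd
    import soundfile as sf

    def record_voice(dst_path, seconds=5.0, sample_rate=16000):
        # Record from the default microphone for a fixed duration (blocking call).
        audio = sd.rec(int(seconds * sample_rate), samplerate=sample_rate, channels=1)
        sd.wait()  # wait until the recording is finished
        sf.write(dst_path, audio, sample_rate)
        return dst_path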
It should be noted that, depending on the target voice file to be obtained, the ways of obtaining the target voice file based on the acquisition operation triggered by the voice acquisition function item include, but are not limited to, the ways described above, which is not limited in the embodiment of the present application.
Step 103, in response to a bone model selection operation triggered based on the bone selection function item, determining the selected bone model as a target bone model for generating an animation.
In actual implementation, determining the selected bone model as a target bone model for generating an animation in response to a bone model selection operation triggered based on a bone selection function item, specifically including presenting a bone selection interface in response to the trigger operation for the bone selection function item, and presenting at least one bone model and model introduction information of each bone model in the bone selection interface; based on the bone model and model introduction information presented in the bone selection interface, the selected bone model is determined as the target bone model in response to the bone model selection operation.
Referring to fig. 6, fig. 6 is a schematic diagram of a bone selection interface provided by an embodiment of the present application. Based on fig. 6, in response to a triggering operation for the bone selection function item, a bone selection interface including two bone models is presented, and at the same time, introduction information of each model is presented in the bone selection interface. Illustratively, referring to fig. 4, after the terminal presents a bone selection interface including two candidate bone models and the introduction information of each candidate bone model, as shown in fig. 6, in response to a triggering operation, such as a clicking operation, for the bone selection function item in the dashed box 402, the terminal then takes one of the bone models as the target bone model in response to a selection operation, such as a clicking operation, made by the user according to the introduction information.
In some embodiments, the frame rate selection function for frame rate selection of the animation is also displayed in the animation generation interface, referring to fig. 7, fig. 7 is a schematic diagram of the animation generation interface provided in the embodiment of the present application, and based on fig. 7, the frame rate selection function for frame rate selection of the animation is shown in a dashed box 701. In practical application, in response to an animation frame rate selection operation triggered based on the frame rate selection function item, the selected animation frame rate is determined as a target animation frame rate for generating an animation, and then a target animation is generated in response to an animation generation instruction triggered based on a target voice file, a target bone model, and the target animation frame rate.
In actual implementation, a process of determining a selected animation frame rate as a target animation frame rate for generating an animation in response to an animation frame rate selection operation triggered based on a frame rate selection function item, specifically includes, in response to a trigger operation for the frame rate selection function item, presenting a frame rate selection interface including at least one animation frame rate; in response to an animation frame rate selection operation triggered based on the frame rate selection interface, the selected animation frame rate is determined as a target animation frame rate.
Referring to fig. 8, fig. 8 is a schematic diagram of a frame rate selection interface provided by an embodiment of the present application. Based on fig. 8, in response to a triggering operation for the frame rate selection function item, a frame rate selection interface including two animation frame rates is presented. For example, referring to fig. 7, after the terminal presents a frame rate selection interface including two animation frame rates, as shown in fig. 8, in response to a triggering operation, such as a clicking operation, for the frame rate selection function item in the dashed box 701, the terminal then takes the selected animation frame rate as the target animation frame rate in response to a selection operation, such as a clicking operation, by the user for one of the animation frame rates.
And 104, generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model.
In actual implementation, the target animation is used for showing the process in which the virtual object corresponding to the target skeleton model outputs the voice content in the target voice file. Referring to fig. 9, fig. 9 is a schematic diagram of presenting a target animation according to an embodiment of the present application. Based on fig. 9, after the target voice file and the target skeleton model are determined, an animation generation instruction triggered based on the target voice file and the target skeleton model is received, and the virtual object corresponding to the target skeleton model is then determined based on the animation generation instruction, so that an animation of the virtual object outputting the voice content in the target voice file is displayed in the virtual scene picture.
In some embodiments, the animation generation instruction may be triggered by a confirmation function item. Specifically, the animation generation interface further displays a confirmation function item for confirming generation of the animation, and the process of generating the target animation in response to the animation generation instruction triggered based on the target voice file and the target skeleton model specifically includes: generating the animation generation instruction in response to a triggering operation, such as a clicking operation, for the confirmation function item; and generating the target animation based on the animation generation instruction.
In other embodiments, the animation generation instruction may be triggered by a voice confirmation function item. Specifically, the process of generating the target animation in response to the animation generation instruction triggered based on the target voice file and the target skeleton model includes: in response to a triggering operation for the voice confirmation function item, such as speaking "confirm generation of animation", receiving the target voice for confirming generation of the animation; generating the animation generation instruction based on the target voice; and generating the target animation based on the animation generation instruction.
In practical implementation, the process of generating the target animation based on the animation generation instruction may be based on a voice driving model, and in practical application, the voice driving model may be operated on a server or a terminal, and then, the process of generating the target animation based on the animation generation instruction is described with respect to two operation modes of the voice driving model.
In some embodiments, when the voice-driven model runs on the server, referring to fig. 10, fig. 10 is a flowchart illustrating a process of generating the target animation according to an embodiment of the present application, based on fig. 10, step 104 may be performed by:
In step 1041a, a control data generation request carrying the target voice file is sent to the server in response to an animation generation instruction triggered based on the target voice file and the target skeleton model.
The control data generation request is used for generating control data corresponding to the voice data of the target voice file based on the target voice file by the server, wherein the control data is data for controlling all parts in the skeleton model.
In practical implementation, the terminal sends a control data generation request carrying the target voice file to the server. After receiving the control data generation request carrying the target voice file, the server performs data mapping on the voice data of the target voice file through the voice driving model to obtain control data corresponding to the voice data of the target voice file, and then returns the obtained control data to the terminal.
The process of performing data mapping on the voice data of the target voice file through the voice driving model to obtain control data corresponding to the voice data of the target voice file specifically comprises the steps that the server performs feature extraction on the voice data of the target voice file through a feature extraction layer of the voice driving model to obtain audio features of the target voice file at all time points; and carrying out data mapping on the audio characteristics of the target voice file at each time point through a data mapping layer of the voice driving model to obtain control data of the target voice file at each time point.
In step 1042a, control data returned by the server is received, and a target animation corresponding to the target skeleton model is generated based on the control data.
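A hedged sketch of the client-side round trip in steps 1041a and 1042a is shown below; the endpoint URL, field names, and response format are illustrative assumptions and are not defined by this application.

    import requests

    def request_control_data(voice_file_path, server_url="http://localhost:8000/control-data"):
        # Send the control data generation request carrying the target voice file.
        with open(voice_file_path, "rb") as f:
            response = requests.post(server_url, files={"voice": f}, timeout=30)
        response.raise_for_status()
        # Assumed response shape: {"frames": [{"time": 0.0, "controllers": {...}}, ...]}
        return response.json()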
In other embodiments, when the voice-driven model is running on the terminal, referring to fig. 11, fig. 11 is a flowchart illustrating a process of generating the target animation according to the embodiment of the present application, based on fig. 11, step 104 may be performed by:
step 1041b, responding to the animation generation instruction triggered based on the target voice file and the target skeleton model, and performing data mapping on the voice data of the target voice file through the voice driving model to obtain the control data of the corresponding target voice file.
Wherein the control data is data for controlling each part in the bone model.
In practical implementation, the process of performing data mapping on the voice data of the target voice file through the voice driving model to obtain control data corresponding to the voice data of the target voice file specifically comprises the steps that the terminal performs feature extraction on the voice data of the target voice file through a feature extraction layer of the voice driving model to obtain audio features of the target voice file at all time points; and carrying out data mapping on the audio characteristics of the target voice file at each time point through a data mapping layer of the voice driving model to obtain control data of the target voice file at each time point.
Step 1042b, generating a target animation corresponding to the target bone model based on the control data.
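As an illustration of the two-stage structure described in steps 1041b and 1042b (and equally of the server-side model in step 1041a), the following is a minimal sketch assuming a PyTorch implementation; the layer types and sizes are illustrative assumptions, not prescribed by this application.

    import torch
    import torch.nn as nn

    class VoiceDrivenModel(nn.Module):
        def __init__(self, n_audio_features=13, hidden_size=128, n_controllers=50):
            super().__init__()
            # Feature extraction layer: encodes the audio features at each time point.
            self.feature_extraction = nn.GRU(n_audio_features, hidden_size, batch_first=True)
            # Data mapping layer: maps the extracted features to control data
            # (one value per controller) at each time point.
            self.data_mapping = nn.Linear(hidden_size, n_controllers)

        def forward(self, audio_features):
            # audio_features: (batch, time, n_audio_features), e.g. MFCC frames.
            hidden, _ = self.feature_extraction(audio_features)
            control_data = self.data_mapping(hidden)  # (batch, time, n_controllers)
            return control_data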
In some embodiments, before displaying the voice acquisition function for voice file acquisition and the bone selection function for bone model selection in the animation generation interface of the game engine, the voice driving model may be trained first, where, taking the voice driving model running on the terminal as an example, referring to fig. 12, fig. 12 is a schematic flow chart of the training process of the voice driving model provided in the embodiment of the present application, based on fig. 12, before step 101, further may be executed:
in step 201, the terminal acquires an initial voice driving model and a training voice file carrying a tag.
The tag is used for indicating real control data corresponding to the training voice file.
In actual implementation, the process of acquiring the training voice file carrying the tag by the terminal specifically includes: acquiring target text data, and recording audio data and expression data when a target object reads the target text data; performing data conversion on the expression data to obtain control data corresponding to the expression data; and taking the file corresponding to the audio data as the training voice file, and taking the control data corresponding to the expression data as the label of the file corresponding to the audio data.
As an example, text data is prepared first, then the target object reads the text data, the mouth shape of the target object is converted into corresponding control data through facial motion capture, and meanwhile, the voice of the target object is recorded to obtain an audio file, so that the audio file is used as a training voice file, and the control data is used as a label of the corresponding audio file.
Step 202, performing data mapping on the voice data of the training voice file through the initial voice driving model to obtain the prediction control data of the corresponding training voice file.
In practical implementation, the process of performing data mapping on the voice data of the training voice file through the initial voice driving model to obtain the predictive control data of the corresponding training voice file specifically comprises the steps of firstly performing feature extraction on the voice data of the training voice file through a feature extraction layer of the initial voice driving model to obtain the audio features of the training voice file at all time points; and then, carrying out data mapping on the audio characteristics of the training voice file at each time point through a data mapping layer of the initial voice driving model to obtain the prediction control data of the training voice file at each time point.
And 203, obtaining the difference between the real control data and the predicted control data, and updating the model parameters of the initial voice driving model based on the difference.
In actual implementation, a loss function corresponding to the initial voice driving model is acquired, then differences of real control data and predicted control data are acquired, and the value of the loss function is determined based on the differences, so that model parameters of the initial voice driving model are updated based on the value of the loss function.
The loss function here may be, for example:
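(The formula itself is not reproduced in this text; assuming a mean-squared-error form consistent with the symbol description below, it can be written as:)

    Loss = \frac{1}{m} \sum_{i=1}^{m} \left( y_i - y_i' \right)^2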
where Loss represents the loss function, m represents the dimension of the control data, y represents the predicted control data, and y' represents the real control data.
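A minimal sketch of one training update corresponding to steps 202 and 203, assuming PyTorch and a mean-squared-error loss over the control-data dimensions; the network structure and dimensions are illustrative stand-ins for the initial voice driving model.

    import torch
    import torch.nn as nn

    model = nn.Sequential(                      # stand-in for the initial voice driving model
        nn.Linear(13, 128), nn.ReLU(), nn.Linear(128, 50))
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.MSELoss()

    def training_step(audio_features, real_control_data):
        # audio_features: (batch, time, 13) audio features of the training voice file
        # real_control_data: (batch, time, 50) controller values indicated by the label
        predicted_control_data = model(audio_features)
        loss = loss_fn(predicted_control_data, real_control_data)  # difference between real and predicted
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        return loss.item()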
By applying the embodiment of the application, the target voice file and the target skeleton model are determined through the voice acquisition function item and the skeleton selection function item which are presented in the animation generation interface of the game engine, so that the target animation for showing the process of outputting the voice content in the target voice file by the virtual object corresponding to the target skeleton model is generated. Thus, the target animation is directly generated in the game engine through the target voice file, and the efficiency of generating the animation in the game engine is improved.
Next, continuing to describe the animation generation method provided by the embodiment of the present application, fig. 13 is a schematic flow chart of the animation generation method provided by the embodiment of the present application, where, taking the example that the voice driving model is run on the server, referring to fig. 13, the animation generation method provided by the embodiment of the present application is cooperatively implemented by the client and the server.
In step 301, the client side responds to the uploading operation for the training voice file to obtain the training voice file carrying the tag.
In practical implementation, the client may be a game engine client provided on the terminal for generating animations. The uploading operation for the training voice file may be triggered by the user through the human-computer interaction interface of the client, so that the client presents a voice file selection interface on the human-computer interaction interface, and the user uploads the training voice file from the terminal's local storage based on the voice file selection interface, thereby enabling the client to obtain the uploaded training voice file.
In some embodiments, the training voice file may also be recorded by a recording device and a camera which are in communication connection with the terminal, specifically, the speech text is prepared first, then the speech is read by the user, the facial motion of the user is captured through the camera, the mouth shape of the user is converted into corresponding control data, and simultaneously, the voice when the user reads the speech is recorded by the recording device, so as to obtain an audio file, the audio file is used as the training voice file, and the control data is used as the tag of the corresponding audio file, so as to obtain the training voice file with the tag. And after the training voice file carrying the tag is obtained, transmitting the training voice file to the terminal and automatically uploading the training voice file to the client by the terminal.
In step 302, the client sends the training voice file and the corresponding tag to the server.
In step 303, the server inputs the received training voice file into the voice driving model.
In step 304, the server outputs the predicted control data of the training voice file.
In step 305, the server obtains the difference between the predicted control data and the tag, and trains the voice driving model based on the difference.
In actual implementation, the server completes the training of the voice driving model by iterating the above training process until the loss function converges.
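As an example, such an iterative training process may look like the sketch below, which reuses the VoiceDrivingModel sketch above and inlines the loss of formula (1); the Adam optimizer, the learning rate, and the convergence tolerance are illustrative assumptions.

```python
import torch

def train_until_convergence(model, training_pairs, lr=1e-3, tol=1e-5, max_epochs=1000):
    # training_pairs: iterable of (mfcc_frames, real_control_data) tensor pairs.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    previous_loss = float("inf")
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for mfcc_frames, real_control_data in training_pairs:
            optimizer.zero_grad()
            predicted = model(mfcc_frames)
            # Sum of squared errors over controller dimensions (formula (1)),
            # averaged over the time points of the training voice file.
            loss = ((predicted - real_control_data) ** 2).sum(dim=-1).mean()
            loss.backward()      # compute gradients of the loss
            optimizer.step()     # update the model parameters
            epoch_loss += loss.item()
        if abs(previous_loss - epoch_loss) < tol:  # loss has converged
            break
        previous_loss = epoch_loss
    return model
```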
In step 306, the server generates a prompt message indicating that training of the voice driving model is complete.
In step 307, the server sends a prompt message to the client.
In step 308, the client responds to the uploading operation of the voice file, and obtains the voice file to be processed.
It should be noted that, the voice file may also be sent to the client by another device in communication with the terminal.
In step 309, the client sends a control data generation request carrying the voice file to the server in response to the animation generation instruction for the voice file.
In actual implementation, the animation generation instruction for the voice file may be sent to the client by another device in communication connection with the terminal, may be generated after the user triggers a corresponding confirmation function item on the man-machine interaction interface of the client, or may be automatically generated by the client under a certain trigger condition; for example, when the skeleton model and the animation frame rate of the target animation take their default values, the client automatically generates the animation generation instruction for the voice file after obtaining the voice file.
In step 310, the server inputs the received voice file into the voice driving model to obtain control data corresponding to the voice data of the voice file.
In step 311, the server sends control data to the client.
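As an example, the server side of steps 310 and 311 might be sketched as follows. This only illustrates the request/response shape: Flask, the /generate route, and the helper names extract_features and run_voice_driving_model are hypothetical and not defined by this embodiment; the per-frame layout of 80 float values follows the controller description given later in this description.

```python
import struct
from flask import Flask, request

app = Flask(__name__)

@app.route("/generate", methods=["POST"])
def generate_control_data():
    audio_bytes = request.get_data()                      # raw voice file from the client
    mfcc_frames = extract_features(audio_bytes)           # hypothetical helper
    control_data = run_voice_driving_model(mfcc_frames)   # hypothetical helper, (n_frames, 80)
    # Package every frame's 80 float values into one binary data packet.
    packet = b"".join(struct.pack("<80f", *frame) for frame in control_data)
    return packet, 200, {"Content-Type": "application/octet-stream"}
```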
In step 312, the client generates a target animation based on the control data.
In actual implementation, the client may play the target animation in the man-machine interaction interface of the client, store the target animation locally on the terminal, or send the target animation to another device in communication connection with the terminal, and so on.
By applying the embodiment of the present application, the target voice file and the target skeleton model are determined through the voice acquisition function item and the skeleton selection function item presented in the animation generation interface of the game engine, so as to generate a target animation that shows the process in which the virtual object corresponding to the target skeleton model outputs the voice content in the target voice file. In this way, the target animation is generated directly in the game engine from the target voice file, which improves the efficiency of generating animation in the game engine.
In the following, an exemplary application of the embodiment of the present application in a practical application scenario will be described.
When producing facial animation, an art producer conventionally needs to make the mouth-shape animation corresponding to the voice in DCC software, export it, and then import it into the game, which is a lengthy process and reduces animation production efficiency. Based on this, the present application provides an animation generation method, apparatus, electronic device, computer readable storage medium, and computer program product for generating an animation asset (target animation) directly in the UE (game engine) from voice: the voice audio file (target voice file) is sent to the voice driving server (running the voice driving model) through a network request to obtain the controller animation data (control data), and the controller animation data is then made into an animation asset (target animation) in the UE, so that the animation asset can be generated directly in the engine and the efficiency of producing the animation is improved.
In actual implementation, the UE plug-in of the present application is provided for animators to use, as shown in fig. 7. Based on fig. 7, the animator first clicks the function item in the dashed box 702 to select the voice file from which an animation is to be generated; then clicks the function item in the dashed box 703 to select, as shown in fig. 6, the Skeleton for which the animation needs to be generated; and finally clicks the function item in the dashed box 701 to select, as shown in fig. 8, the frame rate for generating the animation, and clicks the confirm function item for confirming the animation generation, whereupon an animation file is generated, as shown in fig. 14. Fig. 14 is a schematic diagram of the generated animation file provided by the embodiment of the present application. Based on fig. 14, after the voice file is selected, the Skeleton "workmask bsig_skeleton" in fig. 6 is selected, and a frame rate of 30 is selected in fig. 8, the animation file shown in fig. 14 is generated.
Referring to fig. 15, fig. 15 is a schematic diagram of the animation generation method provided by the embodiment of the present application. Based on fig. 15, the animation generation method of the present application is implemented in three steps: the UE reads the voice audio file and sends it to the voice driving server; the voice driving server returns the controller animation data to the UE; and the UE generates an animation asset from the controller animation data.
For the process in which the UE reads the voice audio file and sends it to the voice driving server: specifically, the binary data of the designated voice file is read first, and the binary audio data is then sent to the voice driving server through a network request, such as an HTTP POST request.
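As an example, the request can be sketched as follows. In the actual plug-in this step runs inside the UE; the Python sketch below only illustrates the shape of the request, and the server URL is a placeholder.

```python
import requests

def send_voice_file(path: str, server_url: str = "http://voice-driving-server.example/generate"):
    # Read the binary data of the designated voice file.
    with open(path, "rb") as f:
        audio_bytes = f.read()
    # Send the binary audio to the voice driving server via an HTTP POST request.
    response = requests.post(
        server_url,
        data=audio_bytes,
        headers={"Content-Type": "application/octet-stream"},
    )
    response.raise_for_status()
    return response.content  # binary data packet of controller animation data
```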
For the process in which the voice driving server returns the controller animation data to the UE, it should be noted that the voice driving server runs a voice driving model, and the voice driving model can convert the voice audio into controller curves. There are a plurality of controllers whose order is predefined. Taking 80 controllers as an example, part of the controller list is shown in fig. 16; fig. 16 is a schematic diagram of part of the controller list provided by the embodiment of the present application. Based on fig. 16, the controllers in the dashed box 1601 are used to control parameters related to the eyes, that is, the corresponding controller data is used to generate and control the eyes of the character corresponding to the skeleton in the animation, and the controllers in the dashed box 1602 are used to control parameters related to the dimples, that is, the corresponding controller data is used to generate and control the dimples of the character corresponding to the skeleton in the animation.
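As an example, the predefined order can be represented as a fixed list of controller names shared by the server and the UE; the names below are hypothetical placeholders, since the real names are those shown in fig. 16.

```python
# Hypothetical excerpt of the predefined controller order; only the fixed
# ordering matters, because frame values are matched to controllers by index.
CONTROLLER_NAMES = [
    "eye_ctrl_left",      # eye-related parameter (cf. dashed box 1601)
    "eye_ctrl_right",
    "dimple_ctrl_left",   # dimple-related parameter (cf. dashed box 1602)
    "dimple_ctrl_right",
    # ... 80 entries in total, in the predefined order
]
```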
In actual implementation, text data is first prepared, and an actor (target object) reads the text data aloud; through facial motion capture, the mouth shape of the actor (the tag data) is converted into data of the 80 controllers (control data), and the voice of the actor is recorded at the same time to obtain an audio file. Then, the MFCC features of the audio file and the controller data corresponding in time are extracted to construct input training data pairs; that is, the audio file is used as a training voice file, and the data of the 80 controllers is used as the tag of the corresponding audio file to train the model. Next, the MFCC features are input into the voice driving model to predict the values of the 80 controllers, and training with the above formula (1) minimizes the Loss, that is, minimizes the sum of squares of the errors between the predicted and actual values of the 80 controllers, so that the mapping relationship between the audio features and the controller data is learned; after training is completed, voice data can be mapped into controller data through the voice driving model. Here, each float corresponds to one controller, so each frame of animation consists of 80 float values. Finally, the server packages the animation data of all frames into a binary data packet and returns it to the UE.
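As an example, constructing the (MFCC features, controller data) training pairs could be sketched as follows; the librosa parameters, the 16 kHz sample rate, and the CSV format assumed for the facial-capture output are illustrative assumptions.

```python
import librosa
import numpy as np

def build_training_pair(audio_path: str, controller_csv: str, fps: int = 30):
    # Load the recorded speech and extract one MFCC frame per animation frame.
    waveform, sr = librosa.load(audio_path, sr=16000)
    hop_length = sr // fps
    mfcc = librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=13, hop_length=hop_length).T

    # Facial-capture output: one row of 80 controller values per animation frame.
    controller_data = np.loadtxt(controller_csv, delimiter=",")

    # Align the time axes; audio and capture lengths may differ by a frame or two.
    n = min(len(mfcc), len(controller_data))
    return mfcc[:n], controller_data[:n]  # training voice features and their tag
```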
For the process in which the UE generates an animation asset from the controller animation data: specifically, the UE obtains the binary data packet returned by the voice driving server, and before reading the binary data packet and generating the animation asset, all controllers and all frames need to be set. The process of setting all controllers is as follows: an Anim Sequence is created for the selected skeleton, the frame rate of the Anim Sequence is set to the selected frame rate, and 80 curves are then added to the Anim Sequence in order, whose names correspond to the names of the 80 controllers; the value of each controller corresponds to one of the 80 float values, in the order given by the controller list. The process of setting all frames is to set the key frames in the Anim Sequence, that is, to set the time and value of each key frame, where the time of a key frame is the product of the current frame number and the single-frame time, which is, for example, 1/60 seconds per frame at a frame rate of 60 fps and 1/30 seconds per frame at 30 fps. On this basis, after all frames and all controller values have been set by sequentially reading the data in the binary data packet, an Anim Sequence animation asset can be generated, as shown in fig. 9.
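As an example, reading the binary data packet into per-frame controller values and key-frame times can be sketched as follows; the little-endian float32 layout is an assumption, and writing the resulting curves into the Anim Sequence itself is left to the UE plug-in.

```python
import struct

N_CONTROLLERS = 80

def unpack_controller_animation(packet: bytes, fps: int):
    frame_size = N_CONTROLLERS * 4  # 80 float32 values per animation frame
    n_frames = len(packet) // frame_size
    keyframes = []
    for frame_index in range(n_frames):
        chunk = packet[frame_index * frame_size:(frame_index + 1) * frame_size]
        values = struct.unpack(f"<{N_CONTROLLERS}f", chunk)
        # Key-frame time = current frame number x single-frame time,
        # e.g. 1/30 s per frame at 30 fps, 1/60 s per frame at 60 fps.
        keyframes.append((frame_index * (1.0 / fps), values))
    return keyframes  # one (time, 80 controller values) entry per key frame
```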
By applying the embodiment of the present application, the target voice file and the target skeleton model are determined through the voice acquisition function item and the skeleton selection function item presented in the animation generation interface of the game engine, so as to generate a target animation that shows the process in which the virtual object corresponding to the target skeleton model outputs the voice content in the target voice file. In this way, the target animation is generated directly in the game engine from the target voice file, which improves the efficiency of generating animation in the game engine.
Continuing with the description below of an exemplary architecture of animation generation device 455 implemented as a software module provided by an embodiment of the present application, in some embodiments, as shown in FIG. 2, the software modules stored in animation generation device 455 of memory 440 may comprise:
a display module 4551 for displaying, in an animation generation interface of the game engine, a voice acquisition function item for voice file acquisition and a bone selection function item for bone model selection;
a first selection module 4552, configured to obtain a target voice file based on the voice acquisition function item;
a second selection module 4553 for determining the selected bone model as a target bone model for generating an animation in response to a bone model selection operation triggered based on the bone selection function item;
And the generating module 4554 is configured to generate a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model, where the target animation is used for displaying a process that a virtual object corresponding to the target skeleton model outputs voice content in the target voice file.
In some embodiments, the first selecting module 4552 is further configured to, when the voice obtaining function item is a selection control for selecting a voice file, display at least one candidate voice file for selection in response to a triggering operation for the selection control; and responding to the selection operation of a target candidate voice file in the at least one candidate voice file, and taking the acquired target candidate voice file as the target voice file.
In some embodiments, the first selecting module 4552 is further configured to display a playing progress bar and a corresponding content capture control of the target candidate voice file in response to a selection operation for the target candidate voice file in the at least one candidate voice file; based on the playing progress bar of the target candidate voice file and the corresponding content interception control, responding to voice interception operation for the target candidate voice file, and determining an intercepted audio fragment; and responding to a determining instruction triggered based on the audio fragment, and acquiring the audio fragment as the target voice file.
In some embodiments, the first selecting module 4552 is further configured to, when the voice obtaining function item is a recording control for recording a voice file, display a voice recording interface in response to a triggering operation for the recording control; and receiving a voice file recorded based on the voice recording interface, and taking the recorded voice file as the target voice file.
In some embodiments, the apparatus further comprises a running module and a playing module, wherein the running module is used for displaying an animation generation function item for producing the target animation and a picture of a virtual scene in a running interface of a game engine; responsive to a trigger operation for the animation generation function item, presenting the animation generation interface; and the playing module is used for playing the target animation in the picture of the virtual scene.
In some embodiments, the second selecting module 4553 is further configured to present a bone selection interface in response to a triggering operation for the bone selection function item, and present at least one bone model and model introduction information for each of the bone models in the bone selection interface; based on the bone model and the model introduction information presented in the bone selection interface, the selected bone model is determined as a target bone model in response to a bone model selection operation.
In some embodiments, the animation generation interface further displays a frame rate selection function item for animation frame rate selection, and the apparatus further includes a third selection module for determining the selected animation frame rate as a target animation frame rate for generating an animation in response to an animation frame rate selection operation triggered based on the frame rate selection function item; the generating module 4554 is further configured to generate a target animation in response to an animation generation instruction triggered based on the target voice file, the target bone model, and the target animation frame rate.
In some embodiments, the generating module 4554 is further configured to send a control data generation request carrying the target voice file to a server in response to an animation generation instruction triggered based on the target voice file and the target bone model; the control data generation request is used for generating control data corresponding to voice data of the target voice file based on the target voice file by the server; wherein the control data is data for controlling each part in the bone model; and receiving the control data returned by the server, and generating a target animation corresponding to the target skeleton model based on the control data.
In some embodiments, the generating module 4554 is further configured to perform data mapping on the voice data of the target voice file through a voice driving model in response to an animation generating instruction triggered based on the target voice file and the target bone model, to obtain control data corresponding to the target voice file, where the control data is data for controlling each part in the bone model; and generating a target animation corresponding to the target skeleton model based on the control data.
In some embodiments, the generating module 4554 is further configured to perform feature extraction on the voice data of the target voice file through a feature extraction layer of the voice driving model to obtain audio features of the target voice file at each time point; and carrying out data mapping on the audio characteristics of the target voice file at each time point through a data mapping layer of the voice driving model to obtain control data of the target voice file at each time point.
In some embodiments, the apparatus further comprises a training module, configured to obtain an initial voice driving model, and a training voice file carrying a tag, where the tag is configured to indicate real control data corresponding to the training voice file; performing data mapping on the voice data of the training voice file through the initial voice driving model to obtain prediction control data corresponding to the training voice file; and acquiring the difference between the real control data and the predicted control data, and updating the model parameters of the initial voice driving model based on the difference.
In some embodiments, the training module is further configured to obtain target text data, and record audio data and expression data when the target object reads the target text data; performing data conversion on the expression data to obtain control data corresponding to the expression data; and taking the file corresponding to the audio data as the training voice file, and taking the control data corresponding to the expression data as the label of the file corresponding to the audio data.
Embodiments of the present application provide a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the electronic device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions, so that the electronic device executes the animation generation method according to the embodiment of the present application.
Embodiments of the present application provide a computer-readable storage medium storing executable instructions that, when executed by a processor, cause the processor to perform the animation generation method provided by the embodiments of the present application, for example, the animation generation method shown in fig. 3.
In some embodiments, the computer-readable storage medium may be a memory such as FRAM, ROM, PROM, EPROM, EEPROM, flash memory, magnetic surface memory, an optical disc, or CD-ROM, or may be any device that includes one or any combination of the above memories.
In some embodiments, the executable instructions may be in the form of programs, software modules, scripts, or code, written in any form of programming language (including compiled or interpreted languages, or declarative or procedural languages), and they may be deployed in any form, including as stand-alone programs or as modules, components, subroutines, or other units suitable for use in a computing environment.
As an example, the executable instructions may, but need not, correspond to files in a file system, may be stored as part of a file that holds other programs or data, for example, in one or more scripts in a hypertext markup language (HTML, hyper Text Markup Language) document, in a single file dedicated to the program in question, or in multiple coordinated files (e.g., files that store one or more modules, sub-programs, or portions of code).
As an example, executable instructions may be deployed to be executed on one electronic device or on multiple electronic devices located at one site or, alternatively, on multiple electronic devices distributed across multiple sites and interconnected by a communication network.
In summary, the embodiment of the application has the following beneficial effects:
by directly generating the target animation in the game engine through the target voice file, the efficiency of generating the animation in the game engine is improved.
The foregoing is merely exemplary embodiments of the present application and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and scope of the present application are included in the protection scope of the present application.

Claims (15)

1. An animation generation method, applied to a game engine, comprising:
displaying a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of a game engine;
acquiring a target voice file based on the voice acquisition function item;
in response to a bone model selection operation triggered based on the bone selection function item, determining the selected bone model as a target bone model for generating an animation;
And generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model, wherein the target animation is used for displaying a process of outputting voice contents in the target voice file by a virtual object corresponding to the target skeleton model.
2. The method of claim 1, wherein the acquiring the target voice file based on the voice acquisition function item comprises:
when the voice acquisition function item is a selection control for voice file selection, responding to a triggering operation for the selection control, and displaying at least one candidate voice file for selection;
and responding to the selection operation of a target candidate voice file in the at least one candidate voice file, and taking the acquired target candidate voice file as the target voice file.
3. The method of claim 2, wherein the responding to the selection operation for the target candidate voice file in the at least one candidate voice file, the obtaining the target candidate voice file as the target voice file comprises:
responding to the selection operation of the target candidate voice file in the at least one candidate voice file, and displaying the playing progress bar of the target candidate voice file and a corresponding content interception control;
Based on the playing progress bar of the target candidate voice file and the corresponding content interception control, responding to voice interception operation for the target candidate voice file, and determining an intercepted audio fragment;
and responding to a determining instruction triggered based on the audio fragment, and acquiring the audio fragment as the target voice file.
4. The method of claim 1, wherein the acquiring the target voice file based on the voice acquisition function item comprises:
when the voice acquisition function item is a recording control for recording a voice file, responding to triggering operation for the recording control, and displaying a voice recording interface;
and receiving a voice file recorded based on the voice recording interface, and taking the recorded voice file as the target voice file.
5. The method of claim 1, wherein before displaying the voice capture function for voice file capture and the bone selection function for bone model selection in the animation generation interface of the game engine, the method further comprises:
displaying an animation generation function item for producing the target animation and a picture of a virtual scene in an operation interface of a game engine;
Responsive to a trigger operation for the animation generation function item, presenting the animation generation interface;
after the generating the target animation in response to the animation generating instruction triggered based on the target voice file and the target skeleton model, the method further comprises:
and playing the target animation in the picture of the virtual scene.
6. The method of claim 1, wherein the determining the selected bone model as the target bone model for generating the animation in response to the bone model selection operation triggered based on the bone selection function comprises:
responding to the triggering operation for the bone selection function item, presenting a bone selection interface, and displaying at least one bone model and model introduction information of each bone model in the bone selection interface;
based on the bone model and the model introduction information presented in the bone selection interface, the selected bone model is determined as a target bone model in response to a bone model selection operation.
7. The method of claim 1, wherein the animation generation interface further displays a frame rate selection function for animation frame rate selection, the method further comprising:
In response to an animation frame rate selection operation triggered based on the frame rate selection function item, determining the selected animation frame rate as a target animation frame rate for generating an animation;
the generating a target animation in response to an animation generation instruction triggered based on the target voice file and the target skeleton model comprises the following steps:
and generating a target animation in response to an animation generation instruction triggered based on the target voice file, the target skeleton model and the target animation frame rate.
8. The method of claim 1, wherein the generating a target animation in response to animation generation instructions triggered based on the target voice file and the target bone model comprises:
responding to an animation generation instruction triggered based on the target voice file and the target skeleton model, and sending a control data generation request carrying the target voice file to a server;
the control data generation request is used for generating control data corresponding to voice data of the target voice file based on the target voice file by the server;
wherein the control data is data for controlling each part in the bone model;
And receiving the control data returned by the server, and generating a target animation corresponding to the target skeleton model based on the control data.
9. The method of claim 1, wherein the generating a target animation in response to animation generation instructions triggered based on the target voice file and the target bone model comprises:
responding to an animation generation instruction triggered on the basis of the target voice file and the target skeleton model, and performing data mapping on voice data of the target voice file through a voice driving model to obtain control data corresponding to the target voice file, wherein the control data is data for controlling all parts in the skeleton model;
and generating a target animation corresponding to the target skeleton model based on the control data.
10. The method of claim 9, wherein the mapping the voice data of the target voice file to obtain the control data corresponding to the target voice file through the voice driving model comprises:
extracting the characteristics of the voice data of the target voice file through the characteristic extraction layer of the voice driving model to obtain the audio characteristics of the target voice file at each time point;
And carrying out data mapping on the audio characteristics of the target voice file at each time point through a data mapping layer of the voice driving model to obtain control data of the target voice file at each time point.
11. The method of claim 9, wherein before displaying the voice capture function for voice file capture and the bone selection function for bone model selection in the animation generation interface of the game engine, the method further comprises:
acquiring an initial voice driving model and a training voice file carrying a label, wherein the label is used for indicating real control data corresponding to the training voice file;
performing data mapping on the voice data of the training voice file through the initial voice driving model to obtain prediction control data corresponding to the training voice file;
and acquiring the difference between the real control data and the predicted control data, and updating the model parameters of the initial voice driving model based on the difference.
12. The method of claim 11, wherein obtaining a training voice file carrying a tag comprises:
Acquiring target text data, and recording audio data and expression data when a target object reads the target text data;
performing data conversion on the expression data to obtain control data corresponding to the expression data;
and taking the file corresponding to the audio data as the training voice file, and taking the control data corresponding to the expression data as the label of the file corresponding to the audio data.
13. An animation generation device for use with a game engine, the device comprising:
the display module is used for displaying a voice acquisition function item for acquiring a voice file and a skeleton selection function item for selecting a skeleton model in an animation generation interface of the game engine;
the first selection module is used for acquiring a target voice file based on the voice acquisition function item;
a second selection module for determining the selected bone model as a target bone model for generating an animation in response to a bone model selection operation triggered based on the bone selection function item;
and the generating module is used for responding to an animation generating instruction triggered based on the target voice file and the target skeleton model to generate a target animation, wherein the target animation is used for displaying the process of outputting voice contents in the target voice file by a virtual object corresponding to the target skeleton model.
14. An electronic device, comprising:
a memory for storing executable instructions;
a processor for implementing the animation generation method of any of claims 1 to 12 when executing executable instructions stored in the memory.
15. A computer readable storage medium storing executable instructions for causing a processor to perform the animation generation method of any one of claims 1-12.

