CN117113968A - Data processing method, device, computer equipment and storage medium - Google Patents


Info

Publication number
CN117113968A
CN117113968A (application CN202310840488.8A)
Authority
CN
China
Prior art keywords
text
action
sample
model
texts
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310840488.8A
Other languages
Chinese (zh)
Inventor
杜文哲
陈万顺
杜楠
代勇
程鹏宇
郑哲
刘星言
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority claimed from CN202310840488.8A
Publication of CN117113968A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 40/00 Handling natural language data
    • G06F 40/20 Natural language analysis
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/33 Querying
    • G06F 16/335 Filtering based on additional data, e.g. user or group profiles
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/22 Matching criteria, e.g. proximity measures
    • G06F 40/30 Semantic analysis
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Databases & Information Systems (AREA)
  • Medical Informatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The embodiments of the application disclose a data processing method, an apparatus, a computer device and a storage medium, which can be applied to artificial intelligence scenarios and include the following steps: acquiring a text to be predicted associated with a virtual object; acquiring the text matching degree between each of M action texts in an action library and the text to be predicted, and selecting, based on the M text matching degrees, N action texts corresponding to the text to be predicted from the M action texts, where M is a positive integer and N is a positive integer less than or equal to M; and performing action prediction on the virtual object based on the N action texts and the text to be predicted, to generate a predicted action text corresponding to the text to be predicted. The predicted action text belongs to the N action texts and is used for driving the virtual object to execute the action corresponding to the predicted action text. The embodiments of the application can reduce prediction cost and improve prediction efficiency and prediction accuracy.

Description

Data processing method, device, computer equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method, a data processing device, a computer device, and a storage medium.
Background
With the development of artificial intelligence technology, technologies and applications related to virtual objects (e.g., virtual humans) have emerged. A virtual human is a virtual character with a digital appearance: it has the movements and the thinking of a person, can respond with language, expressions and actions during conversational communication, and need not even take human form. Currently, when the actions of a virtual object are predicted based on its text to be predicted (e.g., dialogue context associated with the virtual object), the predicted actions either depend heavily on manual labeling or are generated directly from the text to be predicted.
In the manual-labeling approach, annotators must label action texts according to their own understanding, and different annotators inevitably label the same action text inconsistently; the cost is therefore high, and both prediction efficiency and prediction accuracy suffer. The approach of generating the predicted action directly from the textual description of the text to be predicted not only requires the description to be sufficiently accurate, but also requires collecting a large amount of training data and spending considerable training resources; moreover, it is unreliable and hard to control, which seriously affects the accuracy of action prediction. For example, if the text to be predicted is "a virtual dog flies from left to right", the generated motion may nevertheless include the dog turning its head. In short, conventional action prediction methods suffer from high labor cost, low prediction efficiency and low prediction accuracy.
Disclosure of Invention
The embodiment of the application provides a data processing method, a data processing device, computer equipment and a storage medium, which can reduce the prediction cost and improve the prediction efficiency and the prediction accuracy.
An aspect of an embodiment of the present application provides a data processing method, including:
acquiring a text to be predicted associated with a virtual object;
respectively acquiring the text matching degree between each of M action texts in an action library and the text to be predicted, and selecting, based on the M text matching degrees, N action texts corresponding to the text to be predicted from the M action texts; M is a positive integer; N is a positive integer less than or equal to M;
performing action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted; the predicted action text belongs to the N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
An aspect of an embodiment of the present application provides a data processing method, including:
acquiring a first sample text corresponding to a first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text of the first sample object is determined from an action library for action prediction;
training a first initial model based on the correct action text, the first sample text and the action library to obtain a first target model;
acquiring a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; n candidate action texts are determined from an action library for performing action prediction; n is a positive integer;
training a second initial model based on the N candidate action texts and a second sample text to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the obtaining module is used for obtaining the text to be predicted associated with the virtual object;
the selection module is used for respectively acquiring the text matching degree between each of the M action texts in the action library and the text to be predicted, and selecting, based on the M text matching degrees, N action texts corresponding to the text to be predicted from the M action texts; M is a positive integer; N is a positive integer less than or equal to M;
the generating module is used for performing action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted; the predicted action text belongs to the N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
The action library comprises M action texts and action vectors corresponding to the M action texts respectively;
the selection module includes:
the coding processing unit is used for inputting the text to be predicted into the first target model, and coding the text to be predicted through the first target model to obtain a text coding vector corresponding to the text to be predicted;
the similarity calculation unit is used for performing vector similarity calculation between the text coding vector and each action vector, to obtain the text matching degrees between the M action texts and the text to be predicted;
the sorting processing unit is used for sorting the M text matching degrees in descending order to obtain a sorting result, and sequentially acquiring N text matching degrees from the sorting result;
and the text determining unit is used for determining the action texts corresponding to the acquired N text matching degrees as the N action texts corresponding to the text to be predicted.
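The coarse-ranking recall described above (encode the text, score it against every action vector, sort the matching degrees in descending order, keep the top N) can be sketched as follows. This is an illustrative reconstruction, not the patent's implementation; the function and variable names are hypothetical, and cosine similarity is assumed as the vector similarity measure.

```python
import math

def recall_top_n(text_vec, action_vecs, n):
    """Rank M action vectors by cosine similarity to the text coding
    vector and return (index, matching degree) pairs for the top N."""
    def cos(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return dot / (na * nb)
    # one text matching degree per action text in the library
    scored = [(i, cos(text_vec, v)) for i, v in enumerate(action_vecs)]
    scored.sort(key=lambda p: p[1], reverse=True)  # descending order
    return scored[:n]
```

The N returned indices identify the candidate action texts that would be handed to the fine-ranking stage.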
Wherein, this generation module includes:
the virtual object motion prediction device comprises an acquisition unit, a prediction instruction unit and a prediction unit, wherein the acquisition unit is used for acquiring a prediction instruction text for performing motion prediction on a virtual object;
the input unit is used for calling the second target model and inputting the predicted instruction text, the N action texts and the text to be predicted into the second target model;
and the generating unit is used for carrying out action prediction on the virtual object through the second target model and the N action texts, and generating a predicted action text corresponding to the text to be predicted.
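As a rough sketch of how the fine-ranking input might be assembled, the prediction instruction text, the N recalled action texts and the text to be predicted can be concatenated into a single input for the second target model. The layout below is purely an assumption for illustration; the patent does not fix a concrete prompt format.

```python
def build_prediction_prompt(instruction, candidate_actions, context):
    """Concatenate the prediction instruction text, the N candidate
    action texts and the text to be predicted into one model input."""
    lines = [instruction, "Candidate action texts:"]
    lines += [f"{i + 1}. {action}" for i, action in enumerate(candidate_actions)]
    lines += ["Text to be predicted:", context]
    return "\n".join(lines)
```

The second target model would then be expected to output one of the listed candidates as the predicted action text.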
The action library comprises actions corresponding to each of the M action texts; the text to be predicted is acquired based on an action prediction request sent by a service terminal device; the action prediction request further comprises audio data corresponding to the text to be predicted;
the apparatus further comprises:
the action acquisition module is used for acquiring actions corresponding to the predicted action text from the M actions;
the playing time length acquisition module is used for acquiring the playing time length of the audio data;
the data alignment module is used for carrying out data alignment on the action corresponding to the predicted action text and the audio data according to the playing time length to obtain alignment data;
the data return module is used for returning the alignment data to the service terminal device so that the service terminal device displays the alignment data; the alignment data is used for showing the virtual object executing the action corresponding to the predicted action text within the playing time length.
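Data alignment between the selected action and the audio can be approximated by resampling the action's frames to match the audio's playing time length. The frame-based scheme below is an assumption for illustration; the patent only states that the action and the audio are aligned by playing time length.

```python
def align_action_to_audio(action_frames, fps, audio_seconds):
    """Resample an action clip so that, played at `fps` frames per
    second, it lasts exactly as long as the audio (nearest-frame
    stretching or trimming)."""
    target = max(1, round(audio_seconds * fps))
    src = len(action_frames)
    # pick the nearest source frame for each target position
    return [action_frames[min(src - 1, i * src // target)]
            for i in range(target)]
```

For example, a 1-second clip at 10 fps aligned to 0.5 seconds of audio keeps every other frame, while aligning it to 2 seconds repeats each frame twice.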
An aspect of an embodiment of the present application provides a data processing apparatus, including:
the first sample acquisition module is used for acquiring a first sample text corresponding to the first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text of the first sample object is determined from an action library for action prediction;
the first training module is used for training the first initial model based on the correct action text, the first sample text and the action library to obtain a first target model;
the second sample acquisition module is used for acquiring a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; n candidate action texts are determined from an action library for performing action prediction; n is a positive integer;
the second training module is used for training the second initial model based on the N candidate action texts and the second sample text to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
Wherein the apparatus further comprises:
the data screening module is used for screening texts comprising dialogue information from the initial texts when the initial texts are acquired;
the data cleaning module is used for performing data cleaning on the screened texts to obtain M action texts; the M action texts include an action text Dᵢ; i is a positive integer less than or equal to M; M is a positive integer greater than or equal to N;
the action determination module is used for determining, from action assets historically accumulated for the business object, an action matching the action text Dᵢ;
and the action library construction module is used for constructing the action library based on the M action texts and the M actions when the actions matching each of the M action texts have been obtained.
Wherein, this first sample acquisition module includes:
a first acquisition unit configured to acquire a text associated with a first sample object from initial text used for constructing an action library;
the segmentation processing unit is used for carrying out segmentation processing on the text associated with the first sample object based on the sentence separator to obtain a first sample text corresponding to the first sample object;
the semantic analysis unit is used for performing semantic analysis on the first sample text and determining the correct action text of the first sample object from the action library;
the first construction unit is used for constructing the first sample prediction data corresponding to the first sample text based on the correct action text and the first sample text.
The action library comprises M action texts and action vectors corresponding to the M action texts respectively; m is a positive integer greater than or equal to N;
the first training module includes:
the sample vector determining unit is used for inputting the first sample text into the first initial model, and encoding the first sample text through the first initial model to obtain a sample text vector corresponding to the first sample text;
the action vector acquisition unit is used for acquiring, from the action library, the action vector corresponding to each of P action texts; P is a positive integer less than or equal to M; the P action texts include the correct action text;
the matching degree determining unit is used for performing vector similarity calculation between each of the P action vectors and the sample text vector, to obtain P text matching degrees;
the first training unit is used for training the first initial model based on the P text matching degrees and a first model convergence condition associated with the first initial model to obtain a first target model.
Wherein the P action texts comprise a false action text set consisting of (P-1) false action texts; the false action texts are the action texts among the P action texts other than the correct action text;
the first training unit includes:
the matching degree determining subunit is used for determining a first text matching degree corresponding to the correct action text and a second text matching degree corresponding to each false action text from the P text matching degrees;
a first loss determination subunit configured to determine a first model loss of the first initial model based on the first text matching degree and the second text matching degree;
the first training subunit is used for training the first initial model based on the first model loss to obtain a first model training result;
and the first model determining subunit is used for determining the first initial model meeting the first model convergence condition as the first target model if the first model training result indicates that the trained first initial model meets the first model convergence condition associated with the first initial model.
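One common way to realize a first model loss that rewards a high first text matching degree (correct action text) while penalizing the (P-1) second text matching degrees (false action texts) is an InfoNCE-style contrastive loss. The patent does not name a specific loss function, so the form below, including the temperature parameter `tau`, is an assumption.

```python
import math

def first_model_loss(pos_degree, neg_degrees, tau=0.07):
    """Contrastive loss over P matching degrees: the matching degree of
    the correct action text competes against the (P-1) false ones via a
    softmax; minimizing the loss pushes the correct one to the top."""
    logits = [pos_degree / tau] + [d / tau for d in neg_degrees]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - pos_degree / tau  # -log softmax(correct)
```

Training to a convergence condition would then amount to iterating gradient updates of the encoder until this loss stops improving.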
Wherein the first model determination subunit is further specifically configured to:
if the first model training result indicates that the trained first initial model meets the first model convergence condition associated with the first initial model, and the first sample prediction data comprises a negative example action text, determining the first initial model meeting the first model convergence condition as a model to be processed; the negative example action text is an action text in the action library whose text matching degree with the first sample text falls within a specified matching degree interval;
retraining the model to be processed based on the text matching degree between the negative example action text and the first sample text to obtain a second model training result;
if the second model training result indicates that the trained model to be processed meets the first model convergence condition, determining the model to be processed meeting the first model convergence condition as the first target model.
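The negative example mining step, i.e. selecting action texts whose matching degree with the first sample text falls within a matching degree interval, might look like the following sketch. The interval bounds are placeholder values; the patent does not specify them.

```python
def mine_negative_examples(matching_degrees, correct_index,
                           low=0.4, high=0.8):
    """Indices of action texts whose matching degree with the sample
    text lies inside [low, high], excluding the correct action text.
    These mid-range scores are the confusable 'hard' negatives used
    for retraining the model to be processed."""
    return [i for i, d in enumerate(matching_degrees)
            if i != correct_index and low <= d <= high]
```

Retraining on these hard negatives sharpens the model precisely where it confuses near-miss action texts with the correct one.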
Wherein the second sample acquisition module comprises:
a second acquisition unit configured to acquire a second sample text associated with a second sample object from the initial text used for constructing the action library;
the third acquisition unit is used for acquiring, from the action library, N candidate action texts matched with the second sample text;
an instruction acquisition unit configured to acquire a prediction instruction text for performing motion prediction on the second sample object;
and the second construction unit is used for constructing second sample prediction data corresponding to the second sample text based on the prediction instruction text, the second sample text and the N candidate action texts.
Wherein the second sample prediction data includes a prediction instruction text for performing motion prediction on the second sample object;
the second training module includes:
the sample input unit is used for calling the second initial model and inputting the predicted instruction text, the N candidate action texts and the second sample text into the second initial model;
the prediction text determining unit is used for determining, from the N candidate action texts through the second initial model, a sample predicted action text corresponding to the second sample object;
and the second training unit is used for training the second initial model based on the sample prediction action text and a second model convergence condition associated with the second initial model to obtain a second target model.
Wherein the sample predicted action text includes K characters; K is a positive integer;
the second training unit includes:
a probability acquisition subunit configured to acquire a character generation probability of each of the K characters;
a second model loss determination subunit configured to determine a second model loss of the second initial model based on the K character generation probabilities;
the second training subunit is used for training the second initial model based on the second model loss to obtain a third model training result;
and the second model determining subunit is used for determining the second initial model meeting the second model convergence condition as the second target model if the third model training result indicates that the trained second initial model meets a second model convergence condition associated with the second initial model.
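Since the second model is generative, a natural reading of "determine a second model loss based on the K character generation probabilities" is an average negative log-likelihood over the characters of the sample predicted action text, the standard autoregressive language-model loss. The patent does not state the exact loss, so this is a hedged sketch.

```python
import math

def second_model_loss(char_probs):
    """Average negative log-likelihood over the K character generation
    probabilities of the sample predicted action text. Lower is better;
    a model that generates every character with probability 1 has loss 0."""
    assert char_probs and all(0.0 < p <= 1.0 for p in char_probs)
    return -sum(math.log(p) for p in char_probs) / len(char_probs)
```

Minimizing this loss trains the second initial model to reproduce the correct candidate action text character by character.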
In one aspect, the application provides a computer device comprising: a processor, a memory, a network interface;
The processor is connected with the memory and the network interface, wherein the network interface is used for providing a data communication function, the memory is used for storing a computer program, and the processor is used for calling the computer program so as to enable the computer device to execute the method provided by the embodiment of the application.
In one aspect, the present application provides a computer readable storage medium storing a computer program adapted to be loaded and executed by a processor, so that a computer device having the processor performs the method provided by the embodiment of the present application.
In one aspect, embodiments of the present application provide a computer program product comprising a computer program stored on a computer readable storage medium; the processor of the computer device reads the computer program from the computer-readable storage medium, and the processor executes the computer program, so that the computer device performs the method in the embodiment of the present application.
In the embodiments of the application, when a computer device acquires the text to be predicted associated with a virtual object, it acquires the text matching degree between each of M action texts in an action library and the text to be predicted, and selects, based on the M text matching degrees, N action texts corresponding to the text to be predicted from the M action texts (M is a positive integer; N is a positive integer less than or equal to M). It then performs action prediction on the virtual object based on the N action texts and the text to be predicted, to generate a predicted action text corresponding to the text to be predicted. The predicted action text belongs to the N action texts and can be used to drive the virtual object to execute the corresponding action. In other words, when predicting the action of the virtual object, the computer device first recalls N action texts from the action library as candidates and then further screens the candidates recalled in this first stage, so that the action finally executed by the virtual object is predicted quickly and accurately. The whole process requires no manual participation: an adaptive predicted action text is generated for the virtual object automatically from its text to be predicted, which reduces prediction cost and improves prediction efficiency. In addition, because the predicted action text is an action text already in the action library, the final action executed by the virtual object is guaranteed not to be distorted or out of control, so safe and reliable action prediction is achieved and prediction accuracy is improved.
Drawings
In order to more clearly illustrate the embodiments of the application or the technical solutions in the prior art, the drawings that are required in the embodiments or the description of the prior art will be briefly described, it being obvious that the drawings in the following description are only some embodiments of the application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a scenario for action prediction of a virtual object according to an embodiment of the present application;
FIG. 3 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 4 is a schematic flow chart of a data processing method according to an embodiment of the present application;
FIG. 5 is a schematic view of a scenario in which text matching degrees are determined by a coarse-ranking model according to an embodiment of the present application;
FIG. 6 is a schematic view of a scenario of action prediction by a fine-ranking model according to an embodiment of the present application;
FIG. 7 is a schematic diagram of a framework for text action prediction based on two-stage training according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a data processing apparatus according to an embodiment of the present application;
fig. 10 is a schematic diagram of a computer device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present application, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.
It should be appreciated that embodiments of the present application provide an action prediction method based on two-stage training, which may be applied to the field of artificial intelligence. Artificial intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a comprehensive technique of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive discipline involving a wide range of fields, covering both hardware-level and software-level techniques. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics, and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, machine learning/deep learning, autonomous driving, intelligent transportation, and other directions.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science and mathematics. Research in this field involves natural language, i.e. the language people use daily, so it is closely related to the study of linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, knowledge graph techniques, and the like.
Among them, machine Learning (ML) is a multi-domain interdisciplinary, and involves multiple disciplines such as probability theory, statistics, approximation theory, convex analysis, algorithm complexity theory, etc. It is specially studied how a computer simulates or implements learning behavior of a human to acquire new knowledge or skills, and reorganizes existing knowledge structures to continuously improve own performance. Machine learning is the core of artificial intelligence, a fundamental approach to letting computers have intelligence, which is applied throughout various areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, confidence networks, reinforcement learning, transfer learning, induction learning, teaching learning, and the like.
Referring to fig. 1, fig. 1 is a schematic structural diagram of a network architecture according to an embodiment of the present application. As shown in fig. 1, the network architecture may include a server 10F and a terminal device cluster. The terminal device cluster may comprise one or more terminal devices; as shown in fig. 1, it may specifically include terminal devices 100a, 100b, 100c, …, and 100n. The terminal devices 100a, 100b, 100c, …, 100n may each establish a network connection with the server 10F, so that each terminal device can perform data interaction with the server 10F through that connection. The manner of network connection is not limited: the connection may be direct or indirect, over wired or wireless communication, or established in other ways, which is not limited herein.
Wherein each terminal device in the terminal device cluster may include: smart terminals with data processing functions such as smart phones, tablet computers, notebook computers, desktop computers, smart speakers, smart watches, vehicle-mounted terminals, smart televisions and the like. It should be understood that each terminal device in the cluster of terminal devices shown in fig. 1 may be provided with an application client, which may interact with the server 10F shown in fig. 1, respectively, when the application client is running in each terminal device. The application clients may include, among other things, social clients, multimedia clients (e.g., video clients), entertainment clients (e.g., game clients), information flow clients, educational clients, live clients, and the like. The application client may be an independent client, or may be an embedded sub-client integrated in a client (for example, a social client, an educational client, and a multimedia client), which is not limited herein.
As shown in fig. 1, the server 10F in the embodiment of the present application may be a server corresponding to the application client. The server 10F may be an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, or a cloud server providing cloud computing services. The embodiment of the application does not limit the number of terminal equipment and servers.
For ease of understanding, the embodiment of the present application may select one terminal device from the plurality of terminal devices shown in fig. 1 as the service terminal device. For example, the embodiment of the present application may use the terminal device 100b shown in fig. 1 as the service terminal device, in which an application client may be integrated. At this time, the service terminal device may implement data interaction between the service data platform corresponding to the application client and the server 10F. The application client may run a first target model and a second target model that have already been trained.
The first target model may be a trained language pre-training model (also called the coarse-rank model), that is, a language pre-training model used in the first stage (the coarse-ranking stage) to produce vector characterizations of the text to be predicted; for example, the language pre-training model may be a roberta-large model, an SBert model, or the like. The second target model may be a trained language model (also called the fine-rank model), that is, a language model used in the second stage to further filter the N action texts recalled in the first stage; for example, a large language model (Large Language Model, abbreviated as LLM) may be used as the base model. N is a positive integer.
In the embodiment of the present application, the computer device having the action prediction function may be a server, or may be any one of the terminal devices in the terminal device cluster shown in fig. 1, for example, the terminal device 100a; the specific form of the computer device is not limited herein. The computer device can also properly utilize the action assets historically accumulated by the business object (for example, the business party) to perform reliable and controllable action prediction for the virtual object; that is, an action library for action prediction can be constructed in advance based on the historically accumulated action assets of the business object and the initial text. The action library may include M action texts, the actions corresponding to the M action texts, and the action vectors corresponding to the M action texts, where M is a positive integer.
The initial text here mainly originates from novels, scripts, and the like that have already been made public. When the data capturing scheme for the initial text is implemented and the embodiment of the present application is applied to specific products or technologies, the related data collection, use, and processing should comply with the requirements of national laws and regulations, conform to the principles of legality, legitimacy, and necessity, not involve acquiring data types forbidden or restricted by laws and regulations, and not hinder the normal operation of the target website.
It should be appreciated that, when performing action prediction for a virtual object (e.g., a virtual person, a virtual dog, etc.), the computer device may obtain the text to be predicted associated with the virtual object (e.g., a dialog context associated with the virtual object), and then adaptively generate a more accurate predicted action text for the virtual object based on that text, thereby reducing prediction cost and improving prediction efficiency.
For example, the computer device may obtain a text matching degree between the text to be predicted and each of the M action texts in the action library, select N action texts from the M action texts based on the M text matching degrees, and use them as the N action texts corresponding to the text to be predicted, where N is a positive integer less than or equal to M. Further, the computer device may perform action prediction on the virtual object based on the N action texts and the text to be predicted, so as to generate a predicted action text corresponding to the virtual object. Because the predicted action text belongs to the N action texts, i.e. to the action texts in the action library, the action finally executed by the virtual object is effectively guaranteed to be faithful and under control, so that safe and reliable action prediction is realized and prediction accuracy is improved.
The action prediction method provided by the embodiment of the present application can predict the corresponding predicted action text for a virtual object more quickly and accurately, so as to drive the virtual object to execute the action corresponding to the predicted action text. This makes the performance of the virtual object more realistic and more anthropomorphic, and effectively expands its application scenarios, which may include virtual-person performance, virtual anchoring, virtual commentary, and the like; the method therefore has broad business value and further improves the user experience.
For ease of understanding, further, please refer to fig. 2, fig. 2 is a schematic diagram of a scenario for performing motion prediction on a virtual object according to an embodiment of the present application. As shown in fig. 2, the terminal device 20Z corresponding to the service object (i.e., the service terminal device) may be a terminal device running an application client (e.g., a video client), and the service terminal device may be any one of the terminal device clusters shown in fig. 1, for example, the terminal device 100a. The server 20F may be a computer device having a motion prediction function, and the server 20F may be the server 10F shown in fig. 1.
The server 20F may screen and clean initial texts such as novels and scripts to collect all the action texts, then randomly select M action texts for the purpose of saving resource overhead, and select the actions respectively matching the M action texts based on the action assets historically accumulated by the business party, so as to construct the action library shown in fig. 2 for convenient use by the subsequent coarse-rank model and fine-rank model. The storage format of the action library may be the json format. In other words, the action library may include M action texts and the actions corresponding to the M action texts, where M is a positive integer; for ease of understanding, M may be exemplified as 500.
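As a concrete illustration of the json storage format mentioned above, the following minimal sketch shows what action-library entries might look like; the field names ("action_text", "action", "action_vector") and the vector values are hypothetical assumptions for illustration, not the actual schema used by the embodiment.

```python
import json

# Hypothetical action-library entries in a json storage format; every field
# name and value here is an illustrative assumption, not the patent's schema.
action_library = [
    {
        "action_text": "turns head to look out of the window",
        "action": "turn",
        # offline-computed action vector, truncated to 4 dimensions for brevity
        "action_vector": [0.12, -0.53, 0.08, 0.91],
    },
    {
        "action_text": "lets out a cold laugh",
        "action": "laugh",
        "action_vector": [0.33, 0.10, -0.72, 0.45],
    },
]

# Round-trip through json, as the library would be persisted and reloaded.
serialized = json.dumps(action_library, ensure_ascii=False, indent=2)
restored = json.loads(serialized)
print(len(restored), restored[0]["action"])
```

Storing the action vectors alongside the texts is what allows the coarse-rank stage to skip re-encoding the library at inference time.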
It should be understood that, when the business object shown in fig. 2 performs an action prediction operation with respect to the virtual object (e.g., the virtual object 20D shown in fig. 2), i.e., a trigger operation for action prediction of the virtual object, the terminal device 20Z may, in response to the action prediction operation, acquire the business text associated with the virtual object 20D and the original audio data corresponding to that business text. The original audio data may be obtained by the computer device by passing the business text through a text-to-speech (TTS) interface of the application client. The trigger operation may include non-contact operations such as voice and gestures, and may also include touch operations such as clicks and long presses, which are not limited herein. The business text may include one or more sub-texts; taking 2 as an example, if the business text includes the sub-text 21X and the sub-text 22X, the original audio data may include the audio data corresponding to the sub-text 21X and the audio data corresponding to the sub-text 22X.
Further, the terminal device 20Z may generate an action prediction request 2q based on the business text and its corresponding original audio data, and send the action prediction request 2q to the server 20F, so that the server 20F, taking the sub-text 21X and the sub-text 22X included in the action prediction request 2q respectively as texts to be predicted, automatically generates the predicted action text corresponding to each text to be predicted. When performing action prediction for the virtual object 20D, the server 20F may specifically involve two stages (i.e., a coarse-ranking stage and a fine-ranking stage). The model 210W (i.e., the first target model) involved in the coarse-ranking stage may be used to preliminarily screen N (e.g., 10) action texts from the M action texts; the model 220W (i.e., the second target model) involved in the fine-ranking stage may be used to more accurately screen, from the N action texts, the predicted action text corresponding to the text to be predicted.
For example, when the server 20F determines the sub-text 21X as the text to be predicted, in the coarse-ranking stage, the server 20F may obtain, through the model 210W, the text matching degrees between the M action texts in the action library and the sub-text 21X, respectively, to obtain M text matching degrees, and further select N action texts from the M action texts as candidate texts corresponding to the sub-text 21X based on the M text matching degrees. Then, in the fine-ranking stage, the server 20F may input the N action texts and the sub-text 21X into the model 220W, and perform action prediction on the virtual object 20D through the model 220W to generate an action text 21A (i.e., a predicted action text) corresponding to the sub-text 21X. At this time, the server 20F may acquire the action 1 (for example, "turn") corresponding to the action text 21A from the M actions included in the action library, acquire the play duration of the audio data corresponding to the sub-text 21X from the original audio data in the action prediction request 2q, and perform data alignment on the audio data corresponding to the sub-text 21X according to the play duration, to obtain first alignment data.
For another example, when the server 20F determines the sub-text 22X as the text to be predicted, the server 20F may, in the coarse-ranking stage, obtain through the model 210W the text matching degrees between the M action texts in the action library and the sub-text 22X, respectively, to obtain M text matching degrees, and further select N action texts from the M action texts as candidate texts corresponding to the sub-text 22X based on the M text matching degrees. Then, in the fine-ranking stage, the server 20F may input the N action texts and the sub-text 22X into the model 220W, and perform action prediction on the virtual object 20D through the model 220W to generate an action text 22A (i.e., a predicted action text) corresponding to the sub-text 22X. At this time, the server 20F may acquire the action 2 (for example, "in eight hands") corresponding to the action text 22A from the M actions included in the action library, acquire the play duration of the audio data corresponding to the sub-text 22X from the original audio data in the action prediction request 2q, and perform data alignment on the audio data corresponding to the sub-text 22X according to the play duration, to obtain second alignment data.
Then, the server 20F may perform a packing process on the alignment data (the first alignment data and the second alignment data) corresponding to each text to be predicted, to obtain alignment data (for example, the alignment data 2s shown in fig. 2) corresponding to the virtual object 20D, and then may return the alignment data 2s to the terminal device 20Z, so that the terminal device 20Z displays the alignment data 2s on the terminal interface. The progress bar 20T in the terminal interface may be used to indicate a playing progress corresponding to the aligned data 2s, and when the service object performs a triggering operation on a service control (for example, a preview control) in the terminal interface, the terminal device 20Z may display, through the aligned data 2s, that the virtual object 20D performs an action corresponding to a predicted action text of the virtual object within a playing duration of audio data corresponding to each text to be predicted.
In other words, in the application client, once the business text (for example, dialogue text) and the voice are obtained, the terminal device 20Z can generate the predicted action text corresponding to the virtual object 20D more quickly and accurately through the two-stage models 210W and 220W, and instruct the virtual object 20D to give feedback, that is, drive the virtual object to perform the action corresponding to the predicted action text. The whole process requires no manual involvement, which not only reduces the prediction cost but also improves prediction efficiency and accuracy. In addition, the embodiment of the present application performs a preliminary screening through the model 210W to recall N action texts from the action library, which effectively prevents the model 220W from being limited by the length of the input text, so that the capability of the large language model can be exerted to the maximum extent.
It should be noted that, the interfaces and controls illustrated in fig. 2 are only some representations that can be referred to, and in an actual business scenario, a developer may perform related design according to product requirements, so that the embodiments of the present application do not limit the specific forms of the interfaces and controls involved.
The specific implementation in which the computer device, through the first target model (i.e., the coarse-rank model) and the second target model (i.e., the fine-rank model), adaptively, quickly, and accurately performs action prediction for the virtual object according to the text to be predicted associated with the virtual object and the action library can be seen in the embodiments corresponding to fig. 3-7 described below.
Further, referring to fig. 3, fig. 3 is a flow chart of a data processing method according to an embodiment of the application. As shown in fig. 3, the method may be performed by a computer device having an action prediction function, and the computer device may be a terminal device (for example, any one of the terminal devices in the terminal device cluster shown in fig. 1, for example, the terminal device 100a having a model application function) or a server (for example, the server 10F shown in fig. 1), which is not limited herein. For easy understanding, the embodiment of the present application is described by taking the method performed by a server having a motion prediction function as an example, and the method may at least include the following steps S101 to S103:
Step S101, obtaining a text to be predicted associated with the virtual object.
Specifically, the computer device may receive an action prediction request sent by a service terminal device corresponding to a service object, and may further obtain a service text associated with the virtual object based on the action prediction request, split the service text to obtain one or more sub-texts, and may further determine each sub-text as a text to be predicted associated with the virtual object. Wherein, the action prediction request is generated by the service terminal equipment based on the service text associated with the virtual object and the original audio data corresponding to the service text when responding to the action prediction operation of the service object for the virtual object.
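The splitting of the business text into one or more sub-texts described in step S101 can be sketched as follows; the delimiter set (sentence-ending punctuation, including CJK forms) and the function name are assumptions, since the embodiment does not fix a concrete delimiter.

```python
import re

def split_service_text(service_text: str) -> list[str]:
    """Split a business text into sub-texts, each later treated as one text
    to be predicted. The delimiter set here is an illustrative assumption."""
    # Split *after* each sentence-ending punctuation mark, keeping the mark.
    parts = re.split(r"(?<=[.!?。！？])\s*", service_text)
    return [p.strip() for p in parts if p.strip()]

subtexts = split_service_text("You know nothing about her. She likes you!")
print(subtexts)  # → ['You know nothing about her.', 'She likes you!']
```

Each returned sub-text would then flow through steps S102 and S103 independently, as the sub-texts 21X and 22X do in the scene of fig. 2.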
Step S102, text matching degrees between M action texts in the action library and the text to be predicted are respectively obtained, and N action texts corresponding to the text to be predicted are selected from the M action texts based on the M text matching degrees.
The action library includes M action texts, the actions corresponding to the M action texts, and the action vectors respectively corresponding to the M action texts; M is a positive integer. The action vectors (i.e., action characterizations) may be computed offline and stored by the computer device to speed up subsequent inference. Specifically, the computer device may input the text to be predicted into the first target model (for example, the model 210W shown in fig. 2 above), encode the text to be predicted through the first target model to obtain a text encoding vector corresponding to the text to be predicted, and then perform vector similarity calculation between the text encoding vector and each action vector, so as to obtain the text matching degrees between the M action texts and the text to be predicted. Further, the computer device may perform ranking processing (e.g., descending or ascending ranking) on the M text matching degrees to obtain a ranking result, sequentially obtain N text matching degrees from the ranking result, and determine the action texts respectively corresponding to the obtained N text matching degrees as the N action texts corresponding to the text to be predicted, where N is a positive integer less than or equal to M.
It can be understood that, if the action library includes 500 action texts and the action vectors corresponding to the 500 action texts, then after obtaining the text encoding vector corresponding to the text to be predicted, the computer device may calculate, through the first target model, the vector similarity between the text encoding vector and each of the 500 action vectors by means of cosine similarity, Manhattan/Euclidean distance, or other methods, until 500 text matching degrees are obtained. The computer device may then initially screen, from the 500 action texts of the action library, N action texts with higher text matching degrees (e.g., the action texts whose text matching degrees belong to the top 10) based on the 500 text matching degrees.
For example, the computer device may perform descending ranking on the 500 text matching degrees, obtain the top 10 text matching degrees from the obtained ranking result (i.e., the descending ranking result), obtain from the action library the action texts respectively corresponding to those 10 text matching degrees, and then use them as the 10 action texts corresponding to the text to be predicted. Optionally, the computer device may instead perform ascending ranking on the 500 text matching degrees, obtain the last 10 text matching degrees from the obtained ranking result (i.e., the ascending ranking result), obtain from the action library the action texts respectively corresponding to those 10 text matching degrees, and likewise use them as the 10 action texts corresponding to the text to be predicted.
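The coarse-ranking selection described above (vector similarity followed by ranking and top-N truncation) can be sketched as follows, using cosine similarity and toy two-dimensional vectors in place of the real roberta-large encodings; the function names are assumptions.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors, used as the text matching degree."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def coarse_rank(query_vec, action_vectors, n):
    """Score the query vector against every stored action vector, rank the
    M matching degrees in descending order, and return the indices of the
    top-n action texts (the candidates passed on to the fine-rank stage)."""
    scores = [cosine(query_vec, v) for v in action_vectors]
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return ranked[:n]

# Toy stand-ins for the text encoding vector and the offline action vectors.
q = [1.0, 0.0]
lib = [[1.0, 0.1], [0.0, 1.0], [0.9, 0.2], [-1.0, 0.0]]
print(coarse_rank(q, lib, 2))  # → [0, 2]
```

Because the action vectors are precomputed offline, only the single query encoding and M dot products are needed at inference time.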
And step S103, performing action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted.
In particular, the computer device may obtain prediction instruction text for performing motion prediction on the virtual object. For example, the predictive instruction text may be "select a most reasonable action option for a speaker dialogue, without giving an explanation". Then, the computer device may call a second target model (for example, the model 220W shown in fig. 2 above), input the predicted instruction text, the N action texts, and the text to be predicted into the second target model, and further may perform action prediction on the virtual object through the second target model and the N action texts, to generate a predicted action text corresponding to the text to be predicted. The predicted action text belongs to N action texts, and the predicted action text is used for driving the virtual object to execute actions corresponding to the predicted action text.
For example, if the text to be predicted here is "Player: character a (e.g., the virtual object), you know nothing about … … her … … likes you. Character a: I know", then, because of the large number of action texts in the action library, the computer device can quickly recall N action texts from the action library as action candidates through the first target model. Taking N as 10 for example, the candidates may specifically include action text 1 (e.g., "turns head to look out of the window"), action text 2 (e.g., "one sound coming"), action text 3 (e.g., "cold talk to see"), action text 4 (e.g., "one face attached to see"), action text 5 (e.g., "one face calm talk to see"), action text 6 (e.g., "open to see"), action text 7 (e.g., "one face calm"), action text 8 (e.g., "cold talk to see far away player"), action text 9 (e.g., "laugh to see a mole of light that flashes in the eye"), and action text 10 (e.g., "sound is in a dumb way").
Then, the computer device may splice the text to be predicted with the 10 action texts, input the spliced text into the second target model, and perform action prediction on the virtual object through the second target model to obtain an output text. For example, the output text of the second target model may be "The most reasonable action is: turns head to look out of the window". This means that the computer device can accurately generate the predicted action text corresponding to the text to be predicted through the second target model, and this action prediction manner effectively prevents the problem of an overlong input text, so that the capability of the second target model can be exerted to the maximum extent.
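A minimal sketch of how the prediction instruction text, the N recalled action texts, and the text to be predicted might be spliced into a single input for the second target model; the exact prompt template, option numbering, and function name are assumptions, since the embodiment only states that the parts are spliced together.

```python
def build_fine_rank_prompt(instruction, candidates, text_to_predict):
    """Concatenate the prediction instruction, the text to be predicted, and
    the N recalled action texts (as numbered options) into one LLM input.
    The template below is an illustrative assumption."""
    options = "\n".join(f"{i + 1}. {c}" for i, c in enumerate(candidates))
    return f"{instruction}\n\nDialogue:\n{text_to_predict}\n\nOptions:\n{options}"

prompt = build_fine_rank_prompt(
    "Select the most reasonable action option for the speaker's dialogue, "
    "without giving an explanation.",
    ["turns head to look out of the window", "lets out a cold laugh"],
    "Player: ... she likes you. Character a: I know.",
)
print(prompt)
```

Constraining the model to choose among only the N recalled options is what keeps the final action inside the action library and the prompt within the model's input-length limit.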
After the computer device accurately predicts the predicted action text corresponding to the text to be predicted through the first target model and the second target model, the computer device needs to return the response data (i.e., alignment data) corresponding to the action prediction request to the service terminal device corresponding to the business object. For example, the computer device needs to obtain the action corresponding to the predicted action text from the M actions included in the action library, and then obtain the play duration of the audio data corresponding to the text to be predicted from the action prediction request. The computer device can then perform data alignment on the action corresponding to the predicted action text and the audio data according to the play duration to obtain alignment data, and return the alignment data to the service terminal device so that the service terminal device can display it. The alignment data may be used to show the virtual object executing the action corresponding to the predicted action text within the play duration.
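The alignment of each predicted action with the play duration of its audio data, and the packing of the per-sub-text results into one set of alignment data, might be sketched as follows; the record layout and the millisecond fields are illustrative assumptions, not the embodiment's actual data format.

```python
def align_action_with_audio(action, audio_duration_ms):
    """Pair one predicted action with its audio clip's play duration so the
    client can render the action over exactly that window (hypothetical layout)."""
    return {"action": action, "start_ms": 0, "end_ms": audio_duration_ms}

def pack_alignment_data(per_subtext):
    """Package per-sub-text alignment records, offsetting each segment so the
    segments play back to back, as with the first and second alignment data."""
    packed, offset = [], 0
    for action, duration_ms in per_subtext:
        record = align_action_with_audio(action, duration_ms)
        record["start_ms"] += offset
        record["end_ms"] += offset
        packed.append(record)
        offset += duration_ms
    return packed

# e.g. sub-text 21X -> action "turn" over 1.5 s, sub-text 22X -> 2.0 s
data = pack_alignment_data([("turn", 1500), ("spread hands", 2000)])
print(data[1])  # → {'action': 'spread hands', 'start_ms': 1500, 'end_ms': 3500}
```

The cumulative offset mirrors the progress bar behavior in fig. 2: each action plays exactly within the play duration of its own audio segment.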
In the embodiment of the present application, the computer device can use artificial intelligence technology to perform text action prediction, which reduces labeling cost and frees human resources: when performing action prediction for a virtual object, N action texts can first be recalled from the action library as candidate actions, and the candidates recalled in the previous stage are then further screened, so that the action finally executed by the virtual object can be predicted quickly and accurately. The whole process requires no manual participation, and the predicted action text can be adaptively generated for the virtual object according to the text to be predicted, which reduces the prediction cost and improves prediction efficiency. In addition, the predicted action text is an action text in the action library; in other words, the action corresponding to the predicted action text is reliable and controllable, and no unexpected action will be generated for the virtual object, so the final prediction result is stable and reliable, which improves prediction accuracy.
Further, referring to fig. 4, fig. 4 is a flow chart of a data processing method according to an embodiment of the present application. The method may be executed by a terminal device having a data processing function (for example, any one of the terminal devices in the terminal device cluster shown in fig. 1, such as the terminal device 100a), may be executed by a server having a data processing function (for example, the server 10F shown in fig. 1), or may be executed interactively by a terminal device having a model application function and a server having a model training function, which is not limited herein. The method may include at least the following steps S201-S207:
In step S201, a text to be predicted associated with the virtual object is acquired.
Step S202, respectively obtaining text matching degrees between M action texts in the action library and the text to be predicted, and selecting N action texts corresponding to the text to be predicted from the M action texts based on the M text matching degrees.
Wherein M is a positive integer; n is a positive integer less than or equal to M.
Step S203, based on the N action texts and the text to be predicted, performing action prediction on the virtual object, and generating a predicted action text corresponding to the text to be predicted.
The predicted action text belongs to N action texts, and the predicted action text can be used for driving the virtual object to execute actions corresponding to the predicted action text.
By adopting the action prediction scheme provided by the embodiment of the present application, safe and reliable action prediction can be realized; when the scheme is deployed, action feedback can be made according to the interaction between the business object and the system, so that the user experiences more realistic and timely interaction while human resources are freed.
The data processing method in the embodiment of the application can comprise a model training process and a model application process. It can be understood that the steps S201 to S203 illustrate a model application process, and the detailed implementation of the model application process can be referred to the description of the steps S101 to S103 in the embodiment corresponding to fig. 3, which will not be repeated here.
The model training process may specifically include a coarse-rank model training process and a fine-rank model training process. The coarse-rank model training process is specifically described in the following steps S204-S205; the fine-rank model training process is specifically described in the following steps S206-S207.
In step S204, a first sample text corresponding to the first sample object and first sample prediction data corresponding to the first sample text are obtained.
Wherein the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text, and the correct action text of the first sample object is determined from the action library used for action prediction. Specifically, the computer device may obtain the text associated with the first sample object from the initial text used to construct the action library, and then perform segmentation processing on that text based on sentence delimiters (e.g., [SEP]) to obtain the first sample text corresponding to the first sample object. The computer device may then perform semantic analysis on the first sample text to determine the correct action text of the first sample object from the action library. Further, the computer device may construct the first sample prediction data corresponding to the first sample text based on the correct action text and the first sample text.
For example, the computer device may directly perform a stitching process on the correct action text and the first sample text, and determine the text after the stitching process as first sample prediction data corresponding to the first sample text, so as to train the first initial model in stages.
For example, research shows that negative examples whose semantic similarity is close to that of the positive example often obtain both a larger gradient mean and a smaller gradient variance. Therefore, when training the first initial model in stages, such negative examples can additionally be selected as difficult negatives, and the coarse-rank model can be trained again to further improve its effect. That is, the computer device may further determine, in the action library, an action text whose text matching degree with the first sample text is within a matching degree interval (for example, between 0.6 and 0.7) as a negative-example action text (i.e., a difficult negative), and may then construct the first sample prediction data corresponding to the first sample text directly based on the first sample text, the correct action text, and the negative-example action text.
For ease of understanding, further, please refer to table 1, table 1 is a schematic table of sample prediction data used in the coarse-rank stage according to an embodiment of the present application. Specifically as shown in table 1:
TABLE 1
The first sample prediction data corresponding to the object a (i.e., the first sample object) shown in table 1 is stored in the json format. Here, "query" may be used to represent the first sample text (e.g., a dialog context) of the object a, "positive" may be used to represent the correct action text screened from the action library based on the first sample text, and "hard negative" may be the negative-example action text (i.e., the difficult negative) screened from the action library by the computer device, that is, the action text in the action library whose determined text matching degree with the first sample text is within the matching degree interval.
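The construction of one first-sample-prediction record in the query/positive/hard-negative form of table 1 can be sketched as follows; the function name, the flat score mapping, and the field names are assumptions made for illustration.

```python
def build_coarse_train_sample(query, positive, library_scores, lo=0.6, hi=0.7):
    """Build one training record for the coarse-rank model. library_scores
    maps each action text to its matching degree with the query; texts whose
    score falls within the matching degree interval [lo, hi] become difficult
    negatives, mirroring the hard-negative selection described above."""
    hard_negatives = [t for t, s in library_scores.items() if lo <= s <= hi]
    return {"query": query, "positive": [positive], "hard_negative": hard_negatives}

sample = build_coarse_train_sample(
    query="Coming to the side of object b, object a smiles ... [SEP] ...",
    positive="smiles sweetly, eyes sparkling",
    library_scores={
        "smiles sweetly, eyes sparkling": 0.95,  # the correct action text
        "lets out a cold laugh": 0.65,           # within [0.6, 0.7] -> hard negative
        "turns away silently": 0.62,             # within [0.6, 0.7] -> hard negative
        "shouts angrily": 0.20,                  # too dissimilar to be difficult
    },
)
print(sorted(sample["hard_negative"]))
```

Restricting negatives to the middle similarity band keeps them hard enough to carry a useful gradient signal while excluding near-duplicates of the positive.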
Step S205, training the first initial model based on the correct action text, the first sample text and the action library to obtain a first target model.
Specifically, the computer device may input the first sample text into the first initial model, encode the first sample text through the first initial model to obtain a sample text vector (i.e., a first sample vector) corresponding to the first sample text, and obtain, from the action library, the action vectors corresponding to each of P action texts, where P is a positive integer less than or equal to M and the P action texts may include the correct action text. Then, the computer device may perform vector similarity calculation between each of the P action vectors and the sample text vector to obtain P text matching degrees, and train the first initial model based on the P text matching degrees and a first model convergence condition associated with the first initial model to obtain the first target model.
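The embodiment does not specify the exact training objective; one common concretization for training such a dual-encoder recall model is a softmax cross-entropy over the P text matching degrees with the correct action text as the target class, sketched below (the temperature value and function name are assumptions).

```python
import math

def coarse_rank_loss(matching_degrees, positive_index, temperature=0.05):
    """Softmax cross-entropy over the P text matching degrees, treating the
    correct action text as the target. This contrastive objective is an
    assumed concretization of the training step, not the patent's stated loss."""
    logits = [m / temperature for m in matching_degrees]
    # log-sum-exp computed stably
    mx = max(logits)
    log_z = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_z - logits[positive_index]

# When the positive scores well above the negatives, the loss is near zero;
# labeling a poorly-scored candidate as positive yields a much larger loss.
low = coarse_rank_loss([0.9, 0.2, 0.1, -0.3], positive_index=0)
high = coarse_rank_loss([0.9, 0.2, 0.1, -0.3], positive_index=3)
print(low < high)  # → True
```

Minimizing this loss pushes the sample text vector toward the correct action vector and away from the negatives, which is exactly the behavior the coarse-rank recall stage needs.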
For ease of understanding, further, please refer to fig. 5, which is a schematic diagram of a scenario in which text matching degrees are determined through the coarse-rank model according to an embodiment of the present application. As shown in fig. 5, the model 500W may be a language pre-training model (e.g., a roberta-large model) used for vector characterization in the coarse-ranking stage. The action library shown in fig. 5 may include M action texts and the action vectors corresponding to the M action texts, where M may be a positive integer.
The sample text 5X may be the first sample text corresponding to the first sample object obtained by the computer device from the initial text. For example, as shown in table 1, the sample text 5X may be "Coming to the side of object b, object a breaks into a charming smile, like a hundred flowers in bloom, bright eyes sparkling like a crescent moon on a frosty night. [SEP] Object a, object b, you can count; this is our third collective activity recently, visiting the cultural relics exhibition, how rare! [SEP]".
It should be appreciated that the computer device may obtain, from the action library shown in fig. 5, the action vector corresponding to each of the P action texts, which may specifically include the action vector E_1 corresponding to action text 1, the action vector E_2 corresponding to action text 2, ..., and the action vector E_P corresponding to action text P, where P is a positive integer less than or equal to M. The P action vectors may include the action vector of each false action text of the sample text 5X and the action vector corresponding to the correct action text (e.g., action text 1) of the sample text 5X. The false action texts here are a fixed number of action texts (e.g., 64) selected at random by the computer device from the action library; for example, the false action texts may specifically include action text 2, ..., and action text P.
It should be appreciated that the computer device may input the sample text 5X into the model 500W and encode the sample text 5X through the model 500W to obtain the sample vector 5E. Then, the computer device may take each of the P action vectors in turn as a vector to be matched and perform vector similarity calculation between the sample vector 5E and the vector to be matched until P text matching degrees are obtained.
For example, when taking the action vector E_1 as the vector to be matched, the computer device may perform vector similarity calculation between the sample vector 5E and the action vector E_1, and take the resulting similarity score as the text matching degree G_1 (i.e., the text matching degree between the sample text 5X and action text 1). By analogy, the computer device may also derive the text matching degree G_2 (i.e., the text matching degree between the sample text 5X and action text 2), ..., and the text matching degree G_P (i.e., the text matching degree between the sample text 5X and action text P).
The computer device may then train the first initial model based on the P text matching degrees and the first model convergence condition associated with the first initial model, such that the text matching degree between the first sample text and the correct action text becomes increasingly higher while the text matching degrees between the first sample text and the false action texts become increasingly lower. The first model convergence condition here may be that model training is stopped when the model loss of the first initial model (i.e., the first model loss) fails to decrease for n consecutive rounds (e.g., 10 rounds). Optionally, the first model convergence condition may instead be that model training is stopped when the first model loss of the first initial model is smaller than a loss threshold in the first model convergence condition. This will not be limited here.
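For illustration only, the convergence condition described above (stop when the loss has failed to decrease for n consecutive rounds, or, optionally, when it falls below a threshold) can be sketched as follows; the class and attribute names are illustrative and not part of the embodiment:

```python
class EarlyStopping:
    """Stop training when the loss has not decreased for `patience` rounds,
    or (optional variant) once it drops below a fixed threshold."""

    def __init__(self, patience=10, loss_threshold=None):
        self.patience = patience
        self.loss_threshold = loss_threshold  # optional absolute threshold
        self.best_loss = float("inf")
        self.rounds_without_improvement = 0

    def should_stop(self, current_loss):
        # Optional variant: stop once the loss is below the threshold.
        if self.loss_threshold is not None and current_loss < self.loss_threshold:
            return True
        if current_loss < self.best_loss:
            self.best_loss = current_loss
            self.rounds_without_improvement = 0
        else:
            self.rounds_without_improvement += 1
        return self.rounds_without_improvement >= self.patience
```

The trainer would call `should_stop` once per round with the current first model loss and halt training when it returns true.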
Specifically, the model loss function of the first initial model can be expressed as the following formula (1):

L = -log( e^{cos(q, p⁺)} / ( e^{cos(q, p⁺)} + Σ_{i=1}^{n} e^{cos(q, p_i⁻)} ) )    (1)

where q is used to represent the first sample text and p⁺ is used to represent the correct action text corresponding to the first sample text; cos(q, p⁺) is used to represent the text matching degree (i.e., the similarity score) between the first sample text and the correct action text; cos(q, p_i⁻) is used to represent the text matching degree (i.e., the similarity score) between the first sample text and the i-th false action text; n is used to represent the number of false action texts (e.g., 64); and e is the natural exponent.
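As an illustrative sketch, the contrastive objective of formula (1) — maximizing the exponentiated similarity of the correct action text against the false action texts — can be written in plain Python as follows (a real implementation would use a deep-learning framework with batched tensors; the function names here are not from the embodiment):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def contrastive_loss(q, p_pos, p_negs):
    """Formula (1): negative log of the softmax weight of the positive.

    q      -- vector of the first sample text
    p_pos  -- vector of the correct action text
    p_negs -- vectors of the n false action texts
    """
    pos = math.exp(cosine(q, p_pos))
    neg = sum(math.exp(cosine(q, p)) for p in p_negs)
    return -math.log(pos / (pos + neg))
```

Training drives this loss down, which is exactly the behavior described above: the matching degree with the correct action text rises while the matching degrees with the false action texts fall.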
When training the first initial model based on the P text matching degrees and the first model convergence condition, the computer device may determine, from the P text matching degrees, the first text matching degree corresponding to the correct action text and the second text matching degree corresponding to each false action text. Then, the computer device may determine the first model loss of the first initial model based on the above formula (1), the first text matching degree, and the second text matching degrees, and may train the first initial model based on the first model loss to obtain a first model training result. If the first model training result indicates that the trained first initial model meets the first model convergence condition associated with the first initial model, the computer device may determine the first target model based on the first initial model that meets the first model convergence condition.
It may be appreciated that, if the first model training result indicates that the trained first initial model meets the first model convergence condition and the first sample prediction data does not include the negative example action text, the computer device may directly determine the first initial model that meets the first model convergence condition as the first target model. Optionally, if the first model training result indicates that the trained first initial model does not meet the first model convergence condition, the computer device may adjust the model parameters of the first initial model based on the first model loss, and then train the adjusted first initial model until the trained adjusted first initial model meets the first model convergence condition, at which point the trained adjusted first initial model is used as the first target model.
For another example, if the first model training result indicates that the trained first initial model satisfies the first model convergence condition associated with the first initial model and the first sample prediction data includes the negative example action text, the computer device may determine the first initial model that satisfies the first model convergence condition as a model to be processed. Then, the computer device can retrain the model to be processed based on formula (1) and the text matching degree between the negative example action text and the first sample text to obtain a second model training result. If the second model training result indicates that the trained model to be processed meets the first model convergence condition, the computer device can determine the model to be processed that meets the first model convergence condition as the first target model.
Step S206, obtaining a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text.
Wherein the second sample prediction data is constructed based on the second sample text and the N candidate action texts corresponding to the second sample text that are determined from the action library used for action prediction; N is a positive integer. The second sample object may or may not be the same as the first sample object used to train the coarse-ranking model, which will not be limited here. Specifically, the computer device may obtain the second sample text associated with the second sample object from the initial text used to construct the action library, and may further obtain, from the action library, N candidate action texts that match the second sample text. It will be appreciated that the N candidate action texts may be randomly screened by the computer device directly from the action library, or may be preliminarily screened from the M action texts in the action library by the trained coarse-ranking model (i.e., the first target model); this will not be limited here. The computer device may then obtain the prediction instruction text used for performing action prediction on the second sample object, and may construct the second sample prediction data corresponding to the second sample text based on the prediction instruction text, the second sample text, and the N candidate action texts.
For ease of understanding, please refer to table 2, which is a schematic table of the sample prediction data used in the fine-ranking stage according to an embodiment of the present application. As shown in table 2:
TABLE 2
The input data of the fine-ranking model may be organized as multiple-choice questions, where "input" may include the second sample text corresponding to the object b (i.e., the second sample object) and the N candidate action texts, and "instruction" may be used to represent the prediction instruction text used for performing action prediction on the object b.
Step S207, training the second initial model based on the N candidate action texts and the second sample text to obtain a second target model.
Specifically, the computer device may call the second initial model, input the prediction instruction text, the N candidate action texts, and the second sample text into the second initial model, and further may determine, from the N candidate action texts, the sample prediction action text corresponding to the second sample object through the second initial model. The computer device may then train the second initial model to obtain a second target model based on the sample predicted action text and a second model convergence condition associated with the second initial model. The first target model and the second target model may be used for jointly predicting a predicted action text corresponding to the virtual object, and the predicted action text is used for driving the virtual object to execute an action corresponding to the predicted action text.
For ease of understanding, please refer to fig. 6, which is a schematic diagram of a scenario of action prediction by the fine-ranking model according to an embodiment of the present application. As shown in fig. 6, the model 600W may be a language model (e.g., an LLM) used for performing action prediction in the fine-ranking stage.
The sample text 6X may be the second sample text corresponding to the second sample object obtained by the computer device from the initial text; as shown in table 2 above, the sample text 6X may be "Player: object b, you know nothing about …… her …… likes you. Object b: I know.". It should be appreciated that, to effectively avoid the input-text-length limitation of the model 600W, the N candidate action texts corresponding to the sample text 6X are preliminarily screened out from the action library by the first target model, and may specifically include candidate action text 1, candidate action text 2, ..., and candidate action text N.
For example, the computer device may input the sample text 6X into the first target model and encode the sample text 6X through the first target model to obtain the sample text vector (i.e., a second sample vector) corresponding to the sample text 6X. Then, the computer device may perform vector similarity calculation between each action vector in the action library and the sample text vector corresponding to the sample text 6X to obtain the text matching degrees between the M action texts and the sample text 6X. Further, the computer device may perform ranking processing (e.g., descending or ascending ranking) on the M text matching degrees, sequentially acquire N text matching degrees from the ranking result, and determine the action texts corresponding to the acquired N text matching degrees as the N candidate action texts corresponding to the sample text 6X.
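The recall step above — score every action vector against the sample vector, rank the matching degrees, and keep the top N — can be sketched as follows (the library layout of `(action_text, action_vector)` pairs is an illustrative assumption):

```python
import math

def top_n_actions(sample_vec, action_library, n):
    """Recall the n action texts whose vectors best match the sample vector.

    action_library -- list of (action_text, action_vector) pairs whose
    vectors were precomputed offline, as in the action library above.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.sqrt(sum(a * a for a in u))
                      * math.sqrt(sum(b * b for b in v)))

    # Score every action text, then rank by matching degree in descending order.
    scored = [(cos(sample_vec, vec), text) for text, vec in action_library]
    scored.sort(key=lambda s: s[0], reverse=True)
    return [text for _, text in scored[:n]]
```

At production scale the same step would typically use an approximate-nearest-neighbor index rather than a full linear scan, but the ranking logic is unchanged.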
The computer device may then perform domain tuning on the model 600W. For example, the computer device may input the N candidate action texts, the sample text 6X, and the prediction instruction text used for performing action prediction on the second sample object into the model 600W together, and perform action prediction on the second sample object through the model 600W to obtain the output text of the model 600W. The output text here may include the sample predicted action text (e.g., candidate action text 1) determined for the second sample object from among the N candidate action texts.
Further, the computer device may train the second initial model based on the sample predicted action text and the second model convergence condition associated with the second initial model to obtain the second target model. The second model convergence condition here may be that model training is stopped when the model loss of the second initial model (i.e., the second model loss) fails to decrease for n consecutive rounds (e.g., 10 rounds). Alternatively, the second model convergence condition may be that model training is stopped when the second model loss of the second initial model is smaller than a loss threshold in the second model convergence condition. This will not be limited here.
Specifically, the model loss function of the second initial model can be expressed as the following formula (2):

L = -Σ_{j=1}^{K} log P_j(y_j | x, y_{<j})    (2)

where K can be used to represent the total number of characters in the sample predicted action text, K being a positive integer; P_j may be used to represent the character generation probability of the j-th character y_j; x is used to represent the input (e.g., the vocabulary formed by the N candidate action texts); y_{<j} may be used to represent the history of characters generated before the j-th character; and the optimized loss function may be the cross entropy.
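As a minimal sketch of formula (2), given the K per-character generation probabilities the sequence loss is simply the summed negative log-likelihood (a real implementation would apply framework cross-entropy over logits rather than precomputed probabilities):

```python
import math

def sequence_loss(char_probs):
    """Formula (2): negative log-likelihood summed over the K characters
    of the sample predicted action text.

    char_probs -- P_j for j = 1..K, the model's probability of generating
    each character given the input and the characters generated before it.
    """
    return -sum(math.log(p) for p in char_probs)
```

A perfectly confident prediction (every P_j = 1) gives zero loss; any uncertainty contributes a positive term, which training pushes down.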
It should be appreciated that, where K characters are included in the sample predicted action text, the computer device may obtain the character generation probability of each of the K characters and may determine the second model loss of the second initial model based on the K character generation probabilities. Further, the computer device may train the second initial model based on the second model loss to obtain a third model training result. If the third model training result indicates that the trained second initial model meets the second model convergence condition associated with the second initial model, the computer device can directly determine the second initial model that meets the second model convergence condition as the second target model, so as to realize finer-grained action prediction.
For ease of understanding, please refer to fig. 7, which is a schematic diagram of a text action prediction framework based on two-stage training according to an embodiment of the present application. As shown in fig. 7, the framework may include the training and inference processes corresponding to the coarse-ranking model and the fine-ranking model, respectively.
It should be understood that, in the coarse-ranking stage of the training process, the computer device in the embodiment of the application does not need a large amount of high-quality video annotation data: it can combine the action assets historically accumulated by the service object (e.g., the service party) with the acquired initial text (e.g., novels, scripts, and the like) to construct the action library used for action prediction. This makes the action prediction reliable and controllable, reduces risk, and effectively ensures that the predicted actions executed by the virtual object are reliable, controllable, and safe. For example, upon acquiring the initial text, the computer device may extract action texts from it, determine the actions matching those action texts from the action assets, and build the action library.
Specifically, when the computer device acquires the initial text, the texts containing dialogue information (also called dialogue context texts) can be screened from the initial text, and the screened texts can then be subjected to data cleaning to obtain the M action texts. Data cleaning refers to the process of re-examining and checking the screened texts, with the aim of deleting duplicate information, correcting existing errors, and ensuring data consistency.
The M action texts here include the action text D_i, where i is a positive integer less than or equal to M. The computer device is then able to determine, from the historically accumulated action assets, the action matched by the action text D_i; when the actions respectively matched by the M action texts are obtained, the action library is constructed based on the M action texts and the M actions. Of course, to increase the subsequent inference speed of the first target model, the computer device may compute the vector representations (i.e., the action vectors) of the action texts offline, and then construct the action library based on the M action texts, the action matched by each action text, and the action vector corresponding to each action text.
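The library construction above can be sketched as follows; the dictionary layout and the `encode` callback are illustrative assumptions (the embodiment leaves the concrete encoding model and storage format open):

```python
def build_action_library(action_texts, actions, encode):
    """Build an action library keyed by action text, storing for each entry
    the matched action asset and an offline-precomputed action vector.

    action_texts -- the M cleaned action texts
    actions      -- the M actions matched from the accumulated action assets
    encode       -- vector-characterization function (e.g., the coarse-ranking
                    model's encoder), called once per text, offline
    """
    library = {}
    for text, action in zip(action_texts, actions):
        library[text] = {
            "action": action,        # asset matched from the action assets
            "vector": encode(text),  # precomputed offline to speed up inference
        }
    return library
```

Because the vectors are computed once at build time, inference only needs to encode the incoming text and compare it against stored vectors.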
When the computer device adopts the two-stage-training action prediction scheme, the data for the scheme needs to be constructed first. This specifically includes the construction of the action library, the construction of the coarse-ranking data (i.e., the first sample prediction data) for the coarse-ranking stage, and the construction of the fine-ranking data (i.e., the second sample prediction data) for the fine-ranking stage, so that the coarse-ranking model and the fine-ranking model can be trained in stages based on the constructed data. In the coarse-ranking stage, the offline-computed action vectors of the action texts in the action library are stored, and fine inference is then carried out through the fine-ranking model. Compared with directly using one model for inference, this can effectively improve the prediction speed, achieve lower latency, and provide stronger landing capability, making the scheme easy to bring online.
The training modes here may include a first training mode (e.g., independent training), a second training mode (e.g., training the fine-ranking model with the coarse-ranking model), a third training mode (e.g., a training mode that introduces knowledge distillation), and a fourth training mode (e.g., a joint training mode).
For the first training mode, the computer device may obtain the first sample text corresponding to the first sample object and the first sample prediction data corresponding to the first sample text. The first sample prediction data is constructed based on the first sample text and the correct action text of the first sample object, which is determined from the action library after the computer device performs semantic analysis on the first sample text. Then, referring to step S205 in the embodiment corresponding to fig. 4, the computer device trains the first initial model (the coarse-ranking model to be trained) based on the correct action text, the first sample text, and the action library to obtain the first target model used for the preliminary screening of the M action texts in the action library.
Further, the computer device may obtain a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text. Wherein the second sample prediction data is constructed based on the N candidate action texts corresponding to the second sample text and the second sample text; the N candidate action texts are randomly determined by the computer device directly from an action library for action prediction; n is a positive integer. Then, the computer device may refer to step S207 in the embodiment corresponding to fig. 4, and train the second initial model based on the N candidate action texts and the second sample text, to obtain a second target model for performing the second screening on the N candidate action texts.
For the second training mode, the computer device may first obtain a first sample text corresponding to the first sample object and first sample prediction data corresponding to the first sample text, and then train the first initial model based on the correct action text, the first sample text and the action library of the first sample object in the first sample prediction data, so as to obtain a first target model for performing preliminary screening on M action texts in the action library.
Further, the computer device may obtain the second sample text corresponding to the second sample object and the second sample prediction data corresponding to the second sample text. The second sample prediction data is constructed based on the second sample text and the N candidate action texts corresponding to the second sample text; here, the second sample text may be the first sample text used for training the first initial model, and the N candidate action texts are the top-ranked N action texts preliminarily screened from the action library by the first target model. Then, the computer device may refer to step S207 in the embodiment corresponding to fig. 4 and train the second initial model based on the N candidate action texts and the second sample text to obtain the second target model used for performing the second screening on the N candidate action texts.
For the third training mode: since the coarse-ranking model is a two-tower model with only late interaction, it lacks information interaction between the dialogue text (query) and the action (action), so its effect is generally inferior to that of the fine-ranking model (single tower). To improve the recall of the coarse-ranking model, the embodiment of the application can first train the fine-ranking model offline and then distill the capability of the fine-ranking model into the coarse-ranking model through knowledge distillation. That is, the higher-precision fine-ranking model is trained offline first, and the coarse-ranking model is then trained while its outputs are pulled toward those of the fine-ranking model. With the coarse-ranking model as the student model and the fine-ranking model as the teacher model, the coarse-ranking model can learn the interaction characteristics of the teacher model without performing deep interaction itself, and inference can be carried out directly with the coarse-ranking model once it is online.
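One common way to pull the student's outputs toward the teacher's, sketched below, is a temperature-softened KL divergence between the two models' candidate-action score distributions; the KL form and the temperature value are conventional distillation choices, not fixed by the embodiment:

```python
import math

def softmax(scores, temperature=1.0):
    """Convert raw similarity scores into a probability distribution."""
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_scores, teacher_scores, temperature=2.0):
    """KL divergence KL(teacher || student) over candidate-action scores,
    pulling the coarse (student) model's distribution toward the fine
    (teacher) model's."""
    p = softmax(teacher_scores, temperature)
    q = softmax(student_scores, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The loss is zero only when the two distributions agree, so minimizing it transfers the teacher's ranking preferences to the student without adding deep interaction to the student itself.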
For example, the computer device may first obtain a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text. Wherein the second sample prediction data is constructed based on the N candidate action texts corresponding to the second sample text and the second sample text; the N candidate action texts are randomly determined by the computer device directly from an action library for action prediction; n is a positive integer. Then, the computer device may refer to step S207 in the embodiment corresponding to fig. 4, and train the second initial model based on the N candidate action texts and the second sample text, to obtain a second target model for performing the second screening on the N candidate action texts.
Further, the computer device may use the trained second target model as the teacher model and the first initial model as the student model. Then, when the first sample text corresponding to the first sample object used for training the first initial model is obtained, the computer device may input the first sample text and the N candidate action texts preliminarily screened from the action library into the second target model, and accurately predict, through the second target model, the sample predicted action text corresponding to the first sample text, so that this sample predicted action text may be used as the correct action text of the first sample text. At this time, the computer device may construct the first sample prediction data based on the first sample text and its correct action text, and, referring to step S205 in the embodiment corresponding to fig. 4, train the first initial model (the coarse-ranking model to be trained) based on the correct action text, the first sample text, and the action library to obtain the first target model used for performing the preliminary screening on the M action texts in the action library.
For the fourth training mode, the computer device may train the first initial model and the second initial model simultaneously, and the training loss function may be determined based on the cross entropy and a divergence term characterizing the difference between the outputs of the first initial model and the second initial model, which will not be described in detail here.
In the inference process, during the coarse-ranking stage, the computer device may construct the text to be predicted in the coarse-ranking task format based on the acquired service text, then use the first target model to recall the N action texts in the action library whose text matching degrees with the text to be predicted rank highest, and then use the second target model to perform the secondary screening on the N action texts to obtain the predicted action text corresponding to the text to be predicted. For details, refer to the description of steps S101 to S103 in the embodiment corresponding to fig. 3, which will not be repeated here.
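The two-stage inference flow can be sketched end to end as follows; the `encode`/`similarity`/`choose` interfaces are illustrative assumptions, since the embodiment fixes only the two-stage flow, not these signatures:

```python
def predict_action(text_to_predict, coarse_model, fine_model, action_library, n=10):
    """Two-stage inference: coarse recall of n candidates, then fine selection.

    action_library -- iterable of (action_text, precomputed_action_vector).
    """
    # Stage 1: coarse ranking -- recall the n best-matching action texts.
    query_vec = coarse_model.encode(text_to_predict)
    scored = sorted(
        ((coarse_model.similarity(query_vec, vec), text)
         for text, vec in action_library),
        reverse=True,
    )
    candidates = [text for _, text in scored[:n]]
    # Stage 2: fine ranking -- the language model picks one candidate,
    # so the output always belongs to the existing action library.
    return fine_model.choose(text_to_predict, candidates)
```

Because stage 1 works on precomputed vectors and stage 2 only sees n candidates, the fine-ranking model never has to fit all M action texts into its input window.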
In the embodiment of the present application, a method is provided for automatically predicting the actions of a virtual object from the dialogue context (i.e., the text to be predicted) through two-stage training. The two-stage training is necessary because, limited by the input text length, the model cannot match the dialogue context against all actions in the action library at once. Therefore, in the coarse-ranking stage, contrastive-learning training is applied to the first initial model on the first sample text and the possible action candidates, so that the similarity score between the first sample text and the correct action text becomes larger while the similarity scores between the first sample text and the incorrect action texts become smaller; when the trained first initial model (i.e., the first target model) is then used for inference, the N (e.g., 10) most likely candidate actions can be quickly recalled from the action library, improving the inference speed and mitigating the input-length limitation. In the fine-ranking stage, the reasoning capability of the second target model (e.g., a large language model) can be used to further screen the 10 candidate actions recalled in the previous stage, so that the action to be executed by the virtual object can be accurately predicted. In addition, since the finally obtained predicted action text belongs to the existing action library, the action assets accumulated by the service party can be well utilized to make reliable and controllable predictions for the virtual object, effectively ensuring the user's interactive experience.
Further, referring to fig. 8, fig. 8 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 8, the data processing apparatus 1 may include: the system comprises an acquisition module 100, a selection module 200, a generation module 300, an action acquisition module 400, a play duration acquisition module 500, a data alignment module 600 and a data return module 700.
The obtaining module 100 is configured to obtain text to be predicted associated with a virtual object;
the selection module 200 is configured to obtain text matching degrees between M motion texts in the motion library and the text to be predicted, and select N motion texts corresponding to the text to be predicted from the M motion texts based on the M text matching degrees; m is a positive integer; n is a positive integer less than or equal to M.
The action library comprises M action texts and action vectors corresponding to the M action texts respectively;
the selection module 200 includes: an encoding processing unit 2010, a similarity calculation unit 2020, a sorting processing unit 2030, and a text determination unit 2040.
The encoding processing unit 2010 is configured to input a text to be predicted into a first target model, and encode the text to be predicted through the first target model to obtain a text encoding vector corresponding to the text to be predicted;
The similarity calculating unit 2020 is configured to perform vector similarity calculation on the text encoding vector and each motion vector, respectively, to obtain text matching degrees between M motion texts and the text to be predicted;
the sorting processing unit 2030 is configured to perform descending order sorting processing on the M text matching degrees to obtain a sorting result, and sequentially obtain N text matching degrees from the sorting result;
the text determining unit 2040 is configured to determine, as N action texts corresponding to the text to be predicted, the action texts corresponding to the N obtained text matching degrees, respectively.
The specific implementation manner of the encoding processing unit 2010, the similarity calculating unit 2020, the sorting processing unit 2030 and the text determining unit 2040 may refer to the description of step S102 in the embodiment corresponding to fig. 3, and the detailed description will not be repeated here.
The generating module 300 is configured to perform motion prediction on the virtual object based on the N motion texts and the text to be predicted, and generate a predicted motion text corresponding to the text to be predicted; predicting that the action text belongs to N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
Wherein, the generating module 300 includes: acquisition unit 3010, input unit 3020, and generation unit 3030.
The acquiring unit 3010 is configured to acquire a prediction instruction text for performing motion prediction on a virtual object;
the input unit 3020 is configured to call the second target model, and input the prediction instruction text, the N action texts, and the text to be predicted to the second target model;
the generating unit 3030 is configured to perform motion prediction on the virtual object through the second target model and the N motion texts, so as to generate a predicted motion text corresponding to the text to be predicted.
The specific implementation manner of the obtaining unit 3010, the input unit 3020 and the generating unit 3030 may be referred to the description of step S103 in the embodiment corresponding to fig. 3, which will not be further described herein.
The action library comprises M actions corresponding to the action texts respectively; the text to be predicted is obtained based on an action prediction request sent by the service terminal equipment; the action prediction request also comprises audio data corresponding to the text to be predicted;
the action obtaining module 400 is configured to obtain an action corresponding to the predicted action text from M actions;
the playing time length obtaining module 500 is configured to obtain a playing time length of the audio data;
the data alignment module 600 is configured to perform data alignment on the action corresponding to the predicted action text and the audio data according to the play duration, so as to obtain aligned data;
The data return module 700 is configured to return the alignment data to the service terminal device, so that the service terminal device displays the alignment data; the alignment data is used for showing that the virtual object executes the action corresponding to the predicted action text in the playing time.
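As a purely hypothetical sketch of the data-alignment step handled by the modules above: the embodiment states that the action and the audio are aligned over the audio's play duration but leaves the mechanism open; time-scaling the action to span the play duration is one plausible choice, and the field names below are illustrative:

```python
def align_action_with_audio(action, audio_duration_ms, action_duration_ms):
    """Scale an action's playback so it spans the audio's play duration.

    Returns alignment data pairing the action with the play duration,
    so the virtual object executes the action while the audio plays.
    """
    # speed > 1 plays the action faster than authored; < 1 plays it slower.
    speed = action_duration_ms / audio_duration_ms
    return {
        "action": action,
        "duration_ms": audio_duration_ms,  # matches the audio play duration
        "playback_speed": speed,
    }
```

The service terminal device would then render this alignment data, playing the audio and the (re-timed) action over the same interval.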
The specific implementation manners of the obtaining module 100, the selecting module 200, the generating module 300, the action obtaining module 400, the playing duration obtaining module 500, the data aligning module 600 and the data returning module 700 may be referred to the description of the steps S101 to S103 in the embodiment corresponding to fig. 3, and will not be further described herein. In addition, the description of the beneficial effects of the same method is omitted.
Further, referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application. As shown in fig. 9, the data processing apparatus 2 may include: the system comprises a first sample acquisition module 10, a first training module 20, a second sample acquisition module 30, a second training module 40, a data screening module 50, a data cleaning module 60, an action determination module 70 and an action library construction module 80.
The first sample acquiring module 10 is configured to acquire a first sample text corresponding to a first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text for the first sample object is determined from the library of actions used to make the action prediction.
Wherein the first sample acquisition module 10 includes: a first acquisition unit 101, a segmentation processing unit 102, a semantic analysis unit 103, and a first construction unit 104.
The first obtaining unit 101 is configured to obtain a text associated with a first sample object from initial texts for constructing an action library;
the segmentation processing unit 102 is configured to perform segmentation processing on a text associated with a first sample object based on sentence delimiters, so as to obtain a first sample text corresponding to the first sample object;
the semantic analysis unit 103 is configured to perform semantic analysis on the first sample text, and determine the correct action text of the first sample object from the action library;
the first construction unit 104 is configured to construct first sample prediction data corresponding to the first sample text based on the correct action text and the first sample text.
The specific implementation manner of the first obtaining unit 101, the segmentation processing unit 102, the semantic analysis unit 103, and the first construction unit 104 may be referred to the description of step S204 in the embodiment corresponding to fig. 4, and the description thereof will not be repeated here.
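The delimiter-based segmentation performed by unit 102 can be sketched as follows. The patent does not fix the delimiter set, so the delimiters below (Chinese and ASCII sentence-ending punctuation) are an assumption for illustration.

```python
import re

# Illustrative sketch: split the text associated with a sample object on
# sentence delimiters to obtain the individual sample texts.
# The delimiter set is an assumption, not specified by the patent.

def split_into_sample_texts(text, delimiters="。！？!?."):
    pattern = "[" + re.escape(delimiters) + "]"
    return [s.strip() for s in re.split(pattern, text) if s.strip()]

samples = split_into_sample_texts("Hello there! How are you? Fine.")
# → ["Hello there", "How are you", "Fine"]
```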
The first training module 20 is configured to train the first initial model based on the correct action text, the first sample text and the action library, to obtain a first target model.
The action library comprises M action texts and action vectors corresponding to the M action texts respectively; m is a positive integer greater than or equal to N;
the first training module 20 includes: a sample vector determination unit 201, an action vector acquisition unit 202, a matching degree determination unit 203, and a first training unit 204.
The sample vector determining unit 201 is configured to input the first sample text into the first initial model, and encode the first sample text through the first initial model to obtain a sample text vector corresponding to the first sample text;
the action vector obtaining unit 202 is configured to obtain, from the action library, an action vector corresponding to each action text of the P action texts; P is a positive integer less than or equal to M; the P action texts comprise the correct action text;
the matching degree determining unit 203 is configured to perform vector similarity calculation on each of the P action vectors and the sample text vector, so as to obtain P text matching degrees;
the first training unit 204 is configured to train the first initial model based on the P text matching degrees and a first model convergence condition associated with the first initial model, to obtain a first target model.
Wherein the P action texts comprise an error action text set consisting of (P-1) error action texts; the error action texts refer to the action texts other than the correct action text among the P action texts;
the first training unit 204 includes: a matching degree determination subunit 2041, a first loss determination subunit 2042, a first training subunit 2043, and a first model determination subunit 2044.
The matching degree determining subunit 2041 is configured to determine, from the P text matching degrees, a first text matching degree corresponding to the correct action text and a second text matching degree corresponding to each of the error action texts;
the first loss determination subunit 2042 is configured to determine a first model loss of the first initial model based on the first text matching degree and the second text matching degree;
the first training subunit 2043 is configured to train the first initial model based on the first model loss, to obtain a first model training result;
the first model determining subunit 2044 is configured to determine, if the first model training result indicates that the trained first initial model satisfies the first model convergence condition associated with the first initial model, a first target model based on the first initial model satisfying the first model convergence condition.
Wherein the first model determination subunit 2044 is further specifically configured to:
if the first model training result indicates that the trained first initial model meets the first model convergence condition associated with the first initial model, and the first sample prediction data comprises negative case action text, determining the first initial model meeting the first model convergence condition as a model to be processed; the negative example action text refers to an action text, in which the text matching degree between the determined action text and the first sample text is in a matching degree interval, in an action library;
retraining the model to be processed based on the text matching degree between the negative example action text and the first sample text to obtain a second model training result;
if the second model training result indicates that the trained to-be-processed model meets the first model convergence condition, the to-be-processed model meeting the first model convergence condition is determined to be the first target model.
The specific implementation manner of the matching degree determining subunit 2041, the first loss determining subunit 2042, the first training subunit 2043, and the first model determining subunit 2044 may be referred to the description of the training of the first initial model in the embodiment corresponding to fig. 4, and will not be further described herein.
The specific implementation manners of the sample vector determining unit 201, the action vector obtaining unit 202, the matching degree determining unit 203, and the first training unit 204 may be referred to the description of step S205 in the embodiment corresponding to fig. 4, and the detailed description will not be repeated here.
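The first-model loss built from the first and second text matching degrees can be read as a contrastive objective: the correct action text's matching degree is pushed above those of the (P-1) error action texts. The patent does not fix the exact loss, so the softmax cross-entropy (InfoNCE-style) form and cosine similarity below are assumptions for illustration.

```python
import math

# One plausible reading of the first-model loss (an assumption, not the
# patent's definitive formula): softmax cross-entropy over the P text
# matching degrees, with the correct action text as the positive.

def cosine(u, v):
    # vector similarity used as the "text matching degree"
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def first_model_loss(sample_vec, action_vecs, correct_idx):
    """action_vecs: P action vectors; correct_idx indexes the correct one."""
    scores = [cosine(sample_vec, v) for v in action_vecs]  # P matching degrees
    log_z = math.log(sum(math.exp(s) for s in scores))
    return log_z - scores[correct_idx]  # -log softmax(correct)

sample = [1.0, 0.0]
actions = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]  # index 0 is the correct text
loss = first_model_loss(sample, actions, correct_idx=0)
```

Minimizing this loss increases the first text matching degree relative to the second ones, which matches the roles of subunits 2041 and 2042 described above.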
The second sample acquiring module 30 is configured to acquire a second sample text corresponding to a second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; n candidate action texts are determined from an action library for performing action prediction; n is a positive integer.
Wherein the second sample acquisition module 30 comprises: a second fetch unit 301, a third fetch unit 302, an instruction fetch unit 303, and a second construction unit 304.
The second obtaining unit 301 is configured to obtain, from the initial text used for constructing the action library, a second sample text associated with a second sample object;
the third obtaining unit 302 is configured to obtain, from the action library, N candidate action texts matched with the second sample text;
the instruction acquisition unit 303 is configured to acquire a prediction instruction text for performing motion prediction on the second sample object;
The second construction unit 304 is configured to construct second sample prediction data corresponding to the second sample text based on the prediction instruction text, the second sample text, and the N candidate action texts.
The specific implementation manner of the second acquiring unit 301, the third acquiring unit 302, the instruction acquiring unit 303 and the second constructing unit 304 may be referred to the description of step S206 in the embodiment corresponding to fig. 4, and the detailed description will not be repeated here.
The second training module 40 is configured to train the second initial model based on the N candidate action texts and the second sample text, to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
Wherein the second sample prediction data includes a prediction instruction text for performing motion prediction on the second sample object;
the second training module 40 includes: a sample input unit 401, a predicted text determination unit 402, and a second training unit 403.
The sample input unit 401 is configured to invoke a second initial model, and input a prediction instruction text, N candidate action texts, and a second sample text into the second initial model;
The predicted text determining unit 402 is configured to determine, from the N candidate action texts, a sample predicted action text corresponding to the second sample object through the second initial model;
the second training unit 403 is configured to train the second initial model based on the sample prediction action text and a second model convergence condition associated with the second initial model, to obtain a second target model.
Wherein the sample predictive action text includes K characters; k is a positive integer;
the second training unit 403 includes: a probability acquisition subunit 4031, a second loss determination subunit 4032, a second training subunit 4033, and a second model determination subunit 4034.
The probability obtaining subunit 4031 is configured to obtain a character generation probability of each of the K characters;
the second loss determination subunit 4032 is configured to determine a second model loss of the second initial model based on the K character generation probabilities;
the second training subunit 4033 is configured to train the second initial model based on the second model loss, to obtain a third model training result;
the second model determining subunit 4034 is configured to determine a second target model if the third model training result indicates that the trained second initial model meets the second model convergence condition associated with the second initial model, and the second initial model meeting the second model convergence condition.
The specific implementation manner of the probability obtaining subunit 4031, the second loss determining subunit 4032, the second training subunit 4033, and the second model determining subunit 4034 may be referred to the description of the training of the second initial model in the embodiment corresponding to fig. 4, which will not be further described herein.
The specific implementation manner of the sample input unit 401, the predicted text determining unit 402, and the second training unit 403 may be referred to the description of step S207 in the embodiment corresponding to fig. 4, and the detailed description will not be repeated here.
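The second-model loss derived from the K character generation probabilities can be sketched as a negative log-likelihood over the characters of the sample predicted action text. The patent only states that the loss is determined from the K probabilities; the mean-NLL form below is an assumption for illustration.

```python
import math

# Illustrative sketch (an assumption, not the patent's definitive formula):
# the second model loss is taken as the mean negative log-likelihood over
# the K character generation probabilities of the sample predicted action text.

def second_model_loss(char_probs):
    return -sum(math.log(p) for p in char_probs) / len(char_probs)

loss = second_model_loss([0.9, 0.8, 0.95])  # K = 3 character probabilities
# the loss shrinks as the model assigns higher probability to each character
```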
The data screening module 50 is configured to screen, when an initial text is acquired, a text including dialogue information from the initial text;
the data cleaning module 60 is configured to perform data cleaning on the screened text to obtain M action texts; the M action texts comprise an action text D_i; i is a positive integer less than or equal to M; M is a positive integer greater than or equal to N;
the action determining module 70 is configured to determine, from action assets historically accumulated by the service object, an action matched with the action text D_i;
the action library construction module 80 is configured to construct the action library based on the M action texts and the M actions when the actions respectively matched with the M action texts are obtained.
The specific implementation manner of the first sample acquiring module 10, the first training module 20, the second sample acquiring module 30, the second training module 40, the data filtering module 50, the data cleaning module 60, the action determining module 70 and the action library constructing module 80 can be referred to the description of the steps S101-S103 in the embodiment corresponding to fig. 3 and the steps S201-S207 in the embodiment corresponding to fig. 4, and will not be further described herein. In addition, the description of the beneficial effects of the same method is omitted.
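The screen-clean-match flow implemented by modules 50-80 can be sketched end to end. The patent leaves the dialogue filter, the cleaning rules, and the asset format unspecified, so the quote-based filter, trim/de-duplicate cleaning, and dict-shaped assets below are all illustrative assumptions.

```python
# Illustrative sketch of the action-library construction flow: keep only
# texts carrying dialogue information, clean them (here: trim and
# de-duplicate, a stand-in for the unspecified cleaning step), and pair
# each action text with a matched action from the historical assets.

def build_action_library(initial_texts, action_assets):
    """action_assets: hypothetical mapping from action text to an action id."""
    dialogue_texts = [t for t in initial_texts if '"' in t]  # crude dialogue filter
    cleaned, seen = [], set()
    for t in dialogue_texts:
        t = t.strip()
        if t and t not in seen:
            seen.add(t)
            cleaned.append(t)
    # keep only texts for which a matched action exists in the assets
    return {t: action_assets[t] for t in cleaned if t in action_assets}

library = build_action_library(
    ['"Hello!"', "stage direction", '"Hello!"', '"Goodbye."'],
    {'"Hello!"': "wave", '"Goodbye."': "bow"},
)
# → {'"Hello!"': "wave", '"Goodbye."': "bow"}
```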
Further, referring to fig. 10, fig. 10 is a schematic diagram of a computer device according to an embodiment of the application. As shown in fig. 10, the computer device 1000 may be a computer device having an action prediction function, and the computer device 1000 may include: at least one processor 1001 (e.g., a CPU), at least one network interface 1004, a memory 1005, and at least one communication bus 1002. The communication bus 1002 is used to enable connection and communication between these components. The network interface 1004 may optionally include a standard wired interface and a wireless interface (e.g., a WI-FI interface). The memory 1005 may be a high-speed RAM or a non-volatile memory, for example, at least one disk memory. The memory 1005 may optionally also be at least one storage device located remotely from the processor 1001. As shown in fig. 10, the memory 1005, as a computer storage medium, may include an operating system, a network communication module, a user interface module, and a device control application. In some embodiments, for example, if the computer device is a terminal device having an action prediction function (e.g., the terminal device 100a shown in fig. 1), the computer device may further include the user interface 1003 shown in fig. 10, where the user interface 1003 may include a display screen (Display), a keyboard (Keyboard), and so on.
In the computer device 1000 shown in fig. 10, the network interface 1004 is mainly used for network communication; the user interface 1003 is mainly used for providing an input interface for a user; and the processor 1001 may be used to invoke the device control application stored in the memory 1005 to implement:
acquiring a text to be predicted associated with a virtual object;
respectively acquiring text matching degrees between M action texts in an action library and the text to be predicted, and selecting N action texts corresponding to the text to be predicted from the M action texts based on the M text matching degrees; m is a positive integer; n is a positive integer less than or equal to M;
performing action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted; predicting that the action text belongs to N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
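The retrieval step in the method above (sort the M text matching degrees in descending order, keep the top N action texts for the generation stage) can be sketched directly. Function and variable names are illustrative, not from the patent.

```python
# Illustrative sketch of selecting the N candidate action texts from the
# M action texts by descending text matching degree.

def select_top_n_actions(matching_degrees, action_texts, n):
    """matching_degrees: M scores aligned with action_texts; returns N texts."""
    ranked = sorted(zip(matching_degrees, action_texts),
                    key=lambda pair: pair[0], reverse=True)
    return [text for _, text in ranked[:n]]

texts = ["wave", "nod", "jump", "bow"]
scores = [0.2, 0.9, 0.1, 0.7]
candidates = select_top_n_actions(scores, texts, n=2)
# → ["nod", "bow"]
```

The second target model would then pick the final predicted action text from these N candidates.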
The processor 1001 may also be used to invoke a device control application stored in the memory 1005 to implement:
acquiring a first sample text corresponding to a first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text of the first sample object is determined from an action library for action prediction;
Training the first initial model based on the correct action text, the first sample text and the action library to obtain a first target model;
acquiring a second sample text corresponding to the second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; n candidate action texts are determined from an action library for performing action prediction; n is a positive integer;
training a second initial model based on the N candidate action texts and a second sample text to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
It should be understood that the computer device 1000 described in the embodiment of the present application may perform the description of the data processing method in the embodiment corresponding to fig. 3 and 4, or may perform the description of the data processing apparatus 1 in the embodiment corresponding to fig. 8, or the description of the data processing apparatus 2 in the embodiment corresponding to fig. 9, which are not repeated herein. In addition, the description of the beneficial effects of the same method is omitted.
The embodiment of the present application further provides a computer readable storage medium storing a computer program. The computer program comprises program instructions which, when executed by a processor, implement the data processing method provided by the steps in fig. 3 or fig. 4; for details, reference may be made to the implementations provided by the steps in fig. 3 or fig. 4, which are not described herein again.
The computer readable storage medium may be the data processing apparatus provided in any of the foregoing embodiments, or an internal storage unit of the computer device, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the computer device. Further, the computer-readable storage medium may also include both an internal storage unit and an external storage device of the computer device. The computer-readable storage medium is used to store the computer program and other programs and data required by the computer device. The computer-readable storage medium may also be used to temporarily store data that has been output or is to be output.
Embodiments of the present application also provide a computer program product comprising a computer program stored in a computer readable storage medium. The processor of the computer device reads the computer program from the computer readable storage medium, and the processor executes the computer program, so that the computer device may perform the description of the data processing method or apparatus in the foregoing embodiments, which is not described herein. In addition, the description of the beneficial effects of the same method is omitted.
Those skilled in the art will appreciate that implementing all or part of the above-described methods may be accomplished by way of computer programs, which may be stored on a computer-readable storage medium, and which, when executed, may comprise the steps of the embodiments of the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), or the like.
The foregoing disclosure is illustrative of the present application and is not to be construed as limiting the scope of the application, which is defined by the appended claims.

Claims (18)

1. A method of data processing, comprising:
acquiring a text to be predicted associated with a virtual object;
respectively acquiring text matching degrees between M action texts in an action library and the text to be predicted, and selecting N action texts corresponding to the text to be predicted from the M action texts based on the M text matching degrees; m is a positive integer; the N is a positive integer less than or equal to M;
performing action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted; the predicted action text belongs to the N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
2. The method of claim 1, wherein the action library comprises M action texts and action vectors corresponding to the M action texts, respectively;
the obtaining text matching degrees between the M action texts in the action library and the text to be predicted respectively, and selecting N action texts corresponding to the text to be predicted from the M action texts based on the M text matching degrees, includes:
Inputting the text to be predicted into a first target model, and encoding the text to be predicted through the first target model to obtain a text encoding vector corresponding to the text to be predicted;
performing vector similarity calculation on the text encoding vector and each action vector respectively, so as to obtain the text matching degrees between the M action texts and the text to be predicted;
sorting the M text matching degrees in descending order to obtain a sorting result, and sequentially obtaining N text matching degrees from the sorting result;
and determining the action texts corresponding to the N text matching degrees as N action texts corresponding to the text to be predicted.
3. The method of claim 1, wherein the performing motion prediction on the virtual object based on the N motion texts and the text to be predicted, and generating the predicted motion text corresponding to the text to be predicted, comprises:
acquiring a prediction instruction text for performing action prediction on the virtual object;
invoking a second target model, and inputting the prediction instruction text, the N action texts and the text to be predicted into the second target model;
And performing action prediction on the virtual object through the second target model and the N action texts to generate the predicted action text corresponding to the text to be predicted.
4. The method according to claim 1, wherein the action library comprises actions respectively corresponding to the M action texts; the text to be predicted is obtained based on an action prediction request sent by a service terminal device; the action prediction request further comprises audio data corresponding to the text to be predicted;
the method further comprises the steps of:
acquiring the action corresponding to the predicted action text from the M actions;
acquiring the playing time length of the audio data;
according to the playing time length, performing data alignment on the action corresponding to the predicted action text and the audio data to obtain aligned data;
returning the alignment data to the service terminal device, so that the service terminal device displays the alignment data; the alignment data is used for showing that the virtual object executes the action corresponding to the predicted action text within the play duration.
5. A method of data processing, comprising:
acquiring a first sample text corresponding to a first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text of the first sample object is determined from an action library for action prediction;
Training a first initial model based on the correct action text, the first sample text and the action library to obtain a first target model;
acquiring a second sample text corresponding to a second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; the N candidate action texts are determined from an action library for action prediction; n is a positive integer;
training a second initial model based on the N candidate action texts and the second sample text to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
6. The method of claim 5, wherein the method further comprises:
when an initial text is acquired, a text comprising dialogue information is screened out from the initial text;
performing data cleaning on the screened texts to obtain M action texts; the M action texts comprise an action text D_i; i is a positive integer less than or equal to M; M is a positive integer greater than or equal to N;
determining, from action assets historically accumulated by a service object, an action matched with the action text D_i;
and when the actions respectively matched with the M action texts are obtained, constructing the action library based on the M action texts and the M actions.
7. The method of claim 5, wherein the obtaining the first sample text corresponding to the first sample object and the first sample prediction data corresponding to the first sample text comprises:
acquiring a text associated with a first sample object from initial text used for constructing an action library;
based on the sentence separator, segmenting the text associated with the first sample object to obtain a first sample text corresponding to the first sample object;
performing semantic analysis on the first sample text, and determining the correct action text of the first sample object from the action library;
and constructing first sample prediction data corresponding to the first sample text based on the correct action text and the first sample text.
8. The method of claim 5, wherein the action library comprises M action texts and action vectors corresponding to the M action texts, respectively; m is a positive integer greater than or equal to N;
training a first initial model based on the correct action text, the first sample text and the action library to obtain a first target model, wherein the training comprises the following steps:
inputting the first sample text into a first initial model, and encoding the first sample text through the first initial model to obtain a sample text vector corresponding to the first sample text;
obtaining, from the action library, an action vector corresponding to each action text of the P action texts; P is a positive integer less than or equal to M; the P action texts comprise the correct action text;
performing vector similarity calculation on each of the P action vectors and the sample text vector respectively, so as to obtain P text matching degrees;
and training the first initial model based on the P text matching degrees and a first model convergence condition associated with the first initial model to obtain a first target model.
9. The method of claim 8, wherein the P action texts comprise an error action text set consisting of (P-1) error action texts; the error action texts refer to the action texts other than the correct action text among the P action texts;
training the first initial model based on the P text matching degrees and a first model convergence condition associated with the first initial model to obtain a first target model, including:
determining a first text matching degree corresponding to the correct action text and a second text matching degree corresponding to each false action text from the P text matching degrees;
determining a first model loss for the first initial model based on the first text matching degree and the second text matching degree;
training the first initial model based on the first model loss to obtain a first model training result;
and if the first model training result indicates that the trained first initial model meets the first model convergence condition associated with the first initial model, determining a first target model based on the first initial model meeting the first model convergence condition.
10. The method of claim 9, wherein if the first model training result indicates that the trained first initial model satisfies a first model convergence condition associated with the first initial model, determining a first target model based on the first initial model satisfying the first model convergence condition comprises:
if the first model training result indicates that the trained first initial model meets a first model convergence condition associated with the first initial model, and the first sample prediction data comprises negative example action text, determining the first initial model meeting the first model convergence condition as a model to be processed; the negative example action text refers to action text, in the action library, of which the text matching degree with the first sample text is determined to be in a matching degree interval;
retraining the model to be processed based on the text matching degree between the negative example action text and the first sample text to obtain a second model training result;
and if the second model training result indicates that the trained to-be-processed model meets the first model convergence condition, determining a first target model according to the to-be-processed model meeting the first model convergence condition.
11. The method of claim 5, wherein the obtaining the second sample text corresponding to the second sample object and the second sample prediction data corresponding to the second sample text comprises:
obtaining a second sample text associated with a second sample object from the initial text used for constructing the action library;
obtaining, from the action library, N candidate action texts matched with the second sample text;
acquiring a prediction instruction text for performing action prediction on the second sample object;
and constructing second sample prediction data corresponding to the second sample text based on the prediction instruction text, the second sample text and the N candidate action texts.
12. The method of claim 5, wherein the second sample prediction data comprises a prediction instruction text for performing action prediction on the second sample object;
training a second initial model based on the N candidate action texts and the second sample text to obtain a second target model, wherein the training comprises the following steps:
invoking a second initial model, and inputting the prediction instruction text, the N candidate action texts and the second sample text into the second initial model;
Determining a sample prediction action text corresponding to the second sample object from the N candidate action texts through the second initial model;
and training the second initial model based on the sample prediction action text and a second model convergence condition associated with the second initial model to obtain a second target model.
13. The method of claim 12, wherein the sample predictive action text comprises K characters; k is a positive integer;
training the second initial model based on the sample prediction action text and a second model convergence condition associated with the second initial model to obtain a second target model, including:
acquiring a character generation probability of each of the K characters;
determining a second model loss of the second initial model based on the K character generation probabilities;
training the second initial model based on the second model loss to obtain a third model training result;
and if the third model training result indicates that the trained second initial model meets the second model convergence condition associated with the second initial model, determining a second target model according to the second initial model meeting the second model convergence condition.
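Claim 13's second model loss aggregates the generation probability of each of the K characters in the sample prediction action text. A common realization is the mean negative log-likelihood over the K characters; the sketch below assumes that choice (the claim itself does not name a specific loss function):

```python
import math

def second_model_loss(char_probs):
    # char_probs: the generation probability of each of the K
    # characters of the sample prediction action text, each in (0, 1].
    # Mean negative log-likelihood: 0 when every character is
    # generated with probability 1, larger when any is unlikely.
    k = len(char_probs)
    return -sum(math.log(p) for p in char_probs) / k
```

Training then minimizes this loss until the second model convergence condition associated with the second initial model is met.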
14. A data processing apparatus, comprising:
the obtaining module is used for obtaining the text to be predicted associated with the virtual object;
the selection module is used for respectively acquiring text matching degrees between M action texts in the action library and the text to be predicted, and selecting N action texts corresponding to the text to be predicted from the M action texts based on the M text matching degrees; M is a positive integer; N is a positive integer less than or equal to M;
the generating module is used for carrying out action prediction on the virtual object based on the N action texts and the text to be predicted, and generating a predicted action text corresponding to the text to be predicted; the predicted action text belongs to the N action texts; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
15. A data processing apparatus, comprising:
the first sample acquisition module is used for acquiring a first sample text corresponding to a first sample object and first sample prediction data corresponding to the first sample text; the first sample prediction data is constructed based on the correct action text of the first sample object and the first sample text; the correct action text of the first sample object is determined from an action library for action prediction;
the first training module is used for training a first initial model based on the correct action text, the first sample text and the action library to obtain a first target model;
the second sample acquisition module is used for acquiring a second sample text corresponding to a second sample object and second sample prediction data corresponding to the second sample text; the second sample prediction data is constructed based on N candidate action texts corresponding to the second sample text and the second sample text; the N candidate action texts are determined from an action library for action prediction; n is a positive integer;
the second training module is used for training a second initial model based on the N candidate action texts and the second sample text to obtain a second target model; the first target model and the second target model are used for jointly predicting a predicted action text corresponding to the virtual object; the predicted action text is used for driving the virtual object to execute the action corresponding to the predicted action text.
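Claims 14 and 15 together describe a two-stage pipeline: the first target model narrows the action library to N candidate action texts, and the second target model picks the predicted action text from among them, which by construction belongs to the N candidates. A schematic sketch with the two trained models passed in as callables (all names hypothetical):

```python
def predict_action_text(text_to_predict, action_library,
                        first_target_model, second_target_model, n):
    # Stage 1: the first target model selects N candidate action
    # texts from the action library by text matching degree.
    candidates = first_target_model(text_to_predict, action_library, n)
    # Stage 2: the second target model picks one predicted action
    # text from the N candidates; this text then drives the
    # virtual object to execute the corresponding action.
    predicted = second_target_model(text_to_predict, candidates)
    assert predicted in candidates  # predicted text belongs to the N candidates
    return predicted
```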
16. A computer device, comprising: a processor, a memory and a network interface;
the processor is connected to the memory and the network interface, wherein the network interface is configured to provide a data communication function, the memory is configured to store a computer program, and the processor is configured to invoke the computer program to cause the computer device to perform the method of any one of claims 1 to 13.
17. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program adapted to be loaded and executed by a processor to cause a computer device having the processor to perform the method of any one of claims 1 to 13.
18. A computer program product, characterized in that it comprises a computer program stored in a computer readable storage medium, which computer program is adapted to be read and executed by a processor to cause a computer device with the processor to perform the method of any one of claims 1 to 13.
CN202310840488.8A 2023-07-07 2023-07-07 Data processing method, device, computer equipment and storage medium Pending CN117113968A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310840488.8A CN117113968A (en) 2023-07-07 2023-07-07 Data processing method, device, computer equipment and storage medium

Publications (1)

Publication Number Publication Date
CN117113968A true CN117113968A (en) 2023-11-24

Family

ID=88800978

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310840488.8A Pending CN117113968A (en) 2023-07-07 2023-07-07 Data processing method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN117113968A (en)

Similar Documents

Publication Publication Date Title
CN110263324A (en) Text handling method, model training method and device
CN111930992A (en) Neural network training method and device and electronic equipment
CN110234018B (en) Multimedia content description generation method, training method, device, equipment and medium
CN112487182A (en) Training method of text processing model, and text processing method and device
CN110368690B (en) Game decision model training method, game strategy generation method and device
CN107844481B (en) Text recognition error detection method and device
CN110188331A (en) Model training method, conversational system evaluation method, device, equipment and storage medium
CN111104512B (en) Game comment processing method and related equipment
CN111160350B (en) Portrait segmentation method, model training method, device, medium and electronic equipment
CN110795549B (en) Short text conversation method, device, equipment and storage medium
CN112487139A (en) Text-based automatic question setting method and device and computer equipment
CN111382231B (en) Intention recognition system and method
WO2024066920A1 (en) Processing method and apparatus for dialogue in virtual scene, and electronic device, computer program product and computer storage medium
CN112231347A (en) Data processing method and device, computer equipment and storage medium
CN112232086A (en) Semantic recognition method and device, computer equipment and storage medium
CN115953645A (en) Model training method and device, electronic equipment and storage medium
CN116935170A (en) Processing method and device of video processing model, computer equipment and storage medium
CN117540221B (en) Image processing method and device, storage medium and electronic equipment
CN115115914A (en) Information identification method, device and computer readable storage medium
CN113657272B (en) Micro video classification method and system based on missing data completion
CN117634459A (en) Target content generation and model training method, device, system, equipment and medium
CN117271745A (en) Information processing method and device, computing equipment and storage medium
CN113516972A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114419514B (en) Data processing method, device, computer equipment and storage medium
CN113378826B (en) Data processing method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication