CN117315101A - Virtual object action generation method and device and electronic equipment
- Publication number
- CN117315101A (application number CN202311149221.0A)
- Authority
- CN
- China
- Prior art keywords
- data
- processed
- target
- action
- inputting
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T13/00—Animation
- G06T13/20—3D [Three Dimensional] animation
- G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G06T13/205—3D [Three Dimensional] animation driven by audio data
- G06T2207/00—Indexing scheme for image analysis or image enhancement
- G06T2207/20—Special algorithmic details
- G06T2207/20081—Training; Learning
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Multimedia (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The embodiment of the application provides a virtual object action generation method and device and electronic equipment. The virtual object action generation method comprises the following steps: acquiring to-be-processed data generated by a target object, wherein the to-be-processed data comprise to-be-processed voice data and a to-be-processed facial image; acquiring prompt information corresponding to the to-be-processed data; inputting the to-be-processed data and the prompt information into a pre-trained action generation model, and acquiring target action data corresponding to the to-be-processed data output by the action generation model; and controlling the virtual object to execute the action corresponding to the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method and device generate actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits actual use scenarios and improves the user experience.
Description
Technical Field
The application belongs to the field of computers, and particularly relates to a virtual object action generation method and device and electronic equipment.
Background
In the related art, a text prompt is generally generated from received visual information or text data and input into a large language model; the text output of the large language model is then provided to an artificial intelligence agent, which selects an action according to the environmental state it is in and the received text output. However, with actions generated from text alone, the user cannot communicate with the agent, that is, the user cannot influence the action output by the artificial intelligence through speech, so the approach is not close to a real scene.
Disclosure of Invention
In view of the above problems, embodiments of the present application provide a method and apparatus for generating a virtual object action, and an electronic device, so as to address the above problems.
In a first aspect, an embodiment of the present application provides a method for generating a virtual object action, where the method includes: acquiring to-be-processed data generated by a target object, wherein the to-be-processed data comprise to-be-processed voice data and to-be-processed facial images; acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules; inputting the data to be processed and the prompt information into a pre-trained action generation model, and acquiring target action data corresponding to the data to be processed output by the action generation model; and controlling the virtual object to execute the action corresponding to the target action data.
In a second aspect, an embodiment of the present application provides a virtual object action generating apparatus, where the apparatus includes: a to-be-processed data acquisition unit, a prompt information acquisition unit, a target action data acquisition unit and an action execution unit. The to-be-processed data acquisition unit is configured to acquire to-be-processed data generated by a target object, where the to-be-processed data includes to-be-processed voice data and a to-be-processed facial image; the prompt information acquisition unit is configured to acquire prompt information corresponding to the to-be-processed data, where the prompt information includes character setting rules; the target action data acquisition unit is configured to input the to-be-processed data and the prompt information into a pre-trained action generation model and acquire target action data corresponding to the to-be-processed data output by the action generation model; and the action execution unit is configured to control the virtual object to execute the action corresponding to the target action data.
In a third aspect, embodiments of the present application provide an electronic device including one or more processors and a memory; one or more programs, wherein the one or more programs are stored in the memory and configured to be executed by the one or more processors, the one or more programs configured to perform the methods described above.
In a fourth aspect, embodiments of the present application provide a computer readable storage medium having program code stored therein, wherein the above-described method is performed when the program code is run.
The embodiment of the application provides a virtual object action generation method and device and electronic equipment. In the virtual object action generation method, the prompt information and the to-be-processed voice data and to-be-processed facial image generated by the target object are input into the action generation model, and the action generation model outputs the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method and device generate actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits actual use scenarios and improves the user experience.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a method for generating a virtual object action according to an embodiment of the present application;
FIG. 2 is a flowchart of a method for generating a virtual object action according to another embodiment of the present application;
FIG. 3 is a flowchart illustrating a method for generating a virtual object action according to still another embodiment of the present application;
FIG. 4 is a flowchart illustrating a method for generating a virtual object action according to still another embodiment of the present application;
FIG. 5 is a schematic diagram illustrating a method for generating a virtual object action according to still another embodiment of the present application;
FIG. 6 is a flowchart of a method for generating a virtual object action according to still another embodiment of the present application;
FIG. 7 is a block diagram of a virtual object action generating apparatus according to still another embodiment of the present application;
FIG. 8 is a block diagram of a virtual object action generating apparatus according to still another embodiment of the present application;
FIG. 9 shows a block diagram of an electronic device for performing the virtual object action generation method according to an embodiment of the present application;
FIG. 10 shows a storage unit for storing or carrying program code implementing the virtual object action generation method according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the terms "first," "second," and the like in the description and claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in other sequences than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or server that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed or inherent to such process, method, article, or apparatus, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In a police exercise scene or an industry training scene, a person's experience can be improved through repeated practice. If real people are used for every exercise, excessive human resources are consumed; the effect of the exercise can therefore be achieved by interacting with virtual objects in a virtual police exercise scene through virtual reality equipment.
In research on related virtual object action generation methods, the inventor found that the electronic device generally generates a text prompt from received visual information or text information, inputs the text prompt into a large language model, and then provides the text output of the large language model to an artificial intelligence agent, which selects an action according to the current environment and the text output. In an actual application scene, however, the user interacts through speech, so selecting actions from text information alone does not fit the actual scene well enough.
Therefore, the inventor provides a virtual object action generation method, a virtual object action generation device and an electronic device in the embodiments of the present application. The virtual object action generation method comprises the following steps: acquiring to-be-processed data generated by a target object, wherein the to-be-processed data comprise to-be-processed voice data and a to-be-processed facial image; acquiring prompt information corresponding to the to-be-processed data; inputting the to-be-processed data and the prompt information into a pre-trained action generation model, and acquiring target action data corresponding to the to-be-processed data output by the action generation model; and controlling the virtual object to execute the action corresponding to the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method and device generate actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits actual use scenarios and improves the user experience.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application provides a method for generating a virtual object action, where the method includes:
step S110: and acquiring to-be-processed data generated by the target object, wherein the to-be-processed data comprise to-be-processed voice data and to-be-processed facial images.
In this embodiment of the present application, the electronic device runs the action generation model in response to an opening instruction. The opening instruction may be an instruction generated by the electronic device upon receiving an opening voice uttered by the target object, or an instruction generated when the target object triggers an opening button on the electronic device. The opening voice is a voice uttered according to a preset opening word, and the opening word is a word corresponding to the action generation model. For example, if the opening word is "drill start", the electronic device runs the action generation model after the target object utters the opening voice corresponding to the opening word. The action generation model is a model that processes to-be-processed voice data and to-be-processed facial image data to generate action data. The electronic device may acquire, through a microphone, the to-be-processed voice data uttered by the target object at the current moment, and may acquire an image of the target object through an image acquisition device, taking the acquired image as the to-be-processed facial image of the target object at the current moment; the image acquisition device may be a camera. The electronic device may be a virtual reality device, the to-be-processed voice data is voice data uttered by the target object, the to-be-processed facial image is an image of the target object's face, and the target object is the object using the electronic device; obviously, there may be one or more target objects.
In the practical application process, the electronic device may enable a voiceprint mode, so that it only accepts to-be-processed voice data generated by one target object and does not pick up other people's voices due to external noise interference while acquiring the to-be-processed data, thereby ensuring that all the to-be-processed voice data is provided by the target object. The voiceprint mode is a mode in which only the voice uttered by a specific voiceprint object is received, and the voiceprint object is an object whose voiceprint has been registered with the electronic device.
As one way, the target object may be defined for use in a preset scenario, which may include police, military, industry training, and game training scenarios, without specific limitation herein. For example, when the target object is in the police exercise scene, the use of the action generation model requires explicit start conditions, pause conditions, resume conditions, and end conditions, and therefore, when the electronic device receives an on instruction issued by the target object, the action generation model is operated so that the action generation model works. And simultaneously, starting to acquire the voice data to be processed and the face image to be processed of the target object in the police training scene, and taking the voice data to be processed and the face image to be processed as the data to be processed.
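For illustration only, the following is a minimal sketch of how the to-be-processed data of step S110 might be collected. The `record_microphone` and `capture_frame` helpers are hypothetical placeholders for whatever audio and camera APIs the virtual reality device actually exposes, and the 5-second recording window is an assumption, not something prescribed by the embodiment.

```python
from dataclasses import dataclass


@dataclass
class PendingData:
    """To-be-processed data generated by the target object (step S110)."""
    voice: bytes        # to-be-processed voice data captured by the microphone
    face_image: bytes   # to-be-processed facial image captured by the camera


def record_microphone(duration_s: float) -> bytes:
    """Hypothetical helper: record `duration_s` seconds of audio from the microphone."""
    raise NotImplementedError


def capture_frame() -> bytes:
    """Hypothetical helper: grab one frame of the target object's face from the camera."""
    raise NotImplementedError


def acquire_pending_data(opening_word: str, heard_phrase: str) -> PendingData | None:
    # The model only starts acquiring data after the opening voice matches the
    # preset opening word (e.g. "drill start" in the police exercise scene).
    if heard_phrase.strip().lower() != opening_word.lower():
        return None
    return PendingData(voice=record_microphone(duration_s=5.0),
                       face_image=capture_frame())
```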
Step S120: and acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character background setting rules.
In the embodiment of the application, since the action generation model includes a large language model, prompt information needs to be input into the action generation model. After the target object designs the prompt information, the electronic device obtains the prompt information and transmits it to the action generation model. The prompt information may be input by voice or typed into an input box, which is not specifically limited here; the prompt information is information input by the target object for guiding the output result of the action generation model. The prompt information includes a character background setting rule, which is a rule for setting the character background of the virtual object; for example, the character background set for the virtual object may be "an elderly man, irritable, fond of ...".
As one way, the prompt information is not fixed and may be changed after the action generation model finishes outputting the target action data.
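As a non-limiting sketch, the prompt information of step S120 could be represented as a small structure whose fields mirror the rules described in the training section below (character background, conversation, state, mood, and consciousness variable change rules). The field names and example values here are assumptions made for illustration, not the actual format used by the embodiment.

```python
# Illustrative prompt information for a police exercise scene (assumed field names).
police_exercise_prompt = {
    # Character background setting rule: persona of the virtual object.
    "character_background": "elderly man, irritable temper, witnessed an intentional "
                            "injury case but the perpetrator is his friend, wants to "
                            "conceal the truth, is being questioned by police at home",
    # Conversation setting rule: controls reply length via a dialogue template.
    "conversation": {"max_reply_sentences": 2},
    # State setting rule: initial emotional state of the virtual object.
    "initial_state": {"emotion": "guarded", "blood_volume": 100},
    # Mood setting rule: reply tone/style of the virtual object.
    "mood": "curt and impatient",
    # Consciousness variable change rule: keyword -> status-field mapping (see step S506).
    "consciousness_change_rules": [
        {"keyword": "police", "field": "emotion", "change": "vigilant"},
        {"keyword": "evidence", "field": "emotion", "change": "anxious"},
    ],
}
```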
Step S130: inputting the data to be processed and the prompt information into a pre-trained action generation model, and obtaining target action data corresponding to the data to be processed, which is output by the action generation model.
In the embodiment of the application, after the electronic device acquires the to-be-processed data and the prompt information of the target object, the to-be-processed data and the prompt information are input into the action generation model trained in advance, and the action generation model processes the to-be-processed data according to the input prompt information, so that target action data corresponding to the to-be-processed data is obtained. According to the difference of the input prompt information, even if the same data to be processed is input, the target action data output by the action generating model may have a difference.
In an embodiment of the present application, the training process of the action generation model includes:
step S131: the method comprises the steps of obtaining a training data set, wherein the training data set comprises a plurality of prompt messages, a plurality of text data and a plurality of expression data, the text data are text data generated by different objects collected under a plurality of application scenes, the expression data are expression data generated by different objects collected under a plurality of application scenes, the prompt messages are rules set based on the text data and the expression data, and the prompt messages comprise at least one of character background setting rules, talking setting rules, state setting rules, mood setting rules and consciousness variable changing rules.
In this embodiment of the present application, the training data set includes a plurality of pieces of prompt information, a plurality of pieces of text data and a plurality of pieces of expression data, where the text data is text data generated by different objects collected in a plurality of application scenes, the expression data is expression data generated by different objects collected in a plurality of application scenes, and the prompt information includes at least one of a character background setting rule, a conversation setting rule, a state setting rule, a mood setting rule and a consciousness variable change rule. The character background setting rule is a rule for setting the character background of the virtual object, and the character background may include personality, age, profession and preferences; the conversation setting rule is a rule for setting the length of a conversation controlled by a conversation template; the state setting rule is a rule for setting the initial emotional state of the virtual object; the mood setting rule is a rule for setting the reply tone of the virtual object; and the consciousness variable change rule is a rule specifying the conditions under which a consciousness variable is generated. The electronic device may obtain training data from publicly available web pages, news and social media, or may obtain training data from an existing open-source database, which is not specifically limited here.
Step S132: and preprocessing the training data set to obtain a preprocessed training data set.
In the embodiment of the application, in order to improve the quality and consistency of the data so as to train a more accurate and stable model, the training data set needs to be preprocessed. The preprocessing may include performing language identification and filtering on the training data set, performing word segmentation on the filtered training data set to obtain a plurality of words, and converting the words into vectors to facilitate model computation. Language identification and filtering are mainly used to identify the target language and filter out non-target languages; for example, if the target language is set to English, data that is not in English is filtered out. After the training data set is obtained, the above preprocessing operations are performed on it to obtain the preprocessed training data set.
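A minimal sketch of the preprocessing of step S132, assuming English is the target language; the language detector, tokenizer and embedding lookup are hypothetical stand-ins for whichever components an implementation actually uses.

```python
from typing import Callable, Iterable


def preprocess_dataset(
    samples: Iterable[str],
    detect_language: Callable[[str], str],   # hypothetical language identifier
    tokenize: Callable[[str], list[str]],    # hypothetical word-segmentation function
    embed: Callable[[str], list[float]],     # hypothetical word -> vector lookup
    target_language: str = "en",
) -> list[list[list[float]]]:
    """Language filtering + word segmentation + vectorization (step S132)."""
    vectorized = []
    for text in samples:
        # Language identification and filtering: drop non-target-language data.
        if detect_language(text) != target_language:
            continue
        # Word segmentation, then convert each word into a vector for the model.
        vectorized.append([embed(word) for word in tokenize(text)])
    return vectorized
```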
Step S133: and inputting the preprocessed training data set into a model to be trained, and obtaining action data corresponding to the preprocessed training data set through forward propagation.
In the embodiment of the application, the model to be trained is a text-processing model constructed based on the Transformer model. Before the preprocessed training data set is input into the model to be trained, the model parameters of the model to be trained are initialized; the initialization may be random. After parameter initialization is completed, the preprocessed training data set is input into the model to be trained and converted into corresponding action data through forward propagation.
Step S134: and obtaining a loss function value based on the action data.
In this embodiment of the present application, after the model to be trained determines the action data corresponding to the training data set, the loss function value may be obtained by calculating a preset loss function, where the preset loss function may be a logistic regression loss function, a binary classification loss function, or the like, which is not specifically limited here.
Step S135: and carrying out iterative training on the model to be trained based on the loss function value until the training ending condition is met, so as to obtain the action generating model.
In the embodiment of the application, the loss function value gradually decreases during the iterative training of the model to be trained. When the loss function value decreases to a preset value, it is determined that the training end condition is met, model training can be judged to be complete, and the action generation model is obtained. Illustratively, when the loss function value drops to 0.05, model training is judged to be complete and the action generation model is obtained.
As one way, the number of training iterations of the model to be trained may be used as the training end condition. For example, iterative training is performed on the model to be trained until it has been trained 10 times, at which point the training end condition is determined to be met and the action generation model is obtained.
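For concreteness, the following is a generic sketch of the iterative training of steps S133 to S135, written with PyTorch as an assumed framework. The optimizer, the cross-entropy placeholder for the preset loss function, the 0.05 loss threshold and the 10-epoch cap are illustrative choices taken from the examples above, not prescribed hyperparameters.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader


def train_action_model(model: nn.Module,
                       loader: DataLoader,
                       max_epochs: int = 10,
                       loss_threshold: float = 0.05) -> nn.Module:
    """Iteratively train the model to be trained until a training end condition is met."""
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = nn.CrossEntropyLoss()           # placeholder for the preset loss function
    for _epoch in range(max_epochs):          # end condition 2: fixed number of iterations
        epoch_loss = 0.0
        for features, action_labels in loader:
            optimizer.zero_grad()
            logits = model(features)          # forward propagation -> action data
            loss = loss_fn(logits, action_labels)
            loss.backward()                   # propagate the loss function value
            optimizer.step()
            epoch_loss += loss.item()
        mean_loss = epoch_loss / max(len(loader), 1)
        if mean_loss <= loss_threshold:       # end condition 1: loss below preset value
            break
    return model
```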
Step S140: and controlling the virtual object to execute the action corresponding to the target action data.
In the embodiments of the present application, a virtual object refers to a digital person generated by a virtual reality device. After the target action data corresponding to the to-be-processed data is determined, the virtual object is controlled to execute the corresponding action according to the target action data. For example, the eyeballs are driven and pupil constriction is controlled according to the eye data included in the target action data; corresponding speech is produced according to the voice data included in the target action data, while the lips are controlled to make the corresponding lip-shape changes according to the produced speech; and the face is controlled to make the corresponding expression according to the expression data included in the target action data, which is not specifically limited here.
As one way, since the action generation model requires a certain amount of time to process the to-be-processed data and the prompt information, a piece of pending expression data is set in advance for the virtual object; whenever the action generation model is processing the to-be-processed data, the virtual object executes the action corresponding to this pending expression data.
According to the virtual object action generation method, to-be-processed data generated by the target object is first acquired, where the to-be-processed data includes to-be-processed voice data and a to-be-processed facial image; prompt information corresponding to the to-be-processed data is then acquired, where the prompt information includes character setting rules; the to-be-processed data and the prompt information are input into the pre-trained action generation model, and the target action data corresponding to the to-be-processed data output by the action generation model is acquired; and finally the virtual object is controlled to execute the action corresponding to the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method generates actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits actual use scenarios and improves the user experience.
Referring to fig. 2, an embodiment of the present application provides a method for generating a virtual object action, where the method includes:
step S210: and acquiring to-be-processed data generated by the target object, wherein the to-be-processed data comprise to-be-processed voice data and to-be-processed facial images.
Step S220: and acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules.
The specific explanation of step S210 to step S220 in the above embodiment can be referred to, so that the details are not repeated in this embodiment.
Step S230: and performing voice conversion on the voice data to be processed to obtain text data to be processed corresponding to the voice data to be processed.
In the embodiment of the application, after the to-be-processed voice data is obtained, speech recognition (Automatic Speech Recognition) is performed on it to convert the to-be-processed voice data into text, and the text is then converted, through natural language understanding, into a structured representation that the electronic device can understand. The voice conversion operation is then determined to be complete, and the obtained structured representation is used as the to-be-processed text data.
As one way, the voice conversion operation on the to-be-processed voice data can be integrated into a voice conversion module, so that the to-be-processed voice data can be input into the voice conversion module, and the voice conversion module performs voice conversion on it and outputs the to-be-processed text data corresponding to the to-be-processed voice data.
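A minimal sketch of the voice conversion module of step S230, with the speech recognition and natural-language-understanding back ends left as hypothetical callables, since the embodiment does not name specific engines.

```python
from typing import Callable


class VoiceConversionModule:
    """Turns to-be-processed voice data into to-be-processed text data (step S230)."""

    def __init__(self,
                 recognize: Callable[[bytes], str],      # hypothetical ASR engine
                 understand: Callable[[str], dict]):     # hypothetical NLU parser
        self.recognize = recognize
        self.understand = understand

    def convert(self, voice: bytes) -> dict:
        transcript = self.recognize(voice)       # speech recognition: audio -> text
        return self.understand(transcript)       # NLU: text -> structured representation
```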
Step S240: and carrying out expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed.
In this embodiment of the present application, after the to-be-processed facial image is obtained, a preprocessing operation is performed on it, which may include normalizing the size and gray level of the facial image, correcting the head pose, image segmentation, and the like. Feature extraction is then performed on the preprocessed facial image, and expression recognition is performed according to the extracted features. The feature extraction manner may include geometric feature extraction, overall statistical feature extraction, and frequency-domain-based feature extraction, which is not specifically limited here. Taking geometric feature extraction as an example, the positions and position changes of salient features in the to-be-processed facial image are located and measured, features such as their size, distance, shape and mutual proportion are determined, and expression recognition is performed according to the determined features; after expression recognition is completed, the to-be-processed expression data corresponding to the to-be-processed facial image is obtained. The salient features may include the eyes, eyebrows, nose, mouth, and the like, which are not specifically limited here.
As a way, the operation of performing the expression recognition on the facial image to be processed can be integrated into one expression module, and then the facial image to be processed can be input into the expression module, and the expression module performs the expression recognition on the facial image to be processed, so that the expression data to be processed corresponding to the facial image to be processed is output.
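A sketch of the expression module of step S240 using the geometric-feature route described above; the preprocessing function, the face landmark detector and the expression classifier are assumed components, and the expression labels are illustrative.

```python
from typing import Callable


def recognize_expression(face_image: bytes,
                         preprocess: Callable[[bytes], bytes],       # size/gray normalization, pose correction
                         detect_landmarks: Callable[[bytes], dict],  # hypothetical: eyes, eyebrows, nose, mouth positions
                         classify: Callable[[dict], str]) -> dict:
    """Face image -> to-be-processed expression data (step S240)."""
    normalized = preprocess(face_image)
    # Geometric feature extraction: locate salient features and measure their
    # size, distance, shape and mutual proportion.
    landmarks = detect_landmarks(normalized)
    label = classify(landmarks)   # e.g. "serious", "angry", "neutral" (illustrative labels)
    return {"expression": label, "landmarks": landmarks}
```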
Step S250: inputting the text data to be processed, the expression data to be processed and the prompt information into the action generation model, and obtaining target action data corresponding to the data to be processed output by the action generation model.
In the embodiment of the application, after the to-be-processed text data and the to-be-processed expression data are obtained, they are input together with the prompt information into the action generation model, and the action generation model processes the to-be-processed text data and the to-be-processed expression data according to the prompt information, thereby outputting the target action data.
For example, in a police training scene, the action generation model obtains prompt information indicating that the virtual object plays an elderly man with a violent temper who witnessed an intentional injury case, but the perpetrator is his friend, so he wants to conceal the truth, and he is now at home being questioned by the police. The target object plays a police officer, and the to-be-processed voice uttered by the target object is "We heard that you were at the scene of the case; did you see who injured the victim?", while the facial expression is serious. The electronic device performs voice conversion on the to-be-processed voice to obtain the to-be-processed text data, and at the same time performs expression recognition on the to-be-processed facial image to obtain the to-be-processed expression data. The to-be-processed text data and the to-be-processed expression data are input into the action generation model, which processes them according to the prompt information; the output target action data may indicate that the virtual object responds evasively with an impatient expression. Obviously, if the prompt information of the virtual object is changed, the target action data output by the action generation model may differ. For example, if the prompt information obtained by the action generation model is that the virtual object plays an elderly man with an honest and kind character who witnessed an intentional injury case and is now at home being questioned by the police, then after the to-be-processed text data and the to-be-processed expression data are input into the action generation model, the output target action data may indicate that the virtual object tells what he knows of the situation with a sincere expression.
Step S260: and controlling the virtual object to execute the action corresponding to the target action data.
Step S260 may be specifically explained with reference to the above embodiments, so that details are not repeated in this embodiment.
According to the virtual object action generation method, to-be-processed data generated by the target object in the police exercise scene is first acquired; prompt information corresponding to the to-be-processed data is then acquired, where the prompt information includes character setting rules; voice conversion is performed on the to-be-processed voice data to obtain the corresponding to-be-processed text data, and at the same time expression recognition is performed on the to-be-processed facial image to obtain the corresponding to-be-processed expression data; the to-be-processed text data, the to-be-processed expression data and the prompt information are input into the action generation model, and the target action data corresponding to the to-be-processed data output by the action generation model is acquired; and finally the virtual object is controlled to execute the action corresponding to the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method generates actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits the actual police exercise scene and improves the user experience.
Referring to fig. 3, an embodiment of the present application provides a method for generating a virtual object action, where the method includes:
step S310: and acquiring data to be processed generated by the target object in the police drilling scene.
Step S320: and acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules.
Step S330: and performing voice conversion on the voice data to be processed to obtain text data to be processed corresponding to the voice data to be processed.
Step S340: and carrying out expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed.
The specific explanation of step S310 to step S340 in the above embodiment can be referred to, so that the details are not repeated in this embodiment.
Step S350: and inputting the text data to be processed and the prompt information into the first prediction module to acquire target behavior data output by the first prediction module.
In the embodiment of the application, the to-be-processed text data and the prompt information are input into the first prediction module included in the action generation model, and the first prediction module analyzes and processes the to-be-processed text data according to the prompt information, thereby outputting the target behavior data.
Step S360: and inputting the text data to be processed into the second prediction module, and obtaining target voice data output by the second prediction module.
In the embodiment of the application, the text data to be processed is input into the second prediction module, and the second prediction module analyzes and processes the text data to be processed, so that the target voice data is output and obtained.
Step S370: and inputting the expression data to be processed into the first prediction module or the second prediction module, and obtaining target expression data output by the first prediction module or the second prediction module.
In the embodiment of the application, the first prediction module and the second prediction module share a common prediction unit. The to-be-processed expression data is input into the prediction unit included in the first prediction module or the second prediction module; the prediction unit analyzes the to-be-processed expression data and, in combination with the prompt information and the to-be-processed text data, outputs the target expression data. The prediction unit may be a large language model.
Step S380: and taking the target behavior data, the target voice data and the target expression data as the target action data.
In the embodiment of the application, the acquired target behavior data, target voice data and target expression data are integrated, and the integrated result is used as target action data.
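The split into a first prediction module (behavior), a second prediction module (voice) and a shared prediction unit (expression) described in steps S350 to S380 could be organized as below. This is a structural sketch only: the three callables are placeholders for the components detailed in the later embodiments, and the data types are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class TargetActionData:
    behavior: Any    # target behavior data from the first prediction module
    voice: Any       # target voice data from the second prediction module
    expression: Any  # target expression data from the shared prediction unit


def generate_target_action(
    text: dict, expression: dict, prompt: dict,
    predict_behavior: Callable[[dict, dict], Any],          # first prediction module
    predict_voice: Callable[[dict], Any],                   # second prediction module
    predict_expression: Callable[[dict, dict, dict], Any],  # shared prediction unit (large language model)
) -> TargetActionData:
    """Integrate the three outputs into the target action data (step S380)."""
    return TargetActionData(
        behavior=predict_behavior(text, prompt),
        voice=predict_voice(text),
        expression=predict_expression(expression, prompt, text),
    )
```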
Step S390: and controlling the virtual object to execute the action corresponding to the target action data.
In this embodiment of the present application, the facial expression module, the behavior execution module and the voice module of the virtual object may be set to be independent of each other. After the target action data is obtained, the facial expression module is controlled to execute the expression action corresponding to the target expression data included in the target action data, the behavior execution module is controlled to execute the behavior action corresponding to the target behavior data, and the voice module is controlled to execute the voice action corresponding to the target voice data; the expression action, the behavior action and the voice action together constitute the action of the virtual object. The facial expression module is the module that executes the target expression data, the behavior execution module is the module that executes the target behavior data, and the voice module is the module that executes the target voice data; the expression action represents the expression displayed by the virtual object, the behavior action represents the behavior displayed by the virtual object, and the voice action represents the voice uttered by the virtual object.
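Because the three modules run independently, step S390 can be pictured as a simple dispatch of the target action data, reusing the TargetActionData structure from the earlier sketch. The driver objects and their play interface are assumptions for illustration, not part of the embodiment.

```python
class VirtualObject:
    """Digital person whose expression, behavior and voice modules run independently."""

    def __init__(self, expression_driver, behavior_driver, voice_driver):
        self.expression_driver = expression_driver   # e.g. facial rig / blend-shape controller
        self.behavior_driver = behavior_driver       # e.g. skeletal animation controller
        self.voice_driver = voice_driver             # e.g. audio playback with lip sync

    def perform(self, action) -> None:
        # `action` is assumed to carry expression, behavior and voice fields
        # (see the TargetActionData sketch above); each module executes its own
        # part, and the three results together form the action of the virtual object.
        self.expression_driver.play(action.expression)
        self.behavior_driver.play(action.behavior)
        self.voice_driver.play(action.voice)
```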
In one manner, when the target object pauses its interaction with the virtual object, the electronic device stores the interaction process between the target object and the virtual object. During the period in which the target object's interaction is paused, another user is allowed to use the electronic device to interact with the virtual object. If that user stops or finishes interacting with the virtual object, then when the target object restarts the interaction, the electronic device retrieves the stored interaction process, so that the target object's interaction with the virtual object is resumed seamlessly, which can improve the user experience.
According to the virtual object action generation method, the to-be-processed text data and the prompt information are first input into the first prediction module to obtain the target behavior data output by the first prediction module; the to-be-processed text data is input into the second prediction module to obtain the target voice data output by the second prediction module; the to-be-processed expression data is input into the first prediction module or the second prediction module to obtain the target expression data output by the first prediction module or the second prediction module; the obtained target behavior data, target voice data and target expression data are then used as the target action data, and the virtual object is controlled to execute the action corresponding to the target action data. Compared with the related art, in which actions are generated only from text or visual data, the method generates actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits the actual police exercise scene and improves the user experience.
Referring to fig. 4, an embodiment of the present application provides a method for generating a virtual object action, where the method includes:
step S401: and acquiring data to be processed generated by the target object in the police drilling scene.
Step S402: and acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules.
Step S403: and performing voice conversion on the voice data to be processed to obtain text data to be processed corresponding to the voice data to be processed.
Step S404: and carrying out expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed.
The specific explanation of step S401 to step S404 in the above embodiment can be referred to, so that the details are not repeated in this embodiment.
Step S405: and inputting the text data to be processed and the prompt information into the large language model to acquire conscious variables of the large language model data.
In the embodiment of the application, after the to-be-processed text data is obtained, the to-be-processed text data and the prompt information are input into the large language model in the first prediction module. The large language model processes the to-be-processed text data according to the prompt information, understands the to-be-processed text data, and determines the current state of the target object, thereby outputting the consciousness variable according to the prompt information. For example, in a police exercise scene, the prompt information is set as: the virtual object has a violent temper, witnessed an intentional injury case, but the perpetrator is his friend, so he wants to conceal the truth, and he is now being questioned at home. The prompt information and the to-be-processed text data are input into the large language model; the large language model understands the to-be-processed text data and determines that the text expresses the meaning "if you do not tell the truth, you are deliberately shielding the perpetrator". At this point, the consciousness variable output by the large language model according to the prompt information may be a consciousness variable expressing anger.
Step S406: and inputting the conscious variable into the behavior selection module, and acquiring the target behavior data output by the behavior selection module.
In the embodiment of the application, the consciousness variable is input into the behavior selection module. The behavior selection module maintains the current character consciousness variable of the virtual object; it applies the acquired consciousness variable to the character consciousness variable to update it, obtaining an updated character consciousness variable. According to the updated character consciousness variable, the behavior selection module selects, from a preset behavior selection library, the target behavior data corresponding to the updated character consciousness variable and outputs the target behavior data. The character consciousness variable is used to characterize the consciousness of the virtual object in response to the speech of the target object; the selected target behavior data corresponds to this consciousness, which is generated before the target behavior data is executed. For example, following the example in step S405, the consciousness variable output at this point may be a consciousness variable expressing anger. The consciousness variable is input into the behavior selection module, which updates the character consciousness variable accordingly; the updated character consciousness variable may indicate that the virtual object is about to slam the table to express anger, so the behavior selection module selects the target behavior data corresponding to slamming the table from the behavior selection library.
Step S407: inputting the text data to be processed into the large language model, and obtaining the language variable and the target text data output by the large language model.
In the embodiment of the application, after voice conversion is performed on the to-be-processed voice data to obtain the to-be-processed text data, the to-be-processed text data is input into the large language model, and the large language model outputs the mood variable and the target text data according to the prompt information, where the mood variable is the tone determined according to the prompt information. For example, in a police training scene, the action generation model obtains prompt information indicating that the virtual object plays an elderly man with a violent temper who witnessed an intentional injury case, but the perpetrator is his friend, so he wants to conceal the truth, and he is being questioned by the police at home. The to-be-processed text data at this point is "We know that you were at the scene at the time; please cooperate with our investigation." Combining the prompt information, the target text data generated by the large language model may be "I have no idea what you are talking about", and the generated mood variable is an impatient tone.
Step S408: and inputting the mood variable and the target text data into the voice generating module to acquire the target voice data output by the voice generating module.
In the embodiment of the application, the mood variable and the target text data are input into the voice generation module. The voice generation module recognizes the target text data and performs text-to-speech conversion to obtain reference voice data, and then combines the mood variable with the reference voice data to obtain the target voice data. When the virtual object executes the voice action corresponding to the target voice data, the voice action carries the corresponding tone, which makes the interaction process more realistic and improves the user experience. The reference voice data is the voice data obtained by performing text-to-speech conversion on the target text data.
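A sketch of the voice generation module of steps S407 and S408: the text-to-speech engine and the prosody adjustment are hypothetical callables, and the point is only that the mood variable is applied on top of the reference voice data obtained from the target text data.

```python
from typing import Callable


def generate_target_voice(target_text: str,
                          mood: str,
                          text_to_speech: Callable[[str], bytes],    # hypothetical TTS engine
                          apply_mood: Callable[[bytes, str], bytes]  # hypothetical prosody/tone adjustment
                          ) -> bytes:
    """Target text data + mood variable -> target voice data."""
    reference_voice = text_to_speech(target_text)   # reference voice data (plain TTS output)
    # e.g. mood == "impatient" might speed up and sharpen the delivery (illustrative).
    return apply_mood(reference_voice, mood)
```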
Step S409: inputting the expression data to be processed into a large language model included in the first prediction module or the second prediction module, and obtaining the target expression data output by the large language model.
In the embodiment of the application, after expression recognition is performed on the to-be-processed facial image to obtain the to-be-processed expression data, word segmentation is performed on the to-be-processed text data through the large language model to obtain a plurality of keywords included in the to-be-processed text data. The keywords are analyzed, and at the same time the to-be-processed expression data is analyzed to determine the current state of the target object, so that the target expression data is output.
Step S410: and controlling the virtual object to execute the action corresponding to the target action data.
Step S410 may be specifically explained with reference to the above embodiments, so that details are not repeated in this embodiment.
As shown in FIG. 5, steps S401 to S410 may be performed as follows. The prompt information includes a character background setting rule, a conversation setting rule, a state setting rule, a mood setting rule and a consciousness variable change rule. First, the electronic device responds to an opening instruction to acquire the to-be-processed voice data and the to-be-processed facial image of the target object. The voice conversion module performs voice recognition on the to-be-processed voice data and converts the recognized result into text, thereby obtaining the to-be-processed text data; at the same time, the expression module performs expression recognition on the to-be-processed facial image, thereby obtaining the to-be-processed expression data. The prompt information, the to-be-processed text data and the to-be-processed expression data are then input into the large language model, which processes them and outputs the consciousness variable, the mood variable, the target text data and the target expression data. The consciousness variable is input into the behavior selection module, which outputs the target behavior data; meanwhile, the mood variable and the target text data are input into the voice generation module, which performs text-to-speech conversion on the target text data and applies the mood variable to the resulting voice data to obtain the target voice data.
According to the virtual object action generation method, the to-be-processed text data and the prompt information are first input into the large language model to obtain the consciousness variable output by the large language model; the consciousness variable is input into the behavior selection module to obtain the target behavior data output by the behavior selection module; the to-be-processed text data is input into the large language model to obtain the mood variable and the target text data output by the large language model; the mood variable and the target text data are input into the voice generation module to obtain the target voice data output by the voice generation module; and the to-be-processed expression data is input into the large language model included in the first prediction module or the second prediction module to obtain the target expression data output by the large language model. Compared with the related art, in which actions are generated only from text or visual data, the method generates actions from the target object's to-be-processed voice data, facial image, and prompt information, which better fits the actual police exercise scene and improves the user experience.
Referring to fig. 6, an embodiment of the present application provides a method for generating a virtual object action, where the method includes:
Step S501: and acquiring data to be processed generated by the target object in the police drilling scene.
Step S502: and acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules.
Step S503: and performing voice conversion on the voice data to be processed to obtain text data to be processed corresponding to the voice data to be processed.
Step S504: and carrying out expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed.
The specific details of step S501 to step S504 can be referred to in the above embodiments, so that the details are not repeated in this embodiment.
Step S505: and inputting the text data to be processed into the large language model, and determining a plurality of keywords included in the text data to be processed through the large language model.
In the embodiment of the application, the text data to be processed is input into a large language model, and word segmentation operation is performed through the large language model, so that a plurality of keywords included in the text data to be processed are obtained.
Step S506: and if the large language model determines that any keyword in the keywords hits the conscious variable change item in the conscious variable change rule, acquiring the conscious variable output by the large language model, wherein the conscious variable is a variable of a state field corresponding to the hit conscious variable change item, and the conscious variable change item comprises a mapping relation between the keyword and the state field.
As one way, the prompt information input into the large language model includes a consciousness variable change rule, in which there are a plurality of consciousness variable change items. Each consciousness variable change item includes a mapping relation between a keyword and a status field, and one keyword may relate to a plurality of consciousness variable change items. If, among the plurality of keywords obtained by the large language model, a keyword hits a consciousness variable change item in the consciousness variable change rule, the large language model determines, according to the prompt information, how the status field in the hit consciousness variable change item changes; the changes of a plurality of status fields can thus be determined and integrated to obtain the consciousness variable. The status field is used to characterize a state-related field of the virtual object, and the state may include the blood volume, emotion and the like of the virtual object.
Obviously, the change condition of the same state field may differ for different prompt information. For example, when the keyword is "alarm", the state field representing emotion may change differently depending on the prompt information: if the character of the virtual object is set as violent in the prompt information, the state field representing emotion may change toward vigilance; if the character of the virtual object is set as timid, the state field representing emotion may change toward fear.
As another way, if the large language model determines that none of the generated keywords hits a conscious variable change item in the conscious variable change rule, the large language model does not output a conscious variable; the behavior selection module therefore does not output target behavior data, and the virtual object keeps the behavior action it was performing when the data to be processed of the target object was acquired. Alternatively, a default behavior action may be preset; when it is determined that no keyword among the plurality of keywords hits a conscious variable change item in the conscious variable change rule, the behavior execution module of the virtual object executes the default behavior action. In either case, this does not affect the virtual object executing the voice action corresponding to the target voice data or the expression corresponding to the target expression data.
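As a purely illustrative sketch (the keywords, state fields and values below are assumptions, not content of this application), a conscious variable change rule can be pictured as a keyword-to-state-field mapping whose effect depends on the character set in the prompt information, with a None result standing for the no-hit case described above:

```python
# Hypothetical conscious variable change rule: each change item maps a keyword
# to a state field and a change whose direction depends on the character setting.
CHANGE_RULE = {
    "alarm":  {"violent": ("emotion", "vigilant"), "timid": ("emotion", "fearful")},
    "weapon": {"violent": ("emotion", "hostile"),  "timid": ("emotion", "panicked")},
}

def conscious_variable(keywords, character):
    """Integrate the change conditions of all hit state fields into one conscious variable."""
    changes = {}
    for keyword in keywords:
        hit = CHANGE_RULE.get(keyword, {}).get(character)
        if hit:
            field, value = hit
            changes[field] = value
    # An empty result corresponds to the no-hit case: no conscious variable is output.
    return changes or None

print(conscious_variable(["alarm", "hello"], "violent"))  # {'emotion': 'vigilant'}
print(conscious_variable(["hello"], "timid"))             # None
```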
Step S507: inputting the conscious variable into the behavior selection module, and acquiring the target behavior data output by the behavior selection module, wherein the target behavior data is obtained by updating the state field included in a preset state table by the behavior selection module based on the conscious variable.
In the embodiment of the application, the acquired conscious variable is input into the behavior selection module. The behavior selection module correspondingly updates the plurality of state fields included in the preset state table according to the change conditions of the plurality of state fields carried in the acquired conscious variable, so as to obtain an updated preset state table. The behavior selection module then determines, through the updated preset state table, a behavior action to be executed from the behavior selection library, and outputs the target behavior data associated with the behavior action.
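A minimal sketch of how such a behavior selection module might update the preset state table and pick a behavior action from a behavior selection library is given below; the table contents and behavior names are assumptions for illustration only, not the patented implementation.

```python
# Illustrative behavior selection: update the preset state table with the
# conscious variable, then look up a behavior action in the behavior library.
PRESET_STATE_TABLE = {"emotion": "calm", "health": 100}

BEHAVIOR_LIBRARY = {
    "calm": "stand_idle",
    "vigilant": "raise_hands_slowly",
    "fearful": "step_back",
}
DEFAULT_BEHAVIOR = "stand_idle"  # executed when no conscious variable is output

def select_behavior(conscious_var, state_table=PRESET_STATE_TABLE):
    if conscious_var:
        state_table.update(conscious_var)  # update the hit state fields
    behavior = BEHAVIOR_LIBRARY.get(state_table.get("emotion"), DEFAULT_BEHAVIOR)
    # The returned dictionary stands in for the target behavior data.
    return {"behavior": behavior, "state": dict(state_table)}

print(select_behavior({"emotion": "vigilant"}))
```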
Step S508: inputting the text data to be processed into the large language model, and obtaining the mood variable and the target text data output by the large language model.
Step S509: and inputting the mood variable and the target text data into the voice generating module to acquire the target voice data output by the voice generating module.
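By way of illustration only, the sketch below assumes a voice generating module that exposes prosody controls; the mapping from the mood variable to rate and pitch is an assumption and is not defined by this application.

```python
# Hypothetical mapping from the mood variable to prosody parameters of a
# text-to-speech backend; actual synthesis is replaced by a returned dictionary.
MOOD_TO_PROSODY = {
    "angry":   {"rate": 1.2, "pitch": 1.1},
    "fearful": {"rate": 1.1, "pitch": 1.2},
    "calm":    {"rate": 1.0, "pitch": 1.0},
}

def generate_speech(mood: str, target_text: str) -> dict:
    prosody = MOOD_TO_PROSODY.get(mood, MOOD_TO_PROSODY["calm"])
    # A real voice generating module would synthesize audio for target_text here;
    # the dictionary below stands in for the target voice data.
    return {"text": target_text, **prosody}

print(generate_speech("fearful", "Don't shoot, I will cooperate."))
```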
Step S510: inputting the expression data to be processed into a large language model included in the first prediction module or the second prediction module, and obtaining the target expression data output by the large language model.
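For illustration, the following sketch shows one way the expression data to be processed and the character setting could be assembled into a prompt for the large language model in step S510; the prompt wording and parameter names are hypothetical.

```python
# Hypothetical prompt construction: the large language model is asked to output
# the target expression of the virtual object based on the recognized expression.
def build_expression_prompt(expression_to_process: str, character_setting: str) -> str:
    return (
        f"Character setting: {character_setting}.\n"
        f"The person in front of you currently looks {expression_to_process}.\n"
        "Reply with the single facial expression the character should show."
    )

print(build_expression_prompt("nervous", "violent suspect in a police exercise"))
```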
Step S511: and controlling the virtual object to execute the action corresponding to the target action data.
For the details of step S508 to step S511, reference may be made to the above embodiments, and the details are not repeated in this embodiment.
According to the virtual object action generating method, the text data to be processed is first input into the large language model, and a plurality of keywords included in the text data to be processed are determined by the large language model. If the large language model determines that any keyword in the plurality of keywords hits a conscious variable change item in the conscious variable change rule, the conscious variable output by the large language model is obtained, wherein the conscious variable is the variable of the state field corresponding to the hit conscious variable change item, and the conscious variable change item includes the mapping relationship between the keyword and the state field. The conscious variable is then input into the behavior selection module to obtain the target behavior data output by the behavior selection module, the target behavior data being obtained by the behavior selection module updating the state fields included in the preset state table based on the conscious variable. The text data to be processed is also input into the large language model to obtain the mood variable and the target text data output by the large language model, and the mood variable and the target text data are input into the voice generating module to obtain the target voice data output by the voice generating module. The expression data to be processed is input into the large language model included in the first prediction module or the second prediction module to obtain the target expression data output by the large language model. Compared with the prior art in which actions are generated only from text or visual data, the present application generates actions from the to-be-processed voice data, the facial image and the prompt information of the target object, which better fits the actual scenario of a police exercise and improves the use experience of the user.
Referring to fig. 7, an embodiment of the present application provides a virtual object action generating apparatus 600, where the apparatus 600 includes:
the to-be-processed data acquiring unit 610 is configured to acquire to-be-processed data generated by the target object, where the to-be-processed data includes to-be-processed voice data and to-be-processed facial images.
As one way, the data to be processed acquiring unit 610 is further configured to acquire data to be processed generated by the target object in the police exercise scene.
The prompt information obtaining unit 620 is configured to obtain prompt information corresponding to the data to be processed, where the prompt information includes a person setting rule.
The target action data obtaining unit 630 is configured to input the data to be processed and the prompt information into a pre-trained action generation model, and obtain target action data corresponding to the data to be processed output by the action generation model.
As a way, the target action data obtaining unit 630 is further configured to perform voice conversion on the voice data to be processed, so as to obtain text data to be processed corresponding to the voice data to be processed; performing expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed; inputting the text data to be processed, the expression data to be processed and the prompt information into the action generation model, and obtaining target action data corresponding to the data to be processed output by the action generation model.
Optionally, the target action data obtaining unit 630 is further configured to input the text data to be processed and the prompt information to the first prediction module, and obtain target action data output by the first prediction module; inputting the text data to be processed into the second prediction module, and obtaining target voice data output by the second prediction module; inputting the expression data to be processed into the first prediction module or the second prediction module, and obtaining target expression data output by the first prediction module or the second prediction module; and taking the target behavior data, the target voice data and the target expression data as the target action data.
Optionally, the target action data obtaining unit 630 is further configured to input the text data to be processed and the prompt information into the large language model, and obtain the conscious variable output by the large language model; and input the conscious variable into the behavior selection module, and acquire the target behavior data output by the behavior selection module.
Optionally, the target action data obtaining unit 630 is further configured to input the text data to be processed into the large language model, and obtain a mood variable and target text data output by the large language model; and inputting the mood variable and the target text data into the voice generating module to acquire the target voice data output by the voice generating module.
Optionally, the target action data obtaining unit 630 is further configured to input the expression data to be processed into a large language model included in the first prediction module or the second prediction module, and obtain the target expression data output by the large language model.
Optionally, the target action data obtaining unit 630 is further configured to input the text data to be processed into the large language model, and determine, through the large language model, a plurality of keywords included in the text data to be processed; if the large language model determines that any keyword in the plurality of keywords hits a conscious variable change item in the conscious variable change rule, acquire a conscious variable output by the large language model, wherein the conscious variable is a variable of a state field corresponding to the hit conscious variable change item, and the conscious variable change item comprises a mapping relation between the keyword and the state field; and input the conscious variable into the behavior selection module, and acquire the target behavior data output by the behavior selection module, wherein the target behavior data is obtained by updating the state field included in a preset state table by the behavior selection module based on the conscious variable.
And an action execution unit 640 for controlling the virtual object to execute an action corresponding to the target action data.
Referring to fig. 8, the apparatus 600 further includes:
a model training unit 650, configured to: acquire a training data set, wherein the training data set comprises a plurality of pieces of prompt information, a plurality of pieces of text data and a plurality of pieces of expression data, the text data are text data generated by different objects collected in a plurality of application scenes, the expression data are expression data generated by different objects collected in a plurality of application scenes, the prompt information is a rule set based on the text data and the expression data, and the prompt information comprises at least one of a character background setting rule, a talking setting rule, a state setting rule, a mood setting rule, a conscious variable change rule and an expression action setting rule; preprocess the training data set to obtain a preprocessed training data set; input the preprocessed training data set into a model to be trained, and obtain action data corresponding to the preprocessed training data set through forward propagation; obtain a loss function value based on the action data; and perform iterative training on the model to be trained based on the loss function value until a training end condition is met, so as to obtain the action generation model.
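A schematic training loop corresponding to the steps handled by the model training unit 650 (preprocessing, forward propagation, loss computation, iterative training) might look like the following sketch; the placeholder model, loss function and data layout are assumptions for illustration and do not reflect the actual structure of the action generation model.

```python
# Schematic training loop; forward() and loss_fn() are placeholders only.
def preprocess_sample(sample: dict) -> dict:
    sample = dict(sample)
    sample["text"] = sample["text"].strip().lower()  # e.g. simple text cleaning
    return sample

def forward(params: dict, sample: dict) -> dict:
    # Stand-in forward propagation producing action data for the sample.
    return {"action": params.get("default_action", "stand_idle")}

def loss_fn(predicted: dict, target: dict) -> float:
    return 0.0 if predicted["action"] == target["action"] else 1.0

def train(dataset: list, epochs: int = 3) -> dict:
    params = {"default_action": "stand_idle"}
    for _ in range(epochs):  # iterate until the training end condition is met
        for sample in map(preprocess_sample, dataset):
            predicted = forward(params, sample)
            loss = loss_fn(predicted, sample["label"])
            # A real implementation would back-propagate the loss and update params here.
    return params

demo = [{"text": " Hands up! ", "expression": "nervous", "label": {"action": "stand_idle"}}]
train(demo)
```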
It should be noted that, in the present application, the device embodiment and the foregoing method embodiment correspond to each other, and specific principles in the device embodiment may refer to the content in the foregoing method embodiment, which is not described herein again.
An electronic device provided in the present application will be described with reference to fig. 9.
Referring to fig. 9, based on the above virtual object action generation method and apparatus, an electronic device 700 capable of executing the foregoing method is further provided in the embodiments of the present application. The electronic device 700 includes one or more (only one is shown) processors 702, a memory 704, and a network module 706 coupled to each other. The memory 704 stores a program capable of executing the contents of the foregoing embodiments, and the processor 702 can execute the program stored in the memory 704.
The processor 702 may include one or more processing cores. The processor 702 connects various portions of the entire electronic device 700 using various interfaces and lines, and performs the various functions of the electronic device 700 and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 704 and by invoking data stored in the memory 704. Optionally, the processor 702 may be implemented in hardware in at least one of digital signal processing (Digital Signal Processing, DSP), field programmable gate array (Field-Programmable Gate Array, FPGA), and programmable logic array (Programmable Logic Array, PLA). The processor 702 may integrate one or a combination of a central processing unit (Central Processing Unit, CPU), a graphics processor (Graphics Processing Unit, GPU), a modem, and the like. The CPU mainly handles the operating system, the user interface, application programs, and the like; the GPU is responsible for rendering and drawing display content; the modem is used to handle wireless communication. It will be appreciated that the modem may also not be integrated into the processor 702 and may instead be implemented by a separate communication chip.
The memory 704 may include a random access memory (Random Access Memory, RAM) or a read-only memory (Read-Only Memory, ROM). The memory 704 may be used to store instructions, programs, code, code sets, or instruction sets. The memory 704 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for implementing at least one function (e.g., a touch function, a sound playing function, an image playing function, etc.), instructions for implementing the foregoing method embodiments, and the like. The data storage area may also store data created by the electronic device 700 in use (e.g., phonebook, audio and video data, chat log data), and the like.
The network module 706 is configured to receive and transmit electromagnetic waves, and implement mutual conversion between electromagnetic waves and electrical signals, so as to communicate with a communication network or other devices, such as an audio playback device. The network module 706 may include various existing circuit elements for performing these functions, such as an antenna, a radio frequency transceiver, a digital signal processor, an encryption/decryption chip, a Subscriber Identity Module (SIM) card, memory, and the like. The network module 706 may communicate with various networks such as the Internet, intranets, wireless networks, or with other devices via wireless networks. The wireless network may include a cellular telephone network, a wireless local area network, or a metropolitan area network. For example, the network module 706 may interact with base stations.
Referring to fig. 10, a block diagram of a computer readable storage medium according to an embodiment of the present application is shown. The computer readable storage medium 800 has stored therein program code that can be invoked by a processor to perform the methods described in the method embodiments above.
The computer readable storage medium 800 may be an electronic memory such as a flash memory, an EEPROM (electrically erasable programmable read only memory), an EPROM, a hard disk, or a ROM. Optionally, the computer readable storage medium 800 comprises a non-volatile computer readable medium (non-transitory computer-readable storage medium). The computer readable storage medium 800 has storage space for program code 810 that performs any of the method steps described above. The program code can be read from or written to one or more computer program products. Program code 810 may be compressed, for example, in a suitable form.
The embodiment of the application provides a virtual object action generation method and device and an electronic device. The method for generating the virtual object action comprises the following steps: acquiring to-be-processed data generated by a target object, wherein the to-be-processed data comprises to-be-processed voice data and a to-be-processed facial image; acquiring prompt information corresponding to the data to be processed; inputting the data to be processed and the prompt information into a pre-trained action generation model, and acquiring target action data corresponding to the data to be processed output by the action generation model; and controlling the virtual object to execute the action corresponding to the target action data. Compared with the prior art in which actions are generated only from text or visual data, the present application generates actions from the to-be-processed voice data, the facial image and the prompt information of the target object, which better fits the actual use scenario and improves the use experience of the user.
The embodiments of the present invention have been described above with reference to the accompanying drawings, but the present invention is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Many forms may be made by those of ordinary skill in the art without departing from the spirit of the present invention and the scope of the claims, and these all fall within the protection of the present invention.
Claims (10)
1. A method of generating a virtual object action, the method comprising:
acquiring to-be-processed data generated by a target object, wherein the to-be-processed data comprise to-be-processed voice data and to-be-processed facial images;
acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules;
inputting the data to be processed and the prompt information into a pre-trained action generation model, and acquiring target action data corresponding to the data to be processed output by the action generation model;
and controlling the virtual object to execute the action corresponding to the target action data.
2. The method according to claim 1, wherein the inputting the data to be processed and the prompt information into a pre-trained action generation model, and obtaining target action data corresponding to the data to be processed output by the action generation model, comprises:
Performing voice conversion on the voice data to be processed to obtain text data to be processed corresponding to the voice data to be processed;
performing expression recognition on the facial image to be processed to obtain expression data to be processed corresponding to the facial image to be processed;
inputting the text data to be processed, the expression data to be processed and the prompt information into the action generation model, and obtaining target action data corresponding to the data to be processed output by the action generation model.
3. The method of claim 2, wherein the action generation model comprises a first prediction module and a second prediction module; the inputting the text data to be processed, the expression data to be processed and the prompt information into the action generation model, and obtaining target action data corresponding to the data to be processed output by the action generation model, comprises:
inputting the text data to be processed and the prompt information into the first prediction module, and obtaining target behavior data output by the first prediction module;
inputting the text data to be processed into the second prediction module, and obtaining target voice data output by the second prediction module;
Inputting the expression data to be processed into the first prediction module or the second prediction module, and obtaining target expression data output by the first prediction module or the second prediction module;
and taking the target behavior data, the target voice data and the target expression data as the target action data.
4. The method of claim 3, wherein the first prediction module comprises a large language model and a behavior selection module; the step of inputting the text data to be processed and the prompt information to the first prediction module, and obtaining the target behavior data output by the first prediction module includes:
inputting the text data to be processed and the prompt information into the large language model, and obtaining the conscious variable output by the large language model;
and inputting the conscious variable into the behavior selection module, and acquiring the target behavior data output by the behavior selection module.
5. The method of claim 3, wherein the second prediction module comprises a large language model and a voice generating module; the inputting the text data to be processed into the second prediction module, and obtaining target voice data output by the second prediction module, comprises:
Inputting the text data to be processed into the large language model, and obtaining the mood variable and the target text data output by the large language model;
and inputting the mood variable and the target text data into the voice generating module to acquire the target voice data output by the voice generating module.
6. The method of claim 3, wherein the inputting the expression data to be processed into the first prediction module or the second prediction module, obtaining the target expression data output by the first prediction module or the second prediction module, comprises:
inputting the expression data to be processed into a large language model included in the first prediction module or the second prediction module, and obtaining the target expression data output by the large language model.
7. The method of claim 4, wherein the prompt information comprises a conscious variable change rule; the inputting the text data to be processed and the prompt information into the large language model, and obtaining the conscious variable output by the large language model, comprises:
inputting the text data to be processed into the large language model, and determining, through the large language model, a plurality of keywords included in the text data to be processed;
If the large language model determines that any keyword in the plurality of keywords hits a conscious variable change item in the conscious variable change rule, acquiring a conscious variable output by the large language model, wherein the conscious variable is a variable of a state field corresponding to the hit conscious variable change item, and the conscious variable change item comprises a mapping relation between the keyword and the state field;
the step of inputting the conscious variable into the behavior selection module and obtaining the target behavior data output by the behavior selection module comprises the following steps:
inputting the conscious variable into the behavior selection module, and acquiring the target behavior data output by the behavior selection module, wherein the target behavior data is obtained by updating the state field included in a preset state table by the behavior selection module based on the conscious variable.
8. The method according to claim 1, wherein before the method, the method further comprises:
acquiring a training data set, wherein the training data set comprises a plurality of prompt messages, a plurality of text data and a plurality of expression data, the text data are text data generated by different objects collected in a plurality of application scenes, the expression data are expression data generated by different objects collected in a plurality of application scenes, the prompt messages are rules set based on the text data and the expression data, and the prompt messages comprise at least one of character background setting rules, talking setting rules, state setting rules, mood setting rules, consciousness variable changing rules and expression action setting rules;
Preprocessing the training data set to obtain a preprocessed training data set;
inputting the preprocessed training data set into a model to be trained, and obtaining action data corresponding to the preprocessed training data set through forward propagation;
obtaining a loss function value based on the action data;
and carrying out iterative training on the model to be trained based on the loss function value until the training ending condition is met, so as to obtain the action generating model.
9. A virtual object action generating apparatus, the apparatus comprising:
a to-be-processed data acquisition unit, configured to acquire to-be-processed data generated by a target object, wherein the to-be-processed data comprises to-be-processed voice data and a to-be-processed facial image;
the prompt information acquisition unit is used for acquiring prompt information corresponding to the data to be processed, wherein the prompt information comprises character setting rules;
the target action data acquisition unit is used for inputting the data to be processed and the prompt information into a pre-trained action generation model and acquiring target action data corresponding to the data to be processed output by the action generation model;
And the action execution unit is used for controlling the virtual object to execute the action corresponding to the target action data.
10. An electronic device, comprising one or more processors and a memory, wherein one or more programs are stored in the memory and configured to be executed by the one or more processors to perform the method of any one of claims 1 to 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311149221.0A CN117315101A (en) | 2023-09-06 | 2023-09-06 | Virtual object action generation method and device and electronic equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN117315101A true CN117315101A (en) | 2023-12-29 |
Family
ID=89254452
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202311149221.0A Pending CN117315101A (en) | 2023-09-06 | 2023-09-06 | Virtual object action generation method and device and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117315101A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||