CN117520498A - Virtual digital human interaction processing method, system, terminal, equipment and medium - Google Patents

Virtual digital human interaction processing method, system, terminal, equipment and medium

Info

Publication number
CN117520498A
CN117520498A (application CN202311473237.7A)
Authority
CN
China
Prior art keywords
text
preset
information
virtual digital
plug
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311473237.7A
Other languages
Chinese (zh)
Inventor
孙进
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yuncong Technology Group Co Ltd
Original Assignee
Yuncong Technology Group Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yuncong Technology Group Co Ltd filed Critical Yuncong Technology Group Co Ltd
Priority to CN202311473237.7A
Publication of CN117520498A
Legal status: Pending

Classifications

    • G10L 15/26 (Speech recognition): Speech to text systems
    • G06F 16/3329 (Information retrieval; query formulation): Natural language query formulation or dialogue systems
    • G06F 16/3344 (Query execution): Query execution using natural language analysis
    • G06F 16/3347 (Query execution): Query execution using vector based model
    • G06F 9/44526 (Program loading or initiating; dynamic linking or loading): Plug-ins; Add-ons
    • G06T 13/40 (Animation): 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G10L 15/18 (Speech classification or search): Speech classification or search using natural language modelling
    • Y02D 10/00 (Climate change mitigation technologies in ICT): Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Human Computer Interaction (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

The invention discloses a virtual digital human interaction processing method, system, terminal, equipment and medium. The method comprises the following steps: receiving input voice information, converting the voice information into text information, and extracting features to obtain a text feature vector; inputting the text feature vector into a preset document vector library for matching, and, if a document vector associated with the text feature vector is matched in the preset document vector library, outputting the document vector; if no document vector associated with the text feature vector is matched in the preset document vector library, responding to the text feature vector with an artificial intelligence large model and outputting a response result. The natural language processing capability of the artificial intelligence large model improves the interactive experience of the virtual digital human, allowing it to converse with the user more naturally, while vector matching ensures accurate output. The interactive capability of the virtual digital human is thereby improved, user demands are better satisfied, and the method is applicable to more scenes.

Description

Virtual digital human interaction processing method, system, terminal, equipment and medium
Technical Field
The invention relates to the technical field of artificial intelligence, and in particular to a virtual digital human interaction processing method, system, terminal, equipment and medium.
Background
Virtual digital humans (Digital Human/Meta Human) are digital figures created with digital technology that closely resemble human figures. With the rapid development of artificial intelligence technology, 5G communication technology and new modes of human interaction, virtual digital humans are appearing more and more frequently and have been applied successfully on many occasions.
Large models are artificial intelligence models trained with hundreds of billions (or more) of parameters and are rapidly being applied across industries. Represented by ChatGPT (Chat Generative Pre-trained Transformer), they have accelerated the popularization of Artificial Intelligence (AI) technology and made AI an indispensable part of work and life. These large model technologies have profoundly affected and changed the development of many industries, are reconstructing enterprise core products, and are changing the way users interact with enterprise products and services.
However, in the related art, although current virtual digital persons can replace a large amount of standardized manual service, they are client models built for a particular type of client, with a standard customer service question-and-answer library, a standard business processing flow, and so on. Whatever type of user is targeted, their performance is one-sided, and they lack an interactive process that effectively combines a large model with the virtual digital person to improve dialogue accuracy and interactive experience.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, the present invention aims to provide a virtual digital person interaction processing method, system, terminal, device and medium, for solving the problem in the prior art that the interactive experience and dialogue matching degree are insufficient due to the lack of an effective combination of a large model and a virtual digital person.
In a first aspect, to achieve the above and other related objects, the present invention provides a virtual digital human interaction processing method, including: receiving input voice information, converting the voice information into text information, and extracting features to obtain a text feature vector; inputting the text feature vector into a preset document vector library for matching; if a document vector associated with the text feature vector is matched in the preset document vector library, outputting the document vector; if no document vector associated with the text feature vector is matched in the preset document vector library, responding to the text feature vector by using an artificial intelligence large model and outputting a response result.
In some embodiments of the first aspect, the method further comprises: according to at least one of the following characteristic data: preset appearance characteristic data, preset action characteristic data, preset expression characteristic data and preset voiceprint characteristic data, correspondingly constructing at least one of the following models of the virtual digital person: an appearance model, an action model, an expression model and a voice model; and integrating the at least one model to obtain the virtual digital person.
In some embodiments of the first aspect, an operation scheduling engine of the virtual digital person is constructed, the operation scheduling engine is used to identify the emotion category of the text to be output and determine the expression actions corresponding to the emotion category, so that the virtual digital person executes the corresponding expressions and actions.
In some embodiments of the first aspect, before the receiving the input voice information, the method further includes: if the preset voice information is monitored or the face image is detected, the virtual digital person is awakened, so that the virtual digital person enters a human-computer interaction state.
In some embodiments of the first aspect, after the face image is detected, the method further includes: comparing the face image with a preset portrait library to obtain user identity information; and, if the user identity information indicates an acquaintance, controlling the eyes of the virtual digital person to follow the user, performing voice synthesis on a preset greeting text with the voice model to generate greeting audio, identifying the emotion category of the preset greeting text, and calling the action model and the expression model to determine the expression actions corresponding to the emotion category, so that the virtual digital person executes the corresponding expressions and actions and plays the greeting audio.
In some embodiments of the first aspect, before the text feature vector is input into the preset document vector library for matching, the method further includes: inputting the text feature vector into a preset sensitive word library for matching; if the text feature vector contains a sensitive word, identifying the category of the sensitive word in the text feature vector and determining a preset sensitive text according to the sensitive word category, so that the virtual digital person responds according to the preset sensitive text; and if the text feature vector does not contain a sensitive word, triggering matching of the text feature vector in the preset document vector library.
In some embodiments of the first aspect, converting the voice information into text information and extracting features to obtain the text feature vector further includes: converting the voice information into text information, performing intention recognition on the text information, and determining the user intention; extracting entities in the text information to obtain entity information; determining a first prompt word by inputting the entity information and the user intention into the artificial intelligence large model, and storing the first prompt word into the text information to form new text information, wherein the first prompt word is text expression meaning information; and extracting features from the new text information to obtain the text feature vector.
In some embodiments of the first aspect, converting the voice information into text information and extracting features to obtain the text feature vector further includes: converting the voice information into text information, performing intention recognition on the text information, and determining the user intention; extracting entities in the text information to obtain entity information; determining a first prompt word by inputting the entity information and the user intention into the artificial intelligence large model, receiving a second prompt word generated by the user in response to the first prompt word, and updating the text information with the second prompt word to form new text information, wherein the first prompt word and the second prompt word are each text expression meaning information; and extracting features from the new text information to obtain the text feature vector.
In some embodiments of the first aspect, responding to the text feature vector with the artificial intelligence large model and outputting a response result further includes: determining the user intention and semantic information corresponding to the text feature vector; loading a preset plug-in and a preset prompt word template associated with the artificial intelligence large model according to the user intention and the semantic information, and determining the plug-in description and plug-in parameters of the preset plug-in; loading the plug-in description and plug-in parameters into the preset prompt word template to form prompt information, wherein the prompt word template carries the plug-in and defines a standard format for input and output; and performing instruction understanding on the prompt information, determining the type of service to be executed and an application program interface, calling the corresponding plug-in according to the service type and the application program interface to respond to the text feature vector, and outputting the response result.
In some embodiments of the first aspect, before loading the preset plug-in and the preset prompt word template associated with the artificial intelligence large model according to the user intention and the semantic information, the method further includes: registering the plug-in, adding a plug-in name, a function description and plug-in parameters to the registered plug-in, and persisting the registered plug-in, wherein the artificial intelligence large model can infer the plug-in name from the function description and can complete the linking of the artificial intelligence large model and the plug-in according to the plug-in name and the plug-in parameters.
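As an illustration only, the following sketch shows one possible plug-in registry with a name, a function description and parameters that are persisted to disk; the registry layout, field names and example plug-in are assumptions and not the claimed implementation.

```python
# A minimal sketch of plug-in registration; all names are illustrative assumptions.
from dataclasses import dataclass, field, asdict
import json

@dataclass
class Plugin:
    name: str                 # plug-in name the large model resolves to
    description: str          # function description the model reasons over
    parameters: dict = field(default_factory=dict)   # plug-in parameters (JSON-schema style)

REGISTRY: dict[str, Plugin] = {}

def register_plugin(plugin: Plugin, path: str = "plugins.json") -> None:
    REGISTRY[plugin.name] = plugin
    with open(path, "w", encoding="utf-8") as f:       # persist the registered plug-ins
        json.dump({n: asdict(p) for n, p in REGISTRY.items()}, f, ensure_ascii=False, indent=2)

register_plugin(Plugin(
    name="weather_query",
    description="Query the weather forecast for a given city and date.",
    parameters={"city": "string", "date": "YYYY-MM-DD"},
))
```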
In some embodiments of the first aspect, the method includes: training a preset deep learning model with preset training samples and adjusting the parameters of the preset deep learning model for optimization until a preset verification condition is reached, so as to obtain at least one artificial intelligence large model.
In some embodiments of the first aspect, the outputting the document vector if the document vector associated with the text feature vector matches within the preset document vector library includes: respectively carrying out similarity calculation on each document vector according to the text feature vector to obtain vector similarity between the text feature vector and each document vector; and determining a document vector associated with the text feature vector from the preset document vector library according to the vector similarity.
In some embodiments of the first aspect, constructing the preset document vector library includes: pre-acquiring initial document data, splitting the initial document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments; and establishing a preset document vector library according to the corresponding relation between the text fragments and the text vectors.
In some embodiments of the first aspect, splitting the initial document data to obtain a plurality of text segments includes:
splitting the initial document data into a plurality of text data blocks according to a preset document splitting granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the initial document data into a plurality of text fragments according to the text splitting position.
In a second aspect, to achieve the above and other related objects, the present invention provides a virtual digital human interaction processing system, comprising: the text conversion module is used for receiving input voice information, converting the voice information into text information and extracting features to obtain text feature vectors; the first interaction module is used for inputting the text feature vector into a preset document vector library for matching; if the document vector which is associated with the text feature vector is matched in the preset document vector library, outputting the document vector; and the second interaction module is used for responding to the text feature vector by utilizing the artificial intelligence large model and outputting a response result if the document vector associated with the text feature vector is not matched in the preset document vector library.
In a third aspect, to achieve the above and other related objects, the present invention provides a mobile terminal comprising: a memory, a processor and a communication component; the memory is used for storing a computer program; the processor is used for executing the computer program to realize the virtual digital human interaction processing method.
In a fourth aspect, to achieve the above and other related objects, the present invention provides a vehicle-mounted terminal, comprising: one or more processors; and one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, implement the virtual digital human interaction processing method described above.
In a fifth aspect, to achieve the above object and other related objects, the present invention further provides a virtual digital human interaction processing apparatus, comprising: one or more processors; and one or more machine-readable media having instructions stored thereon which, when executed by the one or more processors, cause the apparatus to perform one or more of the virtual digital human interaction processing methods described previously.
In a sixth aspect, to achieve the above and other related objects, the present invention also provides one or more machine-readable media having instructions stored thereon which, when executed by one or more processors, cause an apparatus to perform one or more of the virtual digital human interaction processing methods described above.
As described above, the virtual digital human interaction processing method, system, terminal, equipment and medium provided by the invention have the following beneficial effects:
By receiving input voice information, converting the voice information into text information, and extracting features, a text feature vector is obtained; the text feature vector is input into a preset document vector library for matching; if a document vector associated with the text feature vector is matched in the preset document vector library, the document vector is output; if no document vector associated with the text feature vector is matched in the preset document vector library, the artificial intelligence large model is used to respond to the text feature vector and output a response result. According to the invention, the text feature vector is either matched directly in the preset document vector library for output, or answered through the large model. On one hand, the strong natural language processing capability of the artificial intelligence large model improves the interactive experience of the virtual digital person, allowing it to interact with the user more naturally, while vector matching ensures accurate output and further improves the interactive capability of the virtual digital person. On another hand, by combining the artificial intelligence large model, the virtual digital person can interact in real time, better satisfying user demands and suiting more scenes. On yet another hand, the combination with the artificial intelligence large model gives the method strong extensibility, and the performance of the virtual digital person can be continuously improved as data grow and the algorithm is optimized.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments of the present invention will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic diagram of an implementation environment based on a virtual digital human interaction processing method in an exemplary embodiment of the invention;
FIG. 2 is a flow chart of a virtual digital human interaction processing method in an exemplary embodiment of the invention;
FIG. 3 is a flow chart of a virtual digital person construction based on a virtual digital person interaction processing method in an exemplary embodiment of the present invention;
FIG. 4 is a flow chart of sensitive word filtering based on the virtual digital human interaction processing method in an exemplary embodiment of the invention;
FIG. 5 is a schematic diagram of a virtual digital human interaction processing system in accordance with an exemplary embodiment of the present invention;
Fig. 6 is a schematic diagram of a hardware structure of a terminal device in an exemplary embodiment of the present invention;
fig. 7 is a schematic diagram of a hardware structure of a terminal device in an exemplary embodiment of the present invention.
Detailed Description
Further advantages and effects of the present invention will become readily apparent to those skilled in the art from the disclosure herein, by reference to the accompanying drawings and the preferred embodiments. The invention may also be practiced or carried out in other, different embodiments, and the details in this description may be modified or varied in various respects without departing from the spirit and scope of the present invention. It should be understood that the preferred embodiments are presented by way of illustration only and not by way of limitation.
It should be noted that the illustrations provided in the following embodiments merely illustrate the basic concept of the present invention by way of illustration, and only the components related to the present invention are shown in the drawings and are not drawn according to the number, shape and size of the components in actual implementation, and the form, number and proportion of the components in actual implementation may be arbitrarily changed, and the layout of the components may be more complicated.
In the following description, numerous details are set forth in order to provide a more thorough explanation of embodiments of the present invention. It will be apparent, however, to one skilled in the art that embodiments of the present invention may be practiced without these specific details. In other embodiments, well-known structures and devices are shown in block diagram form rather than in detail, in order to avoid obscuring the embodiments of the present invention.
In the related art, although current virtual digital persons can replace a large amount of standardized manual service, they are client models built on the basis of a particular type of client, with a standard customer service question-and-answer library, a standard business processing flow, and so on. Whatever type of user is targeted, their performance is one-sided, and they lack an interactive process that effectively combines a large model with the virtual digital person to improve conversation accuracy and interactive experience.
In view of the defects in this scenario, embodiments of the present application respectively provide a virtual digital human interaction processing method, a virtual digital human interaction processing system, a mobile terminal, a vehicle-mounted terminal, virtual digital human interaction processing equipment and a computer-readable storage medium, which can overcome the defect in the prior art that the interactive experience and dialogue matching degree are insufficient due to the lack of an effective combination of a large model and a virtual digital person.
Fig. 1 is a schematic diagram of an implementation environment of a virtual digital human interaction processing method according to an exemplary embodiment of the present application. Referring to fig. 1, the implementation environment includes at least one of the following: a device and a terminal, where the device is an electronic device, an intelligent terminal or the like, and the terminal includes at least one of the following: a mobile terminal and a vehicle-mounted terminal. The following program is executed in the terminal or device: receiving input voice information, converting the voice information into text information, and extracting features to obtain a text feature vector; inputting the text feature vector into a preset document vector library for matching; if a document vector associated with the text feature vector is matched in the preset document vector library, outputting the document vector; if no document vector associated with the text feature vector is matched in the preset document vector library, responding to the text feature vector with an artificial intelligence large model and outputting a response result.
It should be understood that the terminal includes at least one of the following: a mobile terminal and a vehicle-mounted terminal, and the equipment includes at least one of the following: an electronic device, an intelligent terminal and a server. The terminal includes at least one of the following: smartphones, tablets, notebook computers and the like, and the user input interface includes, but is not limited to, a touch screen, a keyboard, physical keys, an audio pickup and the like.
The server may be a server providing various services, such as a file server, a Web server, a mail server, a database server, an application server, a game server, a DNS server, a proxy server, a media server, and the like. The server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), big data and artificial intelligence platforms, which are not limited herein.
Referring to fig. 2, fig. 2 is a flowchart of a virtual digital human interaction processing method according to an exemplary embodiment of the present application. It should be understood that the virtual digital human interaction processing method may also be applied to other exemplary implementation environments and be specifically executed by devices in those environments; this embodiment does not limit the implementation environments to which the method is applied. The virtual digital human interaction processing method includes at least steps S201 to S203, described in detail as follows:
step S201, receiving input voice information, converting the voice information into text information, and extracting features to obtain text feature vectors;
Specifically, the voice information of a user is collected through the microphone of a terminal such as a smartphone, tablet, notebook computer or desktop computer, the voice information is converted into text information, and then feature extraction is performed on the text information to obtain a text feature vector. For example, the voice information of the user is acquired through a recording or voice input device using a voice recognition technology, and the acquired voice information is converted into text information by a voice recognition engine. The converted text information is preprocessed, including operations such as word segmentation, stop word removal and special symbol removal, in preparation for subsequent feature extraction. Features are then extracted from the preprocessed text information, which may include text features such as word frequency, word length, part of speech and named entities. The extracted features may be weighted using a method such as TF-IDF (Term Frequency-Inverse Document Frequency) and converted into numerical vector form, and these numerical vectors are used to construct the text feature vector.
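As an illustrative sketch of this step (not part of the invention), the snippet below turns recognized text into a TF-IDF feature vector using scikit-learn; the reference corpus, stop-word handling and all identifiers are assumptions chosen only for the example.

```python
# A minimal sketch of step S201 (assumed tooling: scikit-learn; the corpus and
# stop-word list are placeholders, not part of the patent).
from sklearn.feature_extraction.text import TfidfVectorizer

# Text already produced by the speech-recognition engine (assumed input).
recognized_text = "where is the nearest gas station"

# A small reference corpus used only to fit the TF-IDF vocabulary (placeholder).
corpus = [
    "opening hours of the gas station",
    "how do I order a pizza",
    "show me the weather forecast",
]

vectorizer = TfidfVectorizer(stop_words="english")   # preprocessing: stop-word removal
vectorizer.fit(corpus)

# Numerical text feature vector for the recognized utterance.
text_feature_vector = vectorizer.transform([recognized_text]).toarray()[0]
print(text_feature_vector.shape, text_feature_vector.max())
```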
Step S202, inputting the text feature vector into a preset document vector library for matching; if the document vector which is associated with the text feature vector is matched in the preset document vector library, outputting the document vector;
specifically, similarity calculation is carried out on each document vector according to the text feature vector, so that vector similarity between the text feature vector and each document vector is obtained; and determining a document vector associated with the text feature vector from the preset document vector library according to the vector similarity.
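A minimal sketch of this matching step is shown below, assuming cosine similarity and a fixed similarity threshold; the threshold value and all names are illustrative assumptions rather than the patent's specification.

```python
# Cosine similarity between the text feature vector and every document vector;
# an unmatched query (best similarity below the threshold) falls back to the large model.
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def match_document(text_vec: np.ndarray,
                   doc_vectors: dict[str, np.ndarray],
                   threshold: float = 0.8):
    """Return the best-matching document id, or None if nothing is similar enough."""
    best_id, best_sim = None, 0.0
    for doc_id, doc_vec in doc_vectors.items():
        sim = cosine_similarity(text_vec, doc_vec)
        if sim > best_sim:
            best_id, best_sim = doc_id, sim
    return best_id if best_sim >= threshold else None   # None -> fall back to step S203
```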
The construction of the preset document vector library includes: pre-acquiring initial document data, splitting the initial document data to obtain a plurality of text fragments (Chunks), and generating a text vector corresponding to each text fragment; and establishing the preset document vector library according to the correspondence between the text fragments and the text vectors.
Splitting the initial document data to obtain a plurality of text fragments includes: splitting the initial document data into a plurality of text data blocks according to a preset document splitting granularity; determining text segmentation positions in each text data block according to preset segmentation characters; and splitting the initial document data into a plurality of text fragments according to the text segmentation positions.
In some embodiments, the document segmentation granularity includes at least one of: pages, paragraphs, etc., or a predetermined text length is used as the document segmentation granularity, for example, a predetermined text length of 2000 tokens.
In some embodiments, the segmentation characters include at least one of: periods, question marks, exclamation marks, etc.
In this way, compared with splitting directly by the document segmentation granularity alone, determining the text segmentation positions based on segmentation characters preserves the completeness of each text fragment, so that text feature vector matching is more accurate and the accuracy of the answers output by the virtual digital person is improved.
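The splitting strategy described above can be sketched as follows; the segmentation characters, the granularity value (a character budget standing in for the 2000-token example) and the function names are illustrative assumptions.

```python
# A minimal sketch of document splitting: blocks bounded by a preset granularity
# and cut only at sentence-ending characters so fragments stay complete.
import re

CUT_CHARS = "。？！.?!"          # preset segmentation characters (assumed set)
MAX_CHUNK_LEN = 2000             # preset splitting granularity (assumed unit: characters)

def split_document(text: str) -> list[str]:
    sentences = re.split(f"(?<=[{re.escape(CUT_CHARS)}])", text)
    chunks, current = [], ""
    for sentence in sentences:
        if current and len(current) + len(sentence) > MAX_CHUNK_LEN:
            chunks.append(current)        # close the chunk at a sentence boundary
            current = ""
        current += sentence
    if current:
        chunks.append(current)
    return chunks
```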
Step S203, if the document vector associated with the text feature vector is not matched in the preset document vector library, responding to the text feature vector by using an artificial intelligence large model, and outputting a response result.
It should be appreciated that, when no document vector associated with the text feature vector is matched in the preset document vector library, a plug-in may be invoked through the artificial intelligence large model to extend the response.
Specifically, training a preset deep learning model by using a preset training sample, and adjusting parameters of the preset deep learning model to optimize until a preset verification condition is reached, so as to obtain at least one artificial intelligent large model.
Here, it should be noted that the AI large model may include any type of machine learning model, which may have various applications in many fields, such as natural language processing, autonomous vehicles, image processing, deep learning robots, automatic machine translation, automatic handwriting generation. The AI large model may have any type of deep learning architecture, such as deep neural networks, recurrent neural networks, and convolutional neural networks.
A simple neural network may comprise several layers: one for receiving an input signal, another for transmitting an output signal, and one or more hidden or processing layers located between the input and output layers. In deep neural networks constructed to generate one or more inferences, there may be many hidden layers made up of artificial neurons. Such a neuron may include an activation function, a constant input, other inputs, and an output. The neuron generates its output by applying the activation function to a weighted version of its inputs, each input being weighted according to its respective weight. For example, the inputs may include normalized data. The activation function may be configured to receive a single number (e.g., a linear combination of weighted inputs) based on all inputs and perform a fixed operation such as a sigmoid, tanh, or rectified linear unit. The constant input may be a constant value.
A single neuron may not itself do so much, and a useful AI large model typically includes a combined computational effort of a large number of neurons working cooperatively. For example, a deep neural network may include a plurality of neurons that are assembled in layers and connected in a cascade. These layers may include an input layer, an output layer, and some hidden layers in between. The output of each layer of neurons may be weighted according to a weight and then serve as input to the neurons in the next layer. Other interconnect strategies known in the art may be employed.
Neurons of the input layer may be configured to receive normalized or otherwise feature engineered or processed data corresponding to user data. The output of each neuron of the input layer or hidden layer may be weighted according to the weight of its corresponding output edge and then applied as input at each neuron of the next layer. The output(s) of the output layer include the outputs of DNN or AI large models. In the context of reasoning, such output may be reasoning(s) or prediction(s). Construction of such DNNs is simply the beginning of generating a useful machine learning or AI large model. The accuracy of the reasoning generated by such AI large models requires the selection of a suitable activation function, and then each weight of the entire model is adjusted to provide an accurate output. The process of adjusting such weights is referred to as "training". Training DNNs or other types of networks requires a set of training data with known characteristics. For example, if the DNN is intended to predict the probability that an input image of an animal is a cat, the training data will include images of many different cats, and typically include images of not only cats, but also other similar animals. Training entails preprocessing the image data corresponding to each image according to normalization and/or feature extraction techniques known in the art to produce input features for the DNN, which are then provided as inputs to the network, e.g., as inputs to neurons of the input layer.
Thereafter, each neuron of a layer performs its respective activation operation, and its output is weighted and fed forward in the forward pass to the next layer until the output(s) of the DNN are generated by the output layer. The output(s) of the DNN may be compared with the known or expected values of the outputs, and the differences may be fed backward through the backward pass of the DNN to adjust the weights contained therein according to back-propagation algorithms known in the art. With the AI large model including the updated weights, image features can again be input to the model and new outputs generated. Training involves iterating the AI large model over the training dataset and updating the weights at each iteration. Once the AI large model has reached sufficient accuracy, or its output has converged and weight changes have little effect, the AI large model is said to have been trained. The trained model can then be used to evaluate input data whose nature was not known in advance and which was not previously considered (e.g., a new picture of an animal), and to output the desired inference (e.g., the probability that the image is that of a cat).
Gradient descent is an algorithm often used to train AI large models. Gradient descent involves an objective function (e.g., a loss function or cost function), of which many are possible, and the goal is to minimize the function. The objective function is used to monitor the error in the predictions of the AI large model; by minimizing it, the lowest error value can be found and the accuracy of the AI large model improved. Stochastic gradient descent (SGD) is a variant of the gradient descent algorithm that calculates the error and updates the model for each sample in the training dataset. SGD updates frequently and learns quickly, but is computationally expensive and may take longer to train on a large dataset. Batch SGD is another variant that calculates the error for each sample of the training dataset but updates the AI large model only after the entire dataset has been processed (i.e., at the end of the training epoch). Batch SGD updates are less frequent and more computationally efficient than SGD. The separation of prediction-error computation and model updating in batch SGD makes the algorithm suitable for parallel implementations, but updating only at the end of the training epoch adds the complexity of accumulating prediction errors across the entire dataset and is typically implemented in a way that requires the entire training dataset to be in memory and available to the algorithm. Mini-batch SGD is yet another variation of SGD that splits the training dataset into small batches that are used to calculate model errors and update parameters. The implementation can sum the gradients over the mini-batch, further reducing the variance of the gradient estimate. Mini-batch SGD thus strikes a balance between SGD and batch SGD. It requires an additional "mini-batch size" hyperparameter to be configured for the learning algorithm, and error information is accumulated across the training examples of a mini-batch. The mini-batch size may be chosen to suit the computing architecture on which the AI large model is executed, e.g., a power of 2 such as 32, 64, 128 or 256 that fits the memory of the target device or accelerator hardware. The batch size can be used to tune the learning process: smaller values give a learning process that converges quickly at the cost of noise in training, while larger values give a learning process that converges slowly but with an accurate estimate of the error gradient.
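As a concrete illustration of mini-batch SGD (not the patent's training procedure), the sketch below runs a generic PyTorch training loop over toy data with a mini-batch size of 64; the network, data and hyperparameters are placeholders.

```python
# A minimal mini-batch SGD training loop (PyTorch assumed; toy data and network).
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

x = torch.randn(1024, 16)                 # placeholder training features
y = torch.randint(0, 2, (1024,))          # placeholder labels
loader = DataLoader(TensorDataset(x, y), batch_size=64, shuffle=True)  # mini-batch size 64

model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 2))
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
loss_fn = nn.CrossEntropyLoss()           # objective function monitored during training

for epoch in range(5):                    # iterate until a preset validation condition is met
    for xb, yb in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(xb), yb)     # forward pass
        loss.backward()                   # back-propagate the error
        optimizer.step()                  # update weights for this mini-batch
```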
In some embodiments, please refer to fig. 3, which is a flow chart of a virtual digital person construction based on a virtual digital person interaction processing method in an exemplary embodiment of the present invention; comprising the following steps:
step S301, according to at least one of the following feature data: preset appearance characteristic data, preset action characteristic data, preset expression characteristic data and preset voiceprint characteristic data, and correspondingly constructing at least one of the following models of the virtual digital person: and integrating at least one model to obtain the virtual digital person.
Specifically, for example, the appearance model of the virtual digital person is created with 3D modeling software (such as Blender): first, a basic model is created according to the preset appearance feature data; then, by adjusting parameters such as texture and color, the model is made more lifelike. As another example, the action model may use motion capture techniques (e.g., marker-based motion capture) to capture the motion of a real person and convert it into the motion of the virtual digital person; the motion data may be stored as an animation sequence and then played back in a virtual environment or on a display to produce realistic motion effects. As a further example, the preset expression feature data is used to drive the facial expressions of the virtual digital person: the facial expressions of a real person may be captured using a facial capture technique (e.g., infrared imaging) and converted into expressions of the virtual digital person, and these expression data are then applied to the facial model of the virtual digital person to produce realistic expression effects. As yet another example, the voice model generates the voice of the virtual digital person from the preset voiceprint feature data; a speech synthesis technique (such as Text-to-Speech) may be used to convert text into speech and apply it to the virtual digital person, and the voice can be made to sound more natural by adjusting parameters such as speed, pitch and timbre.
The constructed appearance model, action model, expression model and voice model are integrated together, so that the virtual digital person can be displayed, for example, the virtual digital person can be run on a computer and displayed on a screen, or the virtual digital person can be imported into a virtual reality environment, so that a user can interact with the virtual digital person.
Step S302, an operation scheduling engine of the virtual digital person is constructed, emotion categories of texts to be output are identified by the operation scheduling engine, and expression actions corresponding to the emotion categories are determined, so that the virtual digital person executes corresponding expressions and actions.
In this embodiment, by calling the operation scheduling engine, expression actions matching the corresponding emotion category can be loaded according to the emotion category carried by the text to be output (i.e., the document vector), so that the virtual digital person outputs the corresponding expressions and actions. In this way, matched actions and expressions are combined with voice and appearance for multi-information fusion, and the virtual digital person can express itself vividly and emotionally like a real human, improving its emotional authenticity and realism.
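A minimal sketch of such an operation scheduling engine is given below; the keyword-based emotion classifier and the expression/action clip names are placeholder assumptions, since the actual emotion recognizer is not specified here.

```python
# Classify the emotion of the text to be output and look up matching
# expression/action clips; all lexicons and clip names are illustrative.
EMOTION_KEYWORDS = {
    "happy": ["glad", "great", "congratulations"],
    "sad": ["sorry", "unfortunately", "regret"],
}
EXPRESSION_ACTIONS = {
    "happy": {"expression": "smile", "action": "wave_hand"},
    "sad": {"expression": "frown", "action": "lower_head"},
    "neutral": {"expression": "neutral", "action": "idle"},
}

def schedule(text_to_output: str) -> dict:
    lowered = text_to_output.lower()
    emotion = "neutral"
    for label, words in EMOTION_KEYWORDS.items():
        if any(w in lowered for w in words):
            emotion = label
            break
    return {"emotion": emotion, **EXPRESSION_ACTIONS[emotion]}

print(schedule("Great, congratulations on your new account!"))
```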
In some embodiments, before receiving the input voice information, the method further includes: if the preset voice information is monitored or the face image is detected, the virtual digital person is awakened, so that the virtual digital person enters a human-computer interaction state.
Specifically, voice information is monitored in real time; if the preset voice information is heard, or if a legally recorded face image (a face image collected with the target user's authorization) is received in real time and the preset target face image is detected, this is treated as a wake-up signal. When the wake-up signal is received, the virtual digital person enters the human-computer interaction state; after waking up, the virtual digital person may open its eyes, move its body, speak, and so on, so that human-computer interaction with the user can then proceed.
Through the mode, the virtual digital person is awakened, so that the virtual digital person is in a working state, and human-computer interaction with a user is facilitated.
In some embodiments, after the face image is detected, the method further includes: comparing the face image with a preset portrait library to obtain user identity information; and, if the user identity information indicates an acquaintance, controlling the eyes of the virtual digital person to follow the user, performing voice synthesis on a preset greeting text with the voice model to generate greeting audio, identifying the emotion category of the preset greeting text, and calling the action model and the expression model to determine the expression actions corresponding to the emotion category, so that the virtual digital person executes the corresponding expressions and actions and plays the greeting audio.
For example, emotion analysis of the text is performed by natural language processing technology to judge the emotion category of the text (such as happiness, sadness or anger), and the action model and expression model are called to determine the corresponding expression actions according to the identified emotion category. For instance, if the identified emotion category is happiness, the virtual digital person can display happy expressions and actions by invoking the action model and expression model associated with happiness, and play the generated greeting audio while keeping its eyes tracking the acquaintance, so that the virtual digital person can run in a virtual environment or on a display screen and the user can see and hear its reactions.
In this way, if the current user is detected to be an acquaintance, the eyes of the virtual digital person are adjusted to follow the user, i.e., the virtual digital person's gaze is directed toward the user, so that the user is greeted and subsequent communication can take place.
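The acquaintance-greeting flow can be sketched as follows, assuming that face embeddings are produced elsewhere by a face-recognition module; the portrait library, threshold and returned plan fields are illustrative assumptions only, and the actual TTS and animation calls of the system are not modelled here.

```python
# Compare a face embedding against a preset portrait library and build a greeting plan.
import numpy as np

PORTRAIT_LIBRARY = {                       # preset portrait library: user id -> face embedding
    "alice": np.array([0.1, 0.9, 0.3]),
    "bob":   np.array([0.8, 0.2, 0.5]),
}

def identify(face_embedding: np.ndarray, threshold: float = 0.9):
    """Cosine-similarity lookup in the portrait library; None means stranger."""
    best, best_sim = None, 0.0
    for user, emb in PORTRAIT_LIBRARY.items():
        sim = float(face_embedding @ emb /
                    (np.linalg.norm(face_embedding) * np.linalg.norm(emb)))
        if sim > best_sim:
            best, best_sim = user, sim
    return best if best_sim >= threshold else None

def greeting_plan(face_embedding: np.ndarray) -> dict:
    user = identify(face_embedding)
    if user is None:
        return {"state": "idle"}                        # stranger: no greeting
    return {
        "state": "greet",
        "eye_tracking": user,                           # gaze follows the recognized user
        "greeting_text": f"Hello {user}, nice to see you again.",
        "emotion": "happy",                             # drives the expression/action models
    }

print(greeting_plan(np.array([0.12, 0.88, 0.31])))
```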
Referring to fig. 4, a flowchart of filtering sensitive words based on a virtual digital human interaction processing method according to an exemplary embodiment of the present invention includes:
Step S401, inputting the text feature vector into a preset sensitive word library for matching;
Specifically, the preset sensitive word library is a preset word library, wherein various possible sensitive words or phrases are included, and the sensitive word library can be used for checking whether the input text feature vector contains the sensitive words or not.
Step S402, if the text feature vector contains sensitive words, identifying the types of the sensitive words in the text feature vector, and determining preset sensitive texts according to the types of the sensitive words so that virtual digital people respond according to the preset sensitive texts;
specifically, if the text feature vector contains a sensitive word, the sensitive word category in the text feature vector needs to be identified, and a preset sensitive text is determined according to the identified sensitive word category; the virtual digital person then responds according to the preset sensitive text, either automatically or manually.
Step S403, if the text feature vector does not contain the sensitive word, triggering the text feature vector to match in a preset document vector library.
Specifically, if the text feature vector does not contain a sensitive word, matching of the text feature vector in the preset document vector library is triggered; the purpose of this step is to find a preset document that matches the input text, after which the virtual digital person responds according to the preset document.
By the method, when the user inputs the text, whether the text contains the sensitive word is automatically detected, and the corresponding response text is automatically generated according to the type of the sensitive word or other matching conditions, so that the virtual digital person is helped to interact with the user more naturally and accurately.
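A minimal sketch of this sensitive-word check is shown below; the word lists and the preset sensitive texts are placeholders, not the actual lexicon of the system.

```python
# Check the input text against a preset sensitive word library and, on a hit,
# return the preset sensitive text for that category; None means proceed to matching.
SENSITIVE_LEXICON = {
    "violence": {"attack", "weapon"},
    "privacy": {"password", "id number"},
}
PRESET_SENSITIVE_TEXT = {
    "violence": "I'm sorry, I can't discuss that topic.",
    "privacy": "For your safety, please don't share private information with me.",
}

def check_sensitive(text: str):
    lowered = text.lower()
    for category, words in SENSITIVE_LEXICON.items():
        if any(w in lowered for w in words):
            return PRESET_SENSITIVE_TEXT[category]      # respond with the preset sensitive text
    return None                                          # no sensitive word: go on to vector matching

print(check_sensitive("What is my neighbour's password?"))
```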
In other embodiments, converting the voice information into text information and extracting features to obtain the text feature vector further includes: converting the voice information into text information, performing intention recognition on the text information, and determining the user intention; extracting entities in the text information to obtain entity information; determining a first prompt word by inputting the entity information and the user intention into the artificial intelligence large model, and storing the first prompt word into the text information to form new text information, wherein the first prompt word is text expression meaning information; and extracting features from the new text information to obtain the text feature vector.
Specifically, commonly used entity relation extraction methods include rule-based methods and machine-learning-based methods, which are used to extract the entities in the text information and obtain entity information. For example, if the text information is the query "Where is the nearest gas station?", it can be regarded as an instance of the intention "find a gas station", in which "gas station" can be regarded as an entity.
In addition, optionally, a text corpus can be collected to construct an entity dictionary, yielding a preliminary, reasonably accurate entity dictionary; an entity recognition model is then trained from the entity dictionary and corrected through a preset algorithm. Positive and negative text corpora are constructed: during correction, entities judged to be non-entities in their semantic environment, together with the text in which they appear, are taken as negative corpora, and the negative samples are retrieved again using keywords in the text corpus, thereby improving the precision and recall of entity recognition.
Optionally, in some embodiments, the text information is identified using the artificial intelligence large model, and the user intent is determined, e.g., trained using an artificial intelligence large model to be an intent recognition model, specifically for recognizing the user intent of the text information.
Alternatively, the artificial intelligence large model may also employ the idea of constructing a generic model to cooperatively identify user intentions with a plurality of expert models, for example, by identifying generic intentions of text information in a large category through the generic model, and by identifying sub-intentions under the generic intentions through the plurality of expert models.
Specifically, the text information is input into the general model and the plurality of expert models respectively to obtain intention probability recognition sequences PB and PSi, where PSi is the intention probability recognition sequence corresponding to the i-th expert model. The intention probabilities corresponding to the same intention in PB and the PSi are normalized, and finally the intention with the highest probability in the normalized sequence is used as the intention output for the text information.
The general model and the expert models each use a BERT model based on bidirectional transformer encoding; the BERT model comprises N layers of feature encoders, and each layer of feature encoder is connected to a classifier. A bidirectionally trained language model understands context more deeply than a unidirectional language model, so the text can be processed more accurately and intention recognition is better.
In this embodiment, intention recognition fully considers both the recognition result for the general intention at the category level and the recognition results for sub-intentions within certain categories, and combines the general intention recognized by the general model with the sub-intentions recognized by the plurality of expert models to determine the user's final intention. In this way, the user's actual intention can be quickly distinguished from similar intentions according to the sub-intention recognition results, improving the accuracy of intention recognition.
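One possible reading of this fusion rule is sketched below, with illustrative intention labels and probabilities; the general model and expert models themselves are not implemented, only the normalization and selection step.

```python
# Fuse the general model's intention probabilities (PB) with several expert
# models' probabilities (PS_i) and pick the highest-probability intention.
from collections import defaultdict

PB = {"order_food": 0.55, "navigation": 0.30, "chitchat": 0.15}       # general model
PS = [
    {"order_food": 0.70, "navigation": 0.20},                         # expert model 1
    {"navigation": 0.60, "chitchat": 0.40},                           # expert model 2
]

def fuse_intents(pb: dict, ps_list: list) -> str:
    scores = defaultdict(float)
    for intent, p in pb.items():
        scores[intent] += p
    for ps in ps_list:
        for intent, p in ps.items():
            scores[intent] += p
    total = sum(scores.values())
    normalized = {intent: s / total for intent, s in scores.items()}   # normalize over the same intentions
    return max(normalized, key=normalized.get)                         # highest-probability intention wins

print(fuse_intents(PB, PS))
```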
Specifically, the large model is used again to make a prediction according to the association between the entity information and the user intention and to determine the first prompt word, where the first prompt word is instruction expression meaning information. For example, the first prompt word may be determined according to the task type, the main object, and the consistency and sensitivity of the text sentence, and the first prompt word includes but is not limited to: region, time, document context, and the like.
For example, natural language processing techniques may be used to extract text features and machine learning algorithms may be used to predict the hint words associated with the entity and intent.
For another example, if the voice information input by the user is "I need a pizza", it may be predicted that the prompt words associated with the "order food" intention and the "pizza" entity are "Pizza from which store?" or "Can it be delivered to the door?". The most suitable prompt word is then selected according to the prediction result and the context to guide the conversation flow. For example, if the prediction result is "Pizza from which store?", nearby pizza shops may be recommended to the user.
In this way, the first prompt word becomes part of the task instruction (namely, the converted text information) to be queried, and the task instruction (namely, the target question) is supplemented by the prompt information, which makes the target question more targeted. When the target question and the prompt information are input into the large model, the model can generate a question answer jointly from its training samples and the prompt information, improving the accuracy with which the interaction system outputs answers and making the reasoning or prediction of the artificial intelligence large model on the query more accurate.
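A hedged sketch of how a first prompt word might be predicted from the recognized intent and entities is given below; llm_complete stands in for whatever large-model interface is actually used, and the instruction text is invented for illustration.

```python
def predict_first_prompt(intent: str, entities: list[str], llm_complete) -> str:
    """Ask the large model for one clarifying prompt word."""
    instruction = (
        "Given the user intent and entities, propose one short follow-up "
        "prompt (e.g. missing region, time, or document context) that would "
        "make the request answerable.\n"
        f"Intent: {intent}\nEntities: {', '.join(entities)}\nPrompt:"
    )
    return llm_complete(instruction).strip()

# e.g. predict_first_prompt("order food", ["pizza"], llm_complete) might
# return "Which store's pizza do you want, and should it be delivered?"
```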
In other embodiments, converting the voice information into text information and performing feature extraction to obtain text feature vectors further includes:
converting the voice information into text information, performing intention recognition on the text information, and determining the user intention; extracting an entity from the text information to obtain entity information; inputting the entity information and the user intention into the artificial intelligence large model to determine a first prompt word, receiving a second prompt word generated by the user in response to the first prompt word, and updating the text information with the second prompt word to form new text information, where the first prompt word and the second prompt word are each information expressing the meaning of the text; and extracting features of the new text information to obtain text feature vectors.
It should be appreciated that, compared with the above manner, the second prompt word further refines the prompt information and reflects the user's description of the text information (intention and requirement); that is, the second prompt word is more precise than the first prompt word. Both belong to user-facing prompt words and can be viewed by the user.
In particular, using the more precise second prompt word helps the large model better understand the user's intent and needs and match the corresponding answer more accurately. For example, when a user inputs a relatively general question, the large model may fail to understand the user's intent accurately, so the answer misses the point. With the second prompt word, the user's intent can be described more specifically and the large model can be assisted in matching the corresponding answer more accurately.
In this way, the text information is updated with the second prompt word, so that the question expressed by the text information is more precise and the accuracy of large model prediction or reasoning is improved.
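A minimal sketch of this first-prompt/second-prompt exchange follows; ask_user is a placeholder for the real interaction channel, and appending the second prompt word in parentheses is only one possible way of "updating" the text information.

```python
def refine_text_with_user(text: str, first_prompt: str, ask_user) -> str:
    # The user sees the first prompt word and answers it; the answer is
    # treated as the more precise second prompt word.
    second_prompt = ask_user(first_prompt)
    # The text information is updated with the second prompt word to form
    # the new text information that is subsequently vectorized.
    return f"{text} ({second_prompt})"

# e.g. refine_text_with_user("I need a pizza", "Which store's pizza?", input)
```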
Optionally, on the basis of obtaining the prompt word in the above embodiment, the method further includes:
if it is detected that the text data corresponding to the prompt word exceeds a preset length, splitting the text data corresponding to the prompt word to obtain a plurality of text fragments, and generating text vectors, fragment identifiers, and text keywords corresponding to the text fragments; establishing a knowledge vector database according to the correspondence among the text fragments, the text vectors, the fragment identifiers, and the text keywords; acquiring a target question (i.e., a task instruction); generating a question vector corresponding to the target question; determining a target fragment corresponding to the target question from the text fragments of the knowledge vector database according to the text vector matching the question vector; and generating a question prompt word corresponding to the text information according to at least a part of the target fragment.
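The following self-contained sketch illustrates this knowledge-vector-database flow; the hash-based embedding, the 512-character limit, the separator set, and the crude keyword extraction are placeholders rather than the implementation implied by this application.

```python
import hashlib

def embed(text: str) -> list[float]:
    # Placeholder embedding: a real system would call a sentence encoder.
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:16]]

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

def build_knowledge_db(text: str, max_len: int = 512, seps: str = "。！？.!?\n"):
    """Split over-long prompt text at separator characters and index it."""
    fragments, buf = [], ""
    for ch in text:
        buf += ch
        if ch in seps and len(buf) >= max_len:   # cut only at separators
            fragments.append(buf)
            buf = ""
    if buf:
        fragments.append(buf)
    return [
        {"id": i, "text": frag, "vector": embed(frag),
         "keywords": frag.split()[:5]}           # crude keyword stand-in
        for i, frag in enumerate(fragments)
    ]

def question_prompt(db: list[dict], question: str) -> str:
    qv = embed(question)
    target = max(db, key=lambda row: cosine(qv, row["vector"]))
    # The question prompt word is built from part of the target fragment,
    # here simply its keywords, to respect the model's input limit.
    return " ".join(target["keywords"])
```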
In this embodiment, the knowledge vector database is queried with the question vector corresponding to the target question to obtain the prompt information corresponding to the target question, and the target question and the prompt information are then input into the AI large model to generate the answer to the target question. Determining the prompt information of the target question from the knowledge vector database and supplementing the target question with it makes the target question more targeted; when both are input into the AI large model, the model can generate the answer jointly from its training samples and the prompt information, improving the accuracy with which the AI large model outputs answers.
In this way, compared with splitting directly at a fixed document segmentation granularity, determining the text segmentation positions from segmentation characters preserves the integrity of each text fragment, so the prompt information of the target question is more accurate and the question-answering system outputs more accurate answers;
because the AI large model has an input length limit, text keywords are extracted from the text fragments and the question prompt words corresponding to the target question are generated from those keywords; the keywords are much smaller than the whole text fragments, so the input limit of the AI large model is satisfied while the prompt is more targeted, improving the question-answering efficiency of the large model. In addition, the method and system not only generate answers for the current target question but also generate question recommendation information from the history records associated with the target question, providing more information to users and improving the comprehensiveness of the large model's output.
In some embodiments, responding to the text feature vector with the artificial intelligence large model and outputting a response result further includes:
determining user intention and semantic information corresponding to the text feature vector;
loading a preset plug-in and a preset prompt word template associated with the artificial intelligence large model according to the user intention and the semantic information, and determining the plug-in description and plug-in parameters of the preset plug-in;
loading the plug-in description and plug-in parameters into the preset prompt word template to form prompt information; the prompt word template carries the plug-in and defines the standard format of input and output;
and carrying out instruction understanding on the prompt information, determining the type of the service to be executed and an application program interface, calling a corresponding plug-in unit to respond to the text feature vector according to the type of the service and the application program interface, and outputting a response result.
Specifically, it should be noted that the text information here is the text updated with the second prompt word, or text that includes the first prompt word. The semantic information is determined by sequentially performing word segmentation, part-of-speech tagging, and syntactic analysis on the text sentences corresponding to the text information, with the semantic information derived from the part-of-speech tags and the syntactic analysis.
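One possible toolchain for the word segmentation and part-of-speech tagging steps is sketched below; jieba is only an example segmenter (the text does not name a specific tool), and the syntactic-analysis step is abbreviated to a comment.

```python
import jieba.posseg as pseg  # jieba: a common Chinese segmenter / POS tagger

def basic_semantic_info(sentence: str) -> dict:
    tagged = [(pair.word, pair.flag) for pair in pseg.cut(sentence)]
    # A dependency or constituency parse would follow here to complete the
    # syntactic-analysis step; it is omitted from this sketch.
    return {"tokens": [w for w, _ in tagged], "pos_tags": tagged}
```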
Specifically, the plug-in library is matched against the user intention and the semantic information, for example by character similarity matching, to determine the associated preset plug-in to load, and the plug-in description and plug-in parameters of that preset plug-in are determined at the same time.
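A minimal character-similarity match against a plug-in library might look as follows; the use of difflib and the 0.6 threshold are assumptions made for illustration.

```python
import difflib

def match_plugin(query: str, plugin_library: dict[str, str],
                 threshold: float = 0.6) -> str | None:
    """plugin_library maps plug-in name -> plug-in description."""
    best_name, best_score = None, 0.0
    for name, description in plugin_library.items():
        score = difflib.SequenceMatcher(None, query, f"{name} {description}").ratio()
        if score > best_score:
            best_name, best_score = name, score
    return best_name if best_score >= threshold else None
```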
A related preset prompt word template is then called according to the user intention and the semantic information, and the plug-in description and plug-in parameters are loaded into the preset prompt word template to form the prompt information; the prompt word template carries the plug-in and defines the standard input and output format.
Specifically, the prompt word template library is matched against the user intention and the semantic information to select a preset prompt word template, and the prompt information (a plug-in or application prompt word) is generated in that template. The prompt information contains the plug-in description and plug-in parameters so that the large model can accurately call the plug-in according to it; this prompt information is a system prompt and cannot be viewed by the user.
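A sketch of loading the plug-in description and parameters into a preset prompt word template follows; the template wording and the JSON output convention are invented for illustration.

```python
PROMPT_TEMPLATE = (
    "You may call the plug-in described below. Reply only with JSON of the "
    'form {{"plugin": "<name>", "arguments": {{...}}}}.\n'
    "Plug-in name: {name}\n"
    "Description: {description}\n"
    "Parameters: {parameters}\n"
    "User request: {request}"
)

def build_prompt(plugin: dict, request: str) -> str:
    # The plug-in description and parameters are loaded into the template
    # to form the prompt information handed to the large model.
    return PROMPT_TEMPLATE.format(
        name=plugin["name"],
        description=plugin["description"],
        parameters=", ".join(plugin["parameters"]),
        request=request,
    )
```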
And carrying out instruction understanding on the prompt information, and determining the type of the service to be executed and the application program interface.
It should be noted that the plug-in name and description allow the large model to infer the plug-in name from the description. The prompt word template carries the plug-in and defines the standard input and output. The prompt word is built from the names and descriptions in the parameter list dynamically acquired from the plug-in. The text information obtained from the user's voice input is then reasoned over by the large model to obtain the parameter mapping values and the plug-in name, completing the link between the model and the plug-in.
In this way, a plug-in framework is integrated into the AI large model and various plug-ins are integrated through that framework, so that the limitation of a conventional AI large model that can only call a specific plug-in is avoided: any plug-in can be called by applying the prompt word. This greatly improves the AI large model's ability to call various plug-ins and also expands the interactive processing capability of the AI model.
In the foregoing embodiment, before the preset plug-in associated with the user intention and the semantic information is loaded, the method further includes: registering a plug-in, adding a plug-in name, a function description, and plug-in parameters to the registered plug-in, and persisting the registered plug-in, wherein the artificial intelligence large model infers the plug-in name from the function description and completes the link between the artificial intelligence large model and the plug-in according to the plug-in name and the plug-in parameters.
Specifically, a plug-in is registered with its related information, including the plug-in name, function description, and plug-in parameters, and this registration information is persisted for future use. When the description information is added, it must enable the artificial intelligence large model to infer the plug-in name from the function description; the link between the artificial intelligence large model and the plug-in is then completed according to the plug-in name and the plug-in parameters.
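A hedged sketch of plug-in registration and persistence is given below; the dataclass layout and the JSON file are assumptions, the essential point being that the name, function description, and parameters are recorded so the large model can reason over them later.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class PluginSpec:
    name: str
    description: str                                # the model infers the name from this
    parameters: dict = field(default_factory=dict)  # parameter name -> description

REGISTRY: dict[str, PluginSpec] = {}

def register_plugin(spec: PluginSpec, path: str = "plugins.json") -> None:
    REGISTRY[spec.name] = spec
    with open(path, "w", encoding="utf-8") as f:    # persist the registry
        json.dump({n: asdict(s) for n, s in REGISTRY.items()}, f, ensure_ascii=False)

# register_plugin(PluginSpec("weather", "query the weather for a city and date",
#                            {"city": "city name", "date": "YYYY-MM-DD"}))
```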
Optionally, a plurality of registered plug-ins are trained asynchronously, so that labeling and training for each plug-in are completed quickly and the registered plug-ins are quickly made available to the AI large model.
In the above embodiment, calling the corresponding plug-in to respond according to the service type and the application program interface includes: using the artificial intelligence large model to judge, from the current service type and the application program interface, whether to call a plug-in; if so, determining the output parameters according to the plug-in name and triggering the executor associated with those parameters to respond; if not, responding with the artificial intelligence large model itself.
It should be understood that whether a plug-in needs to be called is judged from the current service type and the application program interface. If so, the plug-in name is determined from the service type and the application program interface so that the parameters describing the plug-in can be output quickly, which helps the AI large model call the plug-in and respond smoothly; otherwise, i.e., when no plug-in is called, the AI large model outputs the result directly.
In the above embodiment, the executor (i.e., the plug-in) includes at least one of: an interface executor, a script executor, an interface control executor, a flow executor, and an external knowledge base.
Specifically, if the plug-in is an interface executor, an HTTP (Hypertext Transfer Protocol) request tool, a socket request tool, or a protobuf (Google Protocol Buffers, a serialization library for structured data storage and communication) request tool is called to respond; if the plug-in is a script executor, a JS (JavaScript) script analysis and execution tool or a Python script analysis and execution tool is called to respond; if the plug-in is an interface control executor, a Vue (a framework for building user interfaces) plug-in, an Android plug-in, or an iOS (Apple operating system) plug-in is called to respond; if the plug-in is a flow executor, a flow analysis and execution tool is called to respond; and if the plug-in is an external knowledge base, the AI large model responds directly.
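The dispatch over these executor types might be organized as below; the handler functions are stubs standing in for the real request, script, UI, and flow tools, and the type names are illustrative.

```python
def call_interface(payload):   # HTTP / socket / protobuf request tools
    return f"interface call with {payload}"

def run_script(payload):       # JS / Python script analysis and execution tools
    return f"script run with {payload}"

def drive_ui(payload):         # Vue / Android / iOS front-end plug-ins
    return f"ui action with {payload}"

def run_flow(payload):         # flow analysis and execution tool
    return f"flow executed with {payload}"

def dispatch(executor_type: str, payload: dict, llm_answer):
    handlers = {"interface": call_interface, "script": run_script,
                "ui_control": drive_ui, "flow": run_flow}
    # External knowledge base / no matching plug-in: the large model answers directly.
    return handlers.get(executor_type, llm_answer)(payload)
```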
In this way, first, by supporting calls to various plug-ins, the large model can be seamlessly integrated with many different plug-ins, achieving more functions and application scenarios and avoiding redevelopment or modification of the original model code, which improves development efficiency and flexibility. Second, plug-ins are typically optimized for a particular function, so development from scratch is avoided and existing plug-ins save time and effort; the plug-in design is also convenient to maintain and update, reducing development and maintenance costs. Third, through the plug-in design, developers can flexibly customize and adjust the functions and behavior of the model to specific requirements, meeting the needs of different fields and scenarios and improving the pertinence and market competitiveness of the application. Fourth, with a plug-in design, different teams and developers can collaborate better: each plug-in can be developed and tested independently and then combined and integrated with others, reducing communication cost and conflicts during development.
In summary, the present application proposes a new virtual digital human interaction processing method: input voice information is received, converted into text information, and feature extraction is performed to obtain a text feature vector; the text feature vector is input into a preset document vector library for matching; if a document vector associated with the text feature vector is matched in the preset document vector library, the document vector is output; if not, the text feature vector is responded to by the artificial intelligence large model and a response result is output. In this application, the text feature vector is either matched directly in the preset document vector library for output, or handled through the large model calling a plug-in. On the one hand, the artificial intelligence large model's strong natural language processing capability improves the interaction experience, allowing the virtual digital person to interact with the user more naturally, while the matching guarantees accurate output and further improves the interaction capability of the virtual digital person; on the other hand, combining the artificial intelligence large model lets the virtual digital person interact in real time, better meeting user requirements and suiting more scenarios; finally, the approach is highly extensible, and the performance of the virtual digital person can keep improving as data grows and the algorithms are optimized.
Fig. 5 is a block diagram of a virtual digital human interaction-based processing system shown in an exemplary embodiment of the present application. The system may be applied to the implementation environment shown in Fig. 1 and is specifically configured in a server. The virtual digital human interaction-based processing system is also applicable to other exemplary implementation environments and may be configured in other devices or terminals; this embodiment does not limit the implementation environments to which it applies.
As shown in fig. 5, the present invention further provides a virtual digital human interaction processing system 500, which includes:
the text conversion module 501 is configured to receive input voice information, convert the voice information into text information, and perform feature extraction to obtain a text feature vector;
the first interaction module 502 is configured to input the text feature vector into a preset document vector library for matching; if the document vector which is associated with the text feature vector is matched in the preset document vector library, outputting the document vector;
and a second interaction module 503, configured to respond to the text feature vector by using an artificial intelligence large model if the document vector associated with the text feature vector is not matched in the preset document vector library, and output a response result.
It should be noted that, the virtual digital human interaction-based processing system provided in the above embodiment and the virtual digital human interaction-based processing method provided in the above embodiment belong to the same concept, and the specific manner in which each module and unit perform the operation has been described in detail in the virtual digital human interaction-based processing method embodiment, which is not described herein again. In practical application provided by the above embodiment, the above function allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules to perform all or part of the functions described above, which is not limited herein.
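A structural sketch of the three modules of system 500 is given below; the class and method names are illustrative and not taken from the application.

```python
class TextConversionModule:            # cf. module 501
    def __init__(self, asr, embedder):
        self.asr, self.embedder = asr, embedder

    def run(self, audio):
        text = self.asr(audio)         # speech-to-text
        return self.embedder(text)     # text feature vector

class FirstInteractionModule:          # cf. module 502
    def __init__(self, doc_vector_store):
        self.store = doc_vector_store

    def run(self, vector):
        return self.store.match(vector)   # None when nothing associates

class SecondInteractionModule:         # cf. module 503
    def __init__(self, llm):
        self.llm = llm

    def run(self, vector):
        return self.llm.respond(vector)

def handle_utterance(audio, m501, m502, m503):
    vector = m501.run(audio)
    # Fall back to the large model only when the vector library has no match.
    return m502.run(vector) or m503.run(vector)
```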
The embodiment of the application also provides a mobile terminal, which comprises: a memory for storing a computer program, which may also be configured to store various other data to support operation of the mobile terminal; examples of such data include instructions for any application or method operating on the mobile terminal.
The memory may be implemented by any type of volatile or nonvolatile memory device or combination thereof, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disk.
And the communication component is used for carrying out data transmission with other devices.
And a processor that executes the computer instructions stored in the memory, causing the processor to perform the virtual digital human interaction-based processing method described in Fig. 2.
It should be noted that a mobile terminal refers to a terminal held by a user, such as a smart phone, tablet computer, or notebook, and generally has Internet access capability, for example through a subscriber identity module (SIM, Subscriber Identity Module) or universal subscriber identity module (USIM, Universal Subscriber Identity Module) provided in the mobile terminal for cellular communication to access the Internet.
The embodiment of the application also provides a vehicle-mounted terminal, which comprises: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the apparatus to perform the virtual digital human interaction based processing method described in fig. 2.
It should be noted that the vehicle-mounted terminal is a terminal disposed in the vehicle for realizing man-machine interaction. It provides various optional functions, such as viewing and controlling the vehicle state, reserving and using vehicle-related services (car washing, refueling, etc.), Internet-based applications such as social networking and web browsing, as well as vehicle positioning and navigation, multimedia playback, and the like.
The embodiment of the application also provides a virtual digital human interaction-based processing device, which may comprise: one or more processors; and one or more machine readable media having instructions stored thereon that, when executed by the one or more processors, cause the device to perform the virtual digital human interaction-based processing method described in Fig. 2. In practical applications, the device may be used as a terminal device or as a server; examples of terminal devices include smart phones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, in-vehicle computers, desktop computers, set-top boxes, smart televisions, wearable devices, etc.; the embodiments of the present application are not limited to specific devices.
The embodiment of the application also provides a non-volatile readable storage medium, in which one or more modules (programs) are stored; when the one or more modules are applied to a device, the device may be caused to execute the instructions of the steps included in the virtual digital human interaction processing method of Fig. 2 according to the embodiment of the application.
Fig. 6 is a schematic hardware structure of a terminal device according to an embodiment of the present application. As shown, the terminal device may include: an input device 1100, a first processor 1101, an output device 1102, a first memory 1103 and at least one communication bus 1104. The communication bus 1104 is used to enable communication connections between the elements. The first memory 1103 may include a high-speed RAM memory, and may further include a nonvolatile memory NVM, such as at least one magnetic disk memory, where various programs may be stored in the first memory 1103 to perform various processing functions and implement the steps of the virtual digital human interaction processing method according to the present embodiment.
Alternatively, the first processor 1101 may be implemented as, for example, a central processing unit (Central Processing Unit, abbreviated as CPU), an Application Specific Integrated Circuit (ASIC), a Digital Signal Processor (DSP), a Digital Signal Processing Device (DSPD), a Programmable Logic Device (PLD), a Field Programmable Gate Array (FPGA), a controller, a microcontroller, a microprocessor, or other electronic components, and the first processor 1101 is coupled to the input device 1100 and the output device 1102 through a wired or wireless connection.
Alternatively, the input device 1100 may include a variety of input devices, for example, may include at least one of a user-oriented user interface, a device-oriented device interface, a programmable interface of software, a camera, and a sensor. Optionally, the device interface facing to the device may be a wired interface for data transmission between devices, or may be a hardware insertion interface (such as a USB interface, a serial port, etc.) for data transmission between devices; alternatively, the user-oriented user interface may be, for example, a user-oriented control key, a voice input device for receiving voice input, and a touch-sensitive device (e.g., a touch screen, a touch pad, etc. having touch-sensitive functionality) for receiving user touch input by a user; optionally, the programmable interface of the software may be, for example, an entry for a user to edit or modify a program, for example, an input pin interface or an input interface of a chip, etc.; the output device 1102 may include a display, sound, or the like.
In this embodiment, the processor of the terminal device may include a function for executing each module in each device, and specific functions and technical effects may be referred to the above embodiments and are not described herein again.
Fig. 7 is a schematic hardware structure of a terminal device according to an embodiment of the present application. Fig. 7 is a specific embodiment of the implementation of fig. 6. As shown, the terminal device of the present embodiment may include a second processor 1201 and a second memory 1202.
The second processor 1201 executes the computer program code stored in the second memory 1202 to implement the steps of the virtual digital human interaction based processing method described in fig. 2 in the above embodiment.
The second memory 1202 is configured to store various types of data to support operations at the terminal device. Examples of such data include instructions for any application or method operating on the terminal device, such as messages, pictures, video, etc. The second memory 1202 may include a random access memory (random access memory, simply RAM) and may also include a non-volatile memory (non-volatile memory), such as at least one disk memory.
Optionally, a second processor 1201 is provided in the processing assembly 1200. The terminal device may further include: a communication component 1203, a power component 1204, a multimedia component 1205, a voice component 1206, an input/output interface 1207, and/or a sensor component 1208. The components and the like specifically included in the terminal device are set according to actual requirements, which are not limited in this embodiment.
The processing component 1200 generally controls the overall operation of the terminal device. The processing assembly 1200 may include one or more second processors 1201 to execute instructions to perform all or part of the steps in the virtual digital human interaction based processing method described above. Further, the processing component 1200 may include one or more modules that facilitate interactions between the processing component 1200 and other components. For example, the processing component 1200 may include a multimedia module to facilitate interaction between the multimedia component 1205 and the processing component 1200.
The power supply component 1204 provides power to the various components of the terminal device. Power supply components 1204 can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for terminal devices.
The multimedia component 1205 includes a display screen that provides an output interface between the terminal device and the user. In some embodiments, the display screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the display screen includes a touch panel, the display screen may be implemented as a touch screen to receive input signals from a user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation.
The voice component 1206 is configured to output and/or input voice signals. For example, the voice component 1206 includes a Microphone (MIC) configured to receive external voice signals when the terminal device is in an operational mode, such as a voice recognition mode. The received voice signals may be further stored in the second memory 1202 or transmitted via the communication component 1203. In some embodiments, the voice component 1206 further includes a speaker for outputting voice signals.
The input/output interface 1207 provides an interface between the processing assembly 1200 and peripheral interface modules, which may be click wheels, buttons, and the like. These buttons may include, but are not limited to: volume button, start button and lock button.
The sensor assembly 1208 includes one or more sensors for providing status assessment of various aspects for the terminal device. For example, the sensor assembly 1208 may detect an on/off state of the terminal device, a relative positioning of the assembly, and the presence or absence of user contact with the terminal device. The sensor assembly 1208 may include a proximity sensor configured to detect the presence of nearby objects in the absence of any physical contact, including detecting the distance between the user and the terminal device. In some embodiments, the sensor assembly 1208 may also include a camera or the like.
The communication component 1203 is configured to facilitate communication between the terminal device and other devices in a wired or wireless manner. The terminal device may access a wireless network based on a communication standard, such as WiFi,2G or 3G, or a combination thereof. In one embodiment, the terminal device may include a SIM card slot therein for inserting a SIM card, so that the terminal device may log into a GPRS network and establish communication with a server via the internet.
From the above, the communication component 1203, the voice component 1206, the input/output interface 1207, and the sensor component 1208 in the embodiment of fig. 7 can be implemented as the input device in the embodiment of fig. 6.
The above embodiments are merely illustrative of the principles of the present invention and its effectiveness, and are not intended to limit the invention. Modifications and variations may be made to the above-described embodiments by those skilled in the art without departing from the spirit and scope of the invention. Accordingly, all equivalent modifications and variations that can be made by those of ordinary skill in the art without departing from the spirit and technical ideas disclosed herein are intended to be covered by the claims of the present invention.

Claims (19)

1. A virtual digital human interaction-based processing method, which is characterized by comprising the following steps:
Receiving input voice information, converting the voice information into text information, and extracting features to obtain text feature vectors;
inputting the text feature vector into a preset document vector library for matching; if the document vector which is associated with the text feature vector is matched in the preset document vector library, outputting the document vector;
if the document vectors associated with the text feature vectors are not matched in the preset document vector library, responding to the text feature vectors by using an artificial intelligent large model, and outputting a response result.
2. The virtual digital human interaction based processing method of claim 1, further comprising, prior to receiving the input voice information:
according to at least one of the following characteristic data: preset appearance characteristic data, preset action characteristic data, preset expression characteristic data and preset voiceprint characteristic data, correspondingly constructing at least one of the following models of the virtual digital person: an appearance model, an action model, an expression model and a voice model; and integrating the at least one model to obtain the virtual digital person.
3. The virtual digital person interaction processing method according to claim 2, wherein an operation scheduling engine of the virtual digital person is constructed, emotion categories of texts to be output are identified by the operation scheduling engine, and expression actions corresponding to the emotion categories are determined so that the virtual digital person executes corresponding expressions and actions.
4. The virtual digital human interaction based processing method of claim 2, further comprising, prior to receiving the input voice information:
if the preset voice information is monitored or the face image is detected, the virtual digital person is awakened, so that the virtual digital person enters a human-computer interaction state.
5. The virtual digital human interaction based processing method of claim 4, further comprising, after the face image is detected:
comparing the face image with a preset portrait library to obtain user identity information; and if the user identity information is an acquaintance, controlling eyes of the virtual digital person to move along with the user, performing voice synthesis on a preset calling text by using a voice model to generate calling audio, identifying emotion types of the preset calling text, and calling the action model and the expression model to determine expression actions corresponding to the emotion types so as to enable the virtual digital person to execute corresponding expressions and actions and play the calling audio.
6. The virtual digital human interaction processing method according to claim 1, wherein before the text feature vector is input into a preset document vector library for matching, the method further comprises:
inputting the text feature vector into a preset sensitive word library for matching;
if the text feature vector contains sensitive words, identifying the types of the sensitive words in the text feature vector, and determining a preset sensitive text according to the types of the sensitive words so that a virtual digital person responds according to the preset sensitive text;
and if the text feature vector does not contain the sensitive word, triggering the text feature vector to match in a preset document vector library.
7. The virtual digital human interaction processing method according to claim 1, wherein converting the voice information into text information and extracting features to obtain text feature vectors, further comprising:
converting the voice information into text information, carrying out intention recognition on the text information, and determining the intention of a user;
extracting an entity in the text information to obtain entity information;
determining a first prompting word by inputting the entity information and the user intention into an artificial intelligent large model, and storing the first prompting word into the text information to form new text information, wherein the first prompting word is text expression meaning information;
and extracting the characteristics of the new text information to obtain text characteristic vectors.
8. The virtual digital human interaction processing method according to claim 1, wherein converting the voice information into text information and extracting features to obtain text feature vectors, further comprising:
converting the voice information into text information, carrying out intention recognition on the text information, and determining the intention of a user;
extracting an entity in the text information to obtain entity information;
determining a first prompting word by inputting the entity information and the user intention into an artificial intelligent large model, receiving a second prompting word generated by a user in response to the user prompting word, and updating the text information by using the second prompting word to form new text information, wherein the first prompting word and the second prompting word are text expression meaning information respectively;
and extracting the characteristics of the new text information to obtain text characteristic vectors.
9. The virtual digital human interaction based processing method of claim 1, wherein the responding to the text feature vector using an artificial intelligence large model and outputting a response result further comprises:
determining user intention and semantic information corresponding to the text feature vector;
loading a preset plug-in and a preset prompt word template associated with the artificial intelligent large model according to the user intention and the semantic information, and determining plug-in description and plug-in parameters of the preset plug-in;
Loading the plug-in description and plug-in parameters into the preset prompt word template to form prompt information; the prompt word template carries the plug-in and defines the standard format of input and output;
and carrying out instruction understanding on the prompt information, determining the type of the service to be executed and an application program interface, calling a corresponding plug-in unit according to the type of the service and the application program interface to respond to the text feature vector, and outputting the response result.
10. The virtual digital human interaction processing method according to claim 9, wherein before loading the preset plug-in and the preset prompt word template associated with the artificial intelligence big model according to the user intention and the semantic information, the method further comprises: registering the plug-in, adding a plug-in name, a function description and plug-in parameters to the registered plug-in, and persisting the registered plug-in, wherein the artificial intelligence large model can infer the plug-in name based on the function description, and can complete the linking of the artificial intelligence large model and the plug-in according to the plug-in name and the plug-in parameters.
11. The virtual digital human interaction based processing method of claim 1, comprising: training a preset deep learning model by using a preset training sample, and adjusting parameters of the preset deep learning model to optimize until a preset verification condition is reached, so as to obtain at least one artificial intelligent large model.
12. The virtual digital human interaction based processing method of claim 1, wherein outputting the document vector if the document vector associated with the text feature vector matches within the preset document vector library comprises:
respectively carrying out similarity calculation on each document vector according to the text feature vector to obtain vector similarity between the text feature vector and each document vector;
and determining a document vector associated with the text feature vector from the preset document vector library according to the vector similarity.
13. The virtual digital human interaction based processing method of claim 12, wherein constructing the preset document vector library comprises:
pre-acquiring initial document data, splitting the initial document data to obtain a plurality of text fragments, and generating text vectors corresponding to the text fragments;
and establishing a preset document vector library according to the corresponding relation between the text fragments and the text vectors.
14. The method of claim 13, wherein splitting the initial document data to obtain a plurality of text segments comprises:
Splitting the initial document data into a plurality of text data blocks according to a preset document splitting granularity; determining text segmentation positions from the text data blocks according to preset segmentation characters; and splitting the initial document data into a plurality of text fragments according to the text splitting position.
15. A virtual digital human interaction based processing system, the system comprising:
the text conversion module is used for receiving input voice information, converting the voice information into text information and extracting features to obtain text feature vectors;
the first interaction module is used for inputting the text feature vector into a preset document vector library for matching; if the document vector which is associated with the text feature vector is matched in the preset document vector library, outputting the document vector;
and the second interaction module is used for responding to the text feature vector by utilizing the artificial intelligence large model and outputting a response result if the document vector associated with the text feature vector is not matched in the preset document vector library.
16. A mobile terminal, characterized by comprising a memory, a processor and a communication component; the memory is used for storing a computer program; the processor is configured to execute the computer program to implement the virtual digital human interaction-based processing method as claimed in one or more of claims 1-14.
17. A vehicle-mounted terminal, characterized by one or more processors; and one or more machine readable media having instructions stored thereon, which when executed by the one or more processors, implement the virtual digital human interaction-based processing method of one or more of claims 1-14.
18. A virtual digital human interaction-based processing device, comprising:
one or more processors; and
one or more machine readable media having instructions stored thereon, which when executed by the one or more processors, implement the virtual digital human interaction-based processing method of one or more of claims 1-14.
19. One or more machine readable media having instructions stored thereon, which when executed by one or more processors, implement the virtual digital human interaction-based processing method as recited in one or more of claims 1-14.
CN202311473237.7A 2023-11-07 2023-11-07 Virtual digital human interaction processing method, system, terminal, equipment and medium Pending CN117520498A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311473237.7A CN117520498A (en) 2023-11-07 2023-11-07 Virtual digital human interaction processing method, system, terminal, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311473237.7A CN117520498A (en) 2023-11-07 2023-11-07 Virtual digital human interaction processing method, system, terminal, equipment and medium

Publications (1)

Publication Number Publication Date
CN117520498A true CN117520498A (en) 2024-02-06

Family

ID=89744919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311473237.7A Pending CN117520498A (en) 2023-11-07 2023-11-07 Virtual digital human interaction processing method, system, terminal, equipment and medium

Country Status (1)

Country Link
CN (1) CN117520498A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117908683A (en) * 2024-03-19 2024-04-19 深圳市起立科技有限公司 Intelligent mobile AI digital human interaction method and system based on transparent display equipment
CN117908683B (en) * 2024-03-19 2024-05-28 深圳市起立科技有限公司 Intelligent mobile AI digital human interaction method and system based on transparent display equipment

Similar Documents

Publication Publication Date Title
CN111444428B (en) Information recommendation method and device based on artificial intelligence, electronic equipment and storage medium
WO2020147428A1 (en) Interactive content generation method and apparatus, computer device, and storage medium
CN107846350B (en) Method, computer readable medium and system for context-aware network chat
KR101886373B1 (en) Platform for providing task based on deep learning
US20210012766A1 (en) Voice conversation analysis method and apparatus using artificial intelligence
CN109145213B (en) Historical information based query recommendation method and device
CN112328849B (en) User portrait construction method, user portrait-based dialogue method and device
CN111428010B (en) Man-machine intelligent question-answering method and device
CN111191136A (en) Information recommendation method and related equipment
CN114238690A (en) Video classification method, device and storage medium
CN112632244A (en) Man-machine conversation optimization method and device, computer equipment and storage medium
CN116018791A (en) Multi-person call using single request in assistant system
KR20200080400A (en) Method for providing sententce based on persona and electronic device for supporting the same
CN113095085A (en) Text emotion recognition method and device, electronic equipment and storage medium
CN115269828A (en) Method, apparatus, and medium for generating comment reply
CN116701811B (en) Webpage processing method, device, equipment and computer readable storage medium
US20220051661A1 (en) Electronic device providing modified utterance text and operation method therefor
CN116910202A (en) Data processing method and related equipment
CN117520498A (en) Virtual digital human interaction processing method, system, terminal, equipment and medium
CN114218356A (en) Semantic recognition method, device, equipment and storage medium based on artificial intelligence
US20210224686A1 (en) Updating training examples for artificial intelligence
CN117520500A (en) Virtual digital human interaction-based method, system, terminal, equipment and medium
WO2021047103A1 (en) Voice recognition method and device
CN114970494A (en) Comment generation method and device, electronic equipment and storage medium
CN117520497A (en) Large model interaction processing method, system, terminal, equipment and medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination