CN118314255A - Display method, apparatus, device, readable storage medium, and computer program product - Google Patents

Display method, apparatus, device, readable storage medium, and computer program product

Info

Publication number
CN118314255A
Authority
CN
China
Prior art keywords: key point, point data, data, dialogue, bone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410404201.1A
Other languages
Chinese (zh)
Inventor
胡良军
洪毅强
王琦
张建立
刘泽凡
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
MIGU Comic Co Ltd
Original Assignee
Migu Cultural Technology Co Ltd
China Mobile Communications Group Co Ltd
MIGU Comic Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Migu Cultural Technology Co Ltd, China Mobile Communications Group Co Ltd, MIGU Comic Co Ltd filed Critical Migu Cultural Technology Co Ltd
Priority to CN202410404201.1A priority Critical patent/CN118314255A/en
Publication of CN118314255A publication Critical patent/CN118314255A/en
Pending legal-status Critical Current


Landscapes

  • Processing Or Creating Images (AREA)

Abstract

The application discloses a display method, apparatus, device, readable storage medium, and computer program product. The method of the embodiments of the application comprises the following steps: acquiring dialogue content between a user and a virtual object, and video data captured during the dialogue between the user and the virtual object; recognizing the dialogue content to obtain a dialogue recognition result, and recognizing the video data to obtain first skeleton key point data of the user; determining second skeleton key point data according to the dialogue recognition result and the first skeleton key point data; and controlling the display of the virtual object according to the second skeleton key point data.

Description

Display method, apparatus, device, readable storage medium, and computer program product
Technical Field
The present application belongs to the technical field of communication, and in particular relates to a display method, apparatus, device, readable storage medium and computer program product.
Background
At present, in metaverse scene experiences, most virtual objects show no limb movement, or only a single repeated movement, when conversing with a user. This gives the user the feeling of not chatting in a real-world scene, so the sense of interactive immersion is poor and the user experience suffers.
Disclosure of Invention
The embodiment of the application provides a display method, apparatus, device, readable storage medium and computer program product, which can solve the problems of poor interactive immersion and poor user experience in the existing interaction between virtual objects and users.
In a first aspect, a display method is provided, including:
acquiring dialogue content of a user and a virtual object and video data in the dialogue process of the user and the virtual object;
Identifying the dialogue content to obtain a dialogue identification result, and identifying the video data to obtain first skeleton key point data of the user;
determining second bone key point data according to the dialogue identification result and the first bone key point data;
and controlling the display of the virtual object according to the second skeleton key point data.
In some embodiments, the identifying the video data to obtain first bone keypoint data includes:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the performing pose estimation on the image frame sequence to obtain two-dimensional keypoint data includes:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the filtering the two-dimensional keypoint data to obtain the first bone keypoint data includes:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a fully-connected residual layer.
In some embodiments, the determining adaptive stable timing differential data from the N two-dimensional keypoint data and the N-1 keypoint differential data includes:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining second bone keypoint data from the dialog recognition result and the first bone keypoint data includes:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining the second bone keypoint data from the at least one second pre-stored bone keypoint data according to the dialogue scene and the first bone keypoint data in the dialogue recognition result includes:
calculating a first similarity of each second pre-stored bone key point data according to a dialogue scene, a first weight and each second pre-stored bone key point data in the dialogue identification result;
Calculating a second similarity of each second pre-stored bone keypoint data according to the first bone keypoint data, the second weight and each second pre-stored bone keypoint data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
And determining second pre-stored bone key point data which meet a preset condition according to the third similarity as the second bone key point data.
In a second aspect, there is provided a display device including:
The acquisition module is used for acquiring dialogue contents of a user and a virtual object and video data in the dialogue process of the user and the virtual object;
The identification module is used for identifying the dialogue content to obtain a dialogue identification result, and identifying the video data to obtain first skeleton key point data of the user;
The determining module is used for determining second bone key point data according to the dialogue identification result and the first bone key point data;
And the display module is used for controlling the display of the virtual object according to the second skeleton key point data.
In some embodiments, the identification module is configured to:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the identification module is configured to:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the identification module is configured to:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a fully-connected residual layer.
In some embodiments, the identification module is configured to:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining module is configured to:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining module is configured to:
calculating a first similarity of each second pre-stored bone key point data according to a dialogue scene, a first weight and each second pre-stored bone key point data in the dialogue identification result;
Calculating a second similarity of each second pre-stored bone keypoint data according to the first bone keypoint data, the second weight and each second pre-stored bone keypoint data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
And determining second pre-stored bone key point data which meet a preset condition according to the third similarity as the second bone key point data.
In a third aspect, there is provided an apparatus comprising a processor and a memory storing a program or instructions executable on the processor, which, when executed by the processor, implement the steps of the method according to the first aspect.
In a fourth aspect, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, a chip is provided, the chip comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute programs or instructions for implementing the method according to the first aspect.
In a sixth aspect, there is provided a computer program/program product stored in a storage medium, the program/program product being executed by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, a dialogue recognition result is obtained by recognizing the dialogue content between the user and the virtual object, first bone key point data of the user is obtained by recognizing the video data captured during the dialogue between the user and the virtual object, second bone key point data is determined according to the dialogue recognition result and the first bone key point data, and the display of the virtual object is controlled according to the second bone key point data. In the process of interaction between the user and the virtual object, skeletal key point data of the virtual object can be matched in real time according to the user's behavior, and the virtual object is driven to execute the corresponding action based on the matched skeletal key point data, so that the virtual object can give corresponding feedback behavior based on the user's behavior, the immersion of the interaction between the user and the virtual object is enhanced, and the user experience is improved.
Drawings
FIG. 1 is a schematic flow chart of a display method according to an embodiment of the present application;
FIG. 2a is a first application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2b is a second application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2c is a third application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2d is a fourth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2e is a fifth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2f is a sixth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2g is a seventh application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2h is an eighth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2i is a ninth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2j is a tenth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2k is an eleventh application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2l is a twelfth application schematic diagram of a display method according to an embodiment of the present application;
FIG. 2m is a thirteenth application schematic diagram of a display method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a display device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus according to an embodiment of the present application;
FIG. 5 is a second schematic diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the application, fall within the scope of protection of the application.
The terms "first," "second," and the like, herein, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the "first" and "second" distinguishing between objects generally are not limited in number to the extent that the first object may, for example, be one or more. Furthermore, "and/or" in the present application means at least one of the connected objects. For example, "a or B" encompasses three schemes, scheme one: including a and excluding B; scheme II: including B and excluding a; scheme III: both a and B. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The term "indication" according to the application may be either a direct indication (or an explicit indication) or an indirect indication (or an implicit indication). The direct indication may be understood that the sender explicitly informs the specific information of the receiver, the operation to be executed, the request result, and other contents in the sent indication; the indirect indication may be understood as that the receiving side determines corresponding information according to the indication sent by the sending side, or determines and determines an operation or a request result to be executed according to a determination result.
The display method provided by the embodiment of the application is described in detail below through some embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application provides a display method, including:
step 101: acquiring dialogue content between a user and a virtual object and video data in the dialogue process between the user and the virtual object;
Step 102: identifying dialogue content to obtain dialogue identification results, and identifying video data to obtain first bone key point data of a user;
step 103: determining second bone key point data according to the dialogue recognition result and the first bone key point data;
step 104: and controlling the display of the virtual object according to the second bone key point data.
The application scene targeted by the embodiment of the application may be a scene in which a user interacts with a digital human in the metaverse, or a scene in which a user interacts with another type of virtual object, for example a scene in which a user interacts with a game non-player character (NPC) in a virtual game scene. The embodiment of the application does not limit the specific type of the application scene or of the virtual object, as long as the user interacts with a virtual object in the scene and the relevant data involved in the scheme of the application can be obtained.
The dialogue content between the user and the virtual object may be determined based on content input by the user. For example, the dialogue content for the conversation with the virtual object may be extracted from text information input by the user, or extracted from speech information input by the user via a speech recognition function; this may be implemented with an existing dialogue content extraction method.
The video data during the dialogue between the user and the virtual object refers to video data captured by a video capture device while the user converses with the virtual object. The video capture device may be an external camera, or a smart device worn by the user, such as a VR device.
In the embodiment of the application, a dialogue recognition result is obtained by recognizing the dialogue content between the user and the virtual object, first bone key point data of the user is obtained by recognizing the video data captured during the dialogue between the user and the virtual object, second bone key point data is determined according to the dialogue recognition result and the first bone key point data, and the display of the virtual object is controlled according to the second bone key point data. In the process of interaction between the user and the virtual object, skeletal key point data of the virtual object can be matched in real time according to the user's behavior, and the virtual object is driven to execute the corresponding action based on the matched skeletal key point data, so that the virtual object can give corresponding feedback behavior based on the user's behavior, the immersion of the interaction between the user and the virtual object is enhanced, and the user experience is improved.
It should be noted that, the above-mentioned recognition of the dialogue content to obtain the dialogue recognition result may be specifically implemented based on the existing semantic recognition technology.
In some embodiments, identifying the video data to obtain first bone keypoint data includes:
(1) Determining a sequence of image frames from the video data;
(2) Carrying out attitude estimation on the image frame sequence to obtain two-dimensional key point data;
(3) And filtering the two-dimensional key point data to obtain first skeleton key point data.
In the embodiment of the application, the recognition processing of the video data is as follows: the video data is divided into a plurality of image sequences, and pose estimation is then performed on the image sequences, i.e., human body 2D key points are predicted by a human body pose estimation network. Considering that directly using the human body 2D key points output by a deep learning model leads to key point jitter, filtering preprocessing further needs to be performed on the model output, which improves the accuracy of the data processing.
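As an illustration of this recognition branch, the following sketch splits a video into frames, runs a per-frame pose estimator, and applies the filtering preprocessing. It is a minimal sketch only: estimate_2d_keypoints and filter_keypoints are hypothetical callables standing in for the pose estimation model and the filtering model described below, not an implementation from this application.

```python
import cv2  # OpenCV, used here only to read the video stream frame by frame


def video_to_first_keypoints(video_path, estimate_2d_keypoints, filter_keypoints, window_size=8):
    """Sketch of the video branch: frames -> per-frame 2D key points -> filtered key points.

    estimate_2d_keypoints and filter_keypoints are passed in as callables standing in for
    the pose estimation model and the filtering model described later.
    """
    capture = cv2.VideoCapture(video_path)
    frames = []
    while True:
        ok, frame = capture.read()
        if not ok:
            break
        frames.append(frame)
    capture.release()

    # Pose estimation on each frame of the image sequence.
    raw_keypoints = [estimate_2d_keypoints(frame) for frame in frames]

    # Filtering preprocessing over sliding windows of N frames to suppress jitter.
    return filter_keypoints(raw_keypoints, window_size)  # the user's first skeletal key point data
```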
In some embodiments, performing pose estimation on a sequence of image frames to obtain two-dimensional keypoint data includes:
(1) Determining N image frames according to the image frame sequence;
(2) Sequentially inputting N image frames into a gesture estimation model to obtain N two-dimensional key point data, wherein the gesture estimation model is a convolutional neural network model;
Wherein N is a positive integer greater than 1.
In the embodiment of the application, the number N of image frames to be filtered is set, and the N image frames are sequentially input into the pose estimation model to obtain N two-dimensional key point data.
It can be understood that N may be equal to the total number of image frames in the image frame sequence, i.e., pose estimation is performed on all image frames, or N may be smaller than the total number of image frames in the image frame sequence, i.e., pose estimation is performed on only some of the image frames; this is set flexibly according to actual technical requirements.
In some embodiments, filtering the two-dimensional keypoint data to obtain first bone keypoint data comprises:
(1) Performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
(2) Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
(3) Inputting the self-adaptive stable time sequence differential data and N two-dimensional key point data into a filtering model to obtain first skeleton key point data;
the network structure of the filtering model comprises a fully-connected residual layer.
In the embodiment of the application, self-adaptive stable time sequence differential data are determined based on N two-dimensional key point data and N-1 key point differential data between adjacent frames, and then first skeleton key point data are obtained through prediction of a filtering model with a full-connection residual layer.
The self-adaptive stable time sequence differential data can effectively stabilize N two-dimensional key point data, and when the jump of certain key point data is larger, the residual error value is reduced in a self-adaptive manner, so that the jump range of a prediction result is narrowed.
In some embodiments, determining adaptive stable timing differential data from the N two-dimensional keypoint data and the N-1 keypoint differential data comprises:
(1) Determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
(2) Determining a mean value of the change rates of the key point data according to the change rates of the key point data of two adjacent frames;
(3) And determining self-adaptive stable time sequence differential data according to the change rate of the key point data of two adjacent frames, the average value of the change rate of the key point data and N-1 key point differential data.
In the embodiment of the application, the mean of the key point data change rates over the N two-dimensional key point data is calculated, the key point data change rate of two adjacent frames is calculated, the self-adaptive stable time sequence differential data is then computed through a conversion, and finally the computed self-adaptive stable time sequence differential data is used as a branch input of the filtering network.
In some embodiments, determining second bone keypoint data from the dialog recognition result and the first bone keypoint data comprises:
(1) Determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue recognition results;
The dialogue recognition result comprises emotion values (for example, the dialogue emotion of a user can be indicated to be positive, negative or neutral) and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
For example: the NPC database comprises an NPC human body action library and an NPC corpus. The database is created mainly from a large amount of movie dialogue scene data and short video dialogue scene data; the collected dialogue text content is converted into feature vectors in batches, and the collected human body action videos are converted into human body skeleton driving data in batches. The basic action library within the NPC human body action library mainly comprises common actions such as waving, bowing and blessing gestures.
It should be noted that the NPC human body action library and the NPC corpus may be built based on the prior art, or by using the same recognition processing methods as for the user's dialogue recognition result and first bone key point data; that is, the NPC dialogue content material is collected from a large amount of movie dialogue scene data and short video dialogue scene data in the same manner as the user dialogue recognition, and the NPC skeletal key point data is correspondingly collected from the same data in the same manner as the user's skeletal key point recognition.
It can be understood that when the virtual object database is built, each pre-stored skeletal key point data in the virtual object action library has a corresponding emotion value and a corresponding dialogue scene, so that the corresponding skeletal key point data can be matched in the virtual object action library based on the dialogue recognition result of the user.
(2) Calculating the similarity between each first matching result and the dialogue recognition result; executing (3) or executing (4) and (5) according to the result;
First, the dialogue recognition result obtained from the user's dialogue content is matched against the virtual object corpus to obtain several groups of similar text matching results, and the similarity between each first matching result and the dialogue recognition result is then calculated. The specific similarity calculation needs to combine the emotion value and the dialogue scene, and an existing similarity calculation method may be used, such as Euclidean distance similarity.
(3) Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in a virtual object action library as second bone key point data;
The preset similarity range may be a manually set similarity threshold used to filter the first matching results. Taking Euclidean distance similarity as an example, if the similarity calculation results of all the first matching results are larger than the threshold, the overall similarity of the first matching results is poor; in this case, the first pre-stored bone key point data serving as the second bone key point data is matched in the virtual object action library through the emotion value.
(4) Under the condition that a plurality of first matching results exist, the similarity of which is within a preset similarity range, determining at least one second pre-stored bone key point data in a virtual object action library according to a dialogue scene in a dialogue identification result;
(5) And determining second bone key point data in at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
Taking Euclidean distance similarity as an example, if the similarity calculation results of some of the first matching results are smaller than the threshold, at least part of the first matching results have high similarity; the dialogue scene and the first bone key point data are then further combined to select the second bone key point data.
In some embodiments, determining second bone keypoint data from at least one second pre-stored bone keypoint data based on the dialog scene and the first bone keypoint data in the dialog recognition result comprises:
(1) According to the dialogue scene, the first weight and each second pre-stored bone key point data in the dialogue identification result, calculating the first similarity of each second pre-stored bone key point data;
Calculating the similarity of the dialogue scene and the second pre-stored bone key point data, and adding corresponding weight; it is understood that the specific weights may be set based on empirical values.
(2) Calculating a second similarity of each second pre-stored bone key point data according to the first bone key point data, the second weight and each second pre-stored bone key point data;
Calculating the similarity between the first bone key point data and the second pre-stored bone key point data, and adding corresponding weight; it is understood that the specific weights may be set based on empirical values.
(3) Determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
After the similarity of the dialogue scene and the second pre-stored bone key point data and the similarity of the first bone key point data and the second pre-stored bone key point data are calculated, further weighting calculation is carried out based on the weights of the two data, and a final similarity result is obtained.
(4) And determining the second pre-stored bone key point data which meet the preset condition according to the third similarity as second bone key point data.
The second bone key point data is determined using the final similarity result; the preset condition may be a similarity threshold, or the candidate with the best similarity may be selected directly.
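A minimal sketch of this weighted matching is given below. The inverse-distance similarity form, the candidate data layout, and the example weights are assumptions for illustration; the application only states that the weights are set from empirical values.

```python
import numpy as np


def euclidean_similarity(a, b):
    """Turn a Euclidean distance into a similarity in (0, 1]; an assumed form."""
    return 1.0 / (1.0 + np.linalg.norm(np.asarray(a, dtype=float) - np.asarray(b, dtype=float)))


def third_similarity(scene_vec, user_keypoints, candidate, w_scene=0.4, w_keypoints=0.6):
    """Weighted combination of the first (scene) and second (key point) similarities."""
    s1 = euclidean_similarity(scene_vec, candidate["scene_vec"])       # first similarity
    s2 = euclidean_similarity(user_keypoints, candidate["keypoints"])  # second similarity
    return w_scene * s1 + w_keypoints * s2                             # third similarity


def pick_second_keypoints(scene_vec, user_keypoints, candidates):
    """Select the second pre-stored key point data with the best third similarity."""
    best = max(candidates, key=lambda c: third_similarity(scene_vec, user_keypoints, c))
    return best["keypoints"]
```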
The following further describes the technical scheme of the present application:
the general flow chart of the scheme is shown in fig. 2 a:
The main technical steps are as follows:
The user terminal:
step one, acquiring dialogue content of a user and NPC and identifying dialogue scenes according to semantic analysis;
step two, acquiring video stream data of a user, and identifying skeleton key points of human body actions through skeleton driving of the scheme;
thirdly, carrying out quantization weighting calculation on the results of the first step and the second step, and subsequently using the calculation result for NPC action matching;
NPC end:
Step one, constructing an NPC database (comprising an NPC human body action library and an NPC corpus);
Step two, performing action matching on the user terminal identification result and the NPC action library;
step three, obtaining a matched result to perform action driving on the NPC;
The detailed steps are set forth below:
User terminal
Step one, semantic analysis of user dialog content
The flow of the semantic analysis module is mainly as shown in fig. 2 b:
(1) Firstly, establishing a corpus by collected batch dialogue scene data (text and video);
(2) Performing data preprocessing on the training text data, including word segmentation, stop-word removal and the like;
(3) Performing feature engineering on the preprocessed data, for example with TF-IDF, Word2Vec or BERT models;
(4) Training models (an emotion classification model and a Word2Vec feature vector model);
(5) Inputting a user dialogue text, performing the processing of the steps (2) - (3), and inputting the processed result into a trained model;
(6) Outputting a result, wherein the result comprises user emotion analysis (the scheme is mainly divided into 3 types, namely positive, negative and neutral), and n groups of data (text and video) which are most similar in a matched dialogue scene library;
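For illustration, a minimal sketch of steps (2)–(6) follows, assuming TF-IDF features, a logistic-regression emotion classifier, cosine similarity for corpus matching, and jieba for Chinese word segmentation. These concrete choices, the toy corpus, and the stop-word list are assumptions for the sketch, not the application's implementation.

```python
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import cosine_similarity

# Toy corpus standing in for the collected dialogue scene data (assumed examples).
corpus_texts = ["你好，很高兴见到你", "今天天气真糟糕", "我们去看电影吧"]
emotion_labels = ["positive", "negative", "neutral"]
STOP_WORDS = {"的", "了", "是"}  # illustrative stop-word list


def preprocess(text):
    """Step (2): word segmentation and stop-word removal."""
    return " ".join(w for w in jieba.lcut(text) if w not in STOP_WORDS)


# Steps (3)-(4): feature engineering (TF-IDF here) and model training.
vectorizer = TfidfVectorizer()
corpus_matrix = vectorizer.fit_transform([preprocess(t) for t in corpus_texts])
emotion_clf = LogisticRegression(max_iter=1000).fit(corpus_matrix, emotion_labels)


def analyze(user_text, top_n=2):
    """Steps (5)-(6): classify the user's emotion and return the top-n most similar corpus entries."""
    vec = vectorizer.transform([preprocess(user_text)])
    emotion = emotion_clf.predict(vec)[0]              # positive / negative / neutral
    sims = cosine_similarity(vec, corpus_matrix)[0]
    top_idx = sims.argsort()[::-1][:top_n]
    return emotion, [(corpus_texts[i], float(sims[i])) for i in top_idx]
```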
This module can adopt the prior art; in this step, the text corpus is encoded into feature vectors so that the user's dialogue content can be matched against the text corpus. Some of the techniques are described below, taking Word2Vec as an example.
The Word2Vec model contains two methods of training word vectors: continuous bag of words (CBOW) and continuous skip-gram. The basic idea of CBOW is to predict a keyword from its context words, whereas the basic idea of skip-gram is the opposite: the context words are predicted from the keyword, as shown in fig. 2c.
The Word2Vec training procedure is as follows:
(1)CBOW:
Step 1. First, one-hot encode the context words of the keyword. The length of the resulting vector is the vocabulary size V (the total number of different words in the training corpus); each vector has shape 1×V, with a 1 at the position of the word and 0 at all other positions.
Step 2. Multiply the one-hot code of each context word by a V×N weight matrix W (each word uses the same weight matrix W), obtaining a number of 1×N vectors.
Step 3. Average these 1×N vectors element-wise to combine them into a single 1×N vector.
Step 4. Multiply this 1×N vector by the N×V matrix W' corresponding to the keyword to obtain a 1×V vector.
Step 5. Normalize this 1×V vector with a softmax layer to obtain the prediction vector of the keyword; it is not one-hot encoded but consists of many floating-point probabilities. The position with the highest probability should be the position where the keyword's one-hot code is 1.
Step 6. Compute the error between the prediction vector of the keyword and the label vector (the one-hot encoded vector), generally using cross entropy.
Step 7. Back-propagate the error to the neurons; the error is back-propagated after each forward pass so as to adjust the weight matrices W and W', similar to a BP neural network.
When the loss reaches its optimum, training is finished and the required weight matrix is obtained; word vectors can then be formed from input one-hot vectors through this weight matrix.
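The CBOW forward pass and loss of Steps 1–7 can be sketched in NumPy as follows. This is a simplified illustration (plain softmax, no negative sampling); the vocabulary size, embedding dimension and learning rate are assumed toy values.

```python
import numpy as np

V, N_DIM = 1000, 100  # vocabulary size and embedding dimension (assumed toy values)
rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(V, N_DIM))        # input weight matrix  (V x N)
W_prime = rng.normal(scale=0.01, size=(N_DIM, V))  # output weight matrix (N x V)


def cbow_step(context_ids, keyword_id, lr=0.05):
    """One CBOW training step: predict the keyword from its context words (Steps 1-7)."""
    global W, W_prime
    # Steps 1-3: one-hot context words times W, then average -> a single 1 x N hidden vector.
    h = W[context_ids].mean(axis=0)
    # Step 4: multiply by W' -> 1 x V scores.
    scores = h @ W_prime
    # Step 5: softmax normalization -> prediction vector over the vocabulary.
    probs = np.exp(scores - scores.max())
    probs /= probs.sum()
    # Step 6: cross-entropy against the one-hot label of the keyword.
    loss = -np.log(probs[keyword_id])
    # Step 7: back-propagate the error to adjust W and W' (plain gradient descent).
    grad_scores = probs.copy()
    grad_scores[keyword_id] -= 1.0
    grad_h = W_prime @ grad_scores
    W_prime -= lr * np.outer(h, grad_scores)
    W[context_ids] -= lr * grad_h / len(context_ids)
    return loss
```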
(2)Skip-gram
Step 1. First, one-hot encode the keyword.
Step 2. Multiply the one-hot code of the keyword by a V×N weight matrix W.
Step 3. Then multiply the resulting 1×N vector of the keyword by the N×V matrix W' of the context word vectors (a single shared matrix), resulting in a number of 1×V vectors.
Step 4. Normalize these 1×V vectors with a softmax layer to obtain the prediction vectors of the keyword's context words.
Step 5. Compute the error between each prediction vector of the keyword's context words and its label vector (the one-hot encoded vector) using cross entropy, and sum the resulting cross-entropy losses.
Step 6. Back-propagate the error to the neurons; the error is back-propagated after each forward pass so as to adjust the weight matrices W and W'.
Step 7. Finally, the weight matrix that forms the word vectors is obtained as W.
Step two, obtaining video stream data and outputting key points of human bones according to a bone driving module
This step mainly describes the skeleton driving module. This module provides an innovative key point filtering method in which an adaptive residual mechanism is added, so that the driving of human body actions is effectively more stable. The technical implementation flow is shown in fig. 2d, the correspondingly obtained human body 2D key points are shown in fig. 2e, and the main implementation steps are as follows:
step 1. Read video stream and divide into image frame sequences
Step2 image preprocessing
The human body region is located, the region is expanded outwards by a factor of 1.2 about its center point, the image is cropped at an aspect ratio of 16:9, and the cropped image is resized to the input size specified by the model for training and prediction.
An example of image preprocessing is shown in the lower part of fig. 2f. The detection box of the human body region obtained by localization is expanded outwards by a factor of 1.2 about its center point (shown by the enlarged dashed box), and the image is then cropped at an aspect ratio of 16:9 and fed into the model for training. If, as shown in the upper right corner of fig. 2f, the human body detection box cannot be cropped at 16:9 within the image, the box is expanded outwards to the 16:9 ratio and the region that exceeds the image is zero-padded. The 16:9 cropping of the human body detection box is performed mainly because the input size of the skeletal key point model is set to 16:9; this ensures that the cropped image is not deformed when resized to the model's specified size, so that the body proportions remain consistent with the original video.
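A minimal sketch of this preprocessing is shown below, assuming a 480×270 model input (any 16:9 size would do) and a detection box given as (x, y, w, h) in pixels; both are illustrative assumptions.

```python
import cv2
import numpy as np

MODEL_W, MODEL_H = 480, 270  # assumed 16:9 input size of the key point model


def crop_person(image, box, expand=1.2):
    """Expand the person box by 1.2x about its center, crop at 16:9 with zero padding."""
    x, y, w, h = box
    cx, cy = x + w / 2.0, y + h / 2.0
    w, h = w * expand, h * expand
    # Grow the shorter side so that the crop has a 16:9 aspect ratio.
    if w / h < 16.0 / 9.0:
        w = h * 16.0 / 9.0
    else:
        h = w * 9.0 / 16.0
    x0, y0 = int(round(cx - w / 2)), int(round(cy - h / 2))
    x1, y1 = int(round(cx + w / 2)), int(round(cy + h / 2))
    # Zero-pad any part of the crop that falls outside the image.
    canvas = np.zeros((y1 - y0, x1 - x0, 3), dtype=image.dtype)
    ix0, iy0 = max(x0, 0), max(y0, 0)
    ix1, iy1 = min(x1, image.shape[1]), min(y1, image.shape[0])
    canvas[iy0 - y0:iy1 - y0, ix0 - x0:ix1 - x0] = image[iy0:iy1, ix0:ix1]
    # Resize to the model's designated size; the aspect ratio is preserved, so no distortion.
    return cv2.resize(canvas, (MODEL_W, MODEL_H))
```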
Step 3. The human body pose estimation network predicts the human body 2D key points
In this scheme, the human body pose estimation network model mainly adopts the MobileNetV2 network structure (a lightweight convolutional neural network using depthwise separable convolutions). The human body 2D key point model flow chart is shown in fig. 2g, and the MobileNetV2 backbone network structure is shown in figs. 2h and 2i:
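Purely as an illustration, the sketch below builds a 14-keypoint heatmap head on a MobileNetV2 backbone using torchvision; the actual head structure and training details are only shown in figs. 2g–2i, so everything here beyond "MobileNetV2 backbone predicting 14 key points" is an assumption.

```python
import torch.nn as nn
from torchvision.models import mobilenet_v2

NUM_KEYPOINTS = 14  # as stated in the filtering step below, each frame has 14 skeletal key points


class PoseNet(nn.Module):
    """Assumed lightweight 2D pose model: MobileNetV2 features plus a simple heatmap head."""

    def __init__(self):
        super().__init__()
        self.backbone = mobilenet_v2(weights=None).features  # depthwise-separable CNN
        self.head = nn.Sequential(
            nn.ConvTranspose2d(1280, 256, kernel_size=4, stride=2, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(256, NUM_KEYPOINTS, kernel_size=1),  # one heatmap per key point
        )

    def forward(self, x):
        return self.head(self.backbone(x))  # B x 14 x H' x W' heatmaps


# Key point coordinates are taken as the argmax location of each heatmap (a common decoding choice).
```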
Step 4. Key point filtering preprocessing
The human body 2D key points predicted by the human body pose estimation network are obtained from Step 3. Directly using the output of the deep learning model leads to key point jitter, so further filtering preprocessing needs to be performed on the model output. The human body 2D key point filtering preprocessing steps of this scheme mainly comprise:
(1) Setting the number N of image frames to be filtered, and sequentially inputting them into the human body pose estimation network model to obtain the predicted human body 2D key point data;
(2) Performing adjacent-frame subtraction on the human body 2D key point data corresponding to the N image frames to obtain N-1 sets of key point differential data; this step is shown in figs. 2j and 2k.
(3) The data processed in step (2) is input into the human body key point filtering model provided by this scheme for training and prediction; the model structure is shown in fig. 2l. The inputs of the human body key point filtering model are N frames of 2D key point data and N-1 frames of self-adaptive stable time sequence differential data, and the model outputs the filtered N frames of 2D key point data. The function of the filtering model is to take the N frames of 2D key point data together with the N-1 frames of self-adaptive stable time sequence differential data as input and predict the filtered N frames of 2D key point data, effectively preventing key point jumping and jitter. N frames of 2D key point data: each frame contains 14 skeletal key points, and the coordinates (x, y) of each key point represent its position in that image frame; the right branch of the network receives the data of N consecutive frames at the same time, with dimension 14 × 2 × N. N-1 frames of self-adaptively preprocessed time sequence differential data: to make the prediction of each frame's key points more stable, this scheme calculates the mean change rate of the key point positions over the N image frames and the change rate α_{i,n} of the key points in two adjacent frames, computes the self-adaptive stable time sequence differential data through a conversion formula resd_{i,n} of the key point change rate, and finally uses the computed self-adaptive stable time sequence differential data as the left-branch input of the network, with data dimension 14 × 2 × (N-1).
The self-adaptive stable time sequence differential data is calculated as follows:
Difference of key points between two adjacent frames: d_{i,n} = p_{i,n} - p_{i,n-1}
Change rate of key points between two adjacent frames: α_{i,n} (formula given as an image in the original)
Mean change rate of key point positions over the N image frames (formula given as an image in the original)
Self-adaptive stable time sequence differential data: resd_{i,n} (formula given as an image in the original)
The network loss function used to optimize the error during training is as follows:
The error L_P is defined as (formula given as an image in the original)
The total loss function is: Loss = L_P
Where N is the number of sequence frames, selected herein as 8; i denotes the key point index; p_{i,n} is the coordinate of the i-th key point of the n-th frame as predicted by the human body pose network, and p_{i,n-1} is the coordinate of the i-th key point of the (n-1)-th frame as predicted by the same network; k_{i,n} is the coordinate of the i-th key point of the n-th frame as predicted by the filtering network; g_{i,n} is the ground-truth coordinate of the i-th key point of the n-th frame. The self-adaptive stable time sequence differential data designed in this scheme effectively stabilizes the coordinate data of the 14 human skeletal key points: when the jump of a certain key point is large, the residual value is adaptively reduced, narrowing the jump range of the prediction result.
The generation of training data is divided into two parts: one part is derived from a 2D human body pose network, such as OpenPose, and the other part is generated by adding random noise disturbance to the ground-truth key points. The disturbance is applied as follows:
sp_{i,n} = g_{i,n} * rand(0.8, 1.2)
sp_{i,n} denotes the generated noisy input, and rand(0.8, 1.2) denotes a random number drawn from the interval [0.8, 1.2]. Mixing the two kinds of data makes the training data more diverse and can improve the generalization capability of the model.
When the number of frames F < N, no filtering is performed; when F >= N, the N frames of data in [F-N, F] are fed into the network, and the filtered result is obtained.
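The two-branch filtering model and its sliding-window usage can be sketched as follows. Only the input and output dimensions (14 × 2 × N and 14 × 2 × (N-1), with N = 8) follow the text; the hidden width, the number of fully-connected residual blocks, and how the self-adaptive stable time sequence differential data is computed are assumptions, since the conversion formula is given only as a figure above.

```python
import torch
import torch.nn as nn

K, N = 14, 8  # 14 key points with (x, y) coordinates; N = 8 sequence frames as stated above


class ResidualFC(nn.Module):
    """One fully-connected residual block."""
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(inplace=True), nn.Linear(dim, dim))

    def forward(self, x):
        return x + self.fc(x)


class KeypointFilter(nn.Module):
    """Right branch: N frames of 2D key points; left branch: N-1 frames of differential data."""
    def __init__(self, hidden=512, blocks=3):
        super().__init__()
        self.right = nn.Linear(K * 2 * N, hidden)        # raw key point branch
        self.left = nn.Linear(K * 2 * (N - 1), hidden)   # self-adaptive differential branch
        self.trunk = nn.Sequential(*[ResidualFC(hidden) for _ in range(blocks)])
        self.out = nn.Linear(hidden, K * 2 * N)          # filtered N frames of key points

    def forward(self, keypoints, differentials):
        h = torch.relu(self.right(keypoints) + self.left(differentials))
        return self.out(self.trunk(h))


# Sliding-window usage: once F >= N frames are available, flatten the last N frames and
# the last N-1 differentials and feed them to the model to obtain the filtered key points.
```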
Step three, NPC action matching
Steps one and two obtain data such as the user's emotion analysis and dialogue scene matching results from the semantic analysis module, together with the user's human action skeleton key points. To effectively match the corresponding NPC limb actions, step three is mainly used to match the optimal NPC limb action.
The NPC human motion matching flow chart is shown in fig. 2 m:
(1) The semantic analysis module of step one is run on the user's dialogue content and matched against the NPC corpus, outputting n groups of similar text matching results;
(2) A similarity threshold is set and used to filter the matching results of step (1);
(3) If none of the filtered results falls within the threshold range, the user's emotion value (positive, negative or neutral) is output according to the emotion analysis result; the basic action library within the NPC action library is pre-labeled with corresponding emotion labels, so that the user's emotion can be matched and the NPC driven to produce the corresponding action.
(4) Otherwise, judge whether more than one group of filtered matching results remains. If so, compute the similarity S1 between the user's skeleton key points and the skeleton key point vectors of the dialogue scene, denote the text similarity as S2, and calculate the weighted value S0 = (1-p)*S1 + p*S2, where p is an empirical value set to 0.6 in this scheme; sort the S0 values in descending order and output the best matching result. If only one group of matching results remains, output it directly as the best match.
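The matching flow (1)–(4) can be summarized in the sketch below, reusing the euclidean_similarity helper sketched earlier. The candidate data layout and the use of a similarity (rather than distance) threshold are assumptions; the weight p = 0.6 and the emotion-label fallback follow the text.

```python
def match_npc_action(text_matches, user_emotion, user_keypoints,
                     emotion_action_library, threshold=0.5, p=0.6):
    """Steps (1)-(4): pick the NPC skeletal action from the corpus matching results."""
    # (2) Filter the n groups of corpus matching results with a similarity threshold.
    kept = [m for m in text_matches if m["text_similarity"] >= threshold]

    # (3) Nothing within the threshold range: fall back to the emotion label
    #     (positive / negative / neutral) pre-attached to the basic action library.
    if not kept:
        return emotion_action_library[user_emotion]  # pre-stored key points for this emotion

    # (4) Only one group left: it is the best match.
    if len(kept) == 1:
        return kept[0]["action_keypoints"]

    # (4) More than one group: weighted ranking S0 = (1-p)*S1 + p*S2, with p = 0.6.
    def weighted_score(m):
        s1 = euclidean_similarity(user_keypoints, m["scene_keypoints"])  # skeleton similarity S1
        s2 = m["text_similarity"]                                        # text similarity S2
        return (1 - p) * s1 + p * s2

    best = max(kept, key=weighted_score)  # equivalent to sorting S0 in descending order
    return best["action_keypoints"]
```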
NPC end
Step one, constructing an NPC database
The NPC database comprises an NPC human body action library and an NPC corpus. The database is created mainly from a large amount of movie dialogue scene data and short video dialogue scene data: the collected dialogue text content is converted into feature vectors in batches (the implementation steps are the same as step one on the user side), and the human body action videos are converted into human body skeleton driving data (the implementation steps are the same as step two on the user side). The basic action library within the NPC human body action library mainly comprises common actions such as waving, bowing and blessing gestures.
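One possible way to organize the NPC database entries described here is sketched below; the field names and types are illustrative assumptions.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class CorpusEntry:
    """NPC corpus: dialogue text converted into a feature vector (user-side step one)."""
    text: str
    feature_vector: List[float]
    emotion: str          # positive / negative / neutral
    scene: str            # dialogue scene label


@dataclass
class ActionEntry:
    """NPC action library: skeleton driving data converted from action video (step two)."""
    name: str                                # e.g. "wave", "bow"
    emotion: str
    scene: str
    keypoint_frames: List[List[float]] = field(default_factory=list)  # 14 x 2 values per frame


@dataclass
class NPCDatabase:
    corpus: List[CorpusEntry] = field(default_factory=list)
    actions: List[ActionEntry] = field(default_factory=list)
```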
Step two, performing action matching on the user terminal identification result and the NPC action library
The matching mainly uses the Euclidean distance similarity calculation method, and the detailed steps are already described in the step three of the user side.
Step three, obtaining a matched result to drive the NPC to act
The NPC human body motion driving mainly uses the human body skeleton driving module of step two on the user side, which outputs the corresponding human body skeleton motion data so that the NPC's limbs are driven accordingly.
Referring to fig. 3, an embodiment of the present application provides a display apparatus, including:
the acquiring module 301 is configured to acquire a session content of a user and a virtual object and video data in a session process of the user and the virtual object;
The identifying module 302 is configured to identify the dialogue content to obtain a dialogue identification result, and identify the video data to obtain first skeleton key point data of the user;
A determining module 303, configured to determine second bone key point data according to the dialogue identification result and the first bone key point data;
And the display module 304 is configured to control display of the virtual object according to the second skeletal key point data.
In some embodiments, the identification module is configured to:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the identification module is configured to:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the identification module is configured to:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a fully-connected residual layer.
In some embodiments, the identification module is configured to:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining module is configured to:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining module is configured to:
calculating a first similarity of each second pre-stored bone key point data according to a dialogue scene, a first weight and each second pre-stored bone key point data in the dialogue identification result;
Calculating a second similarity of each second pre-stored bone keypoint data according to the first bone keypoint data, the second weight and each second pre-stored bone keypoint data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
And determining second pre-stored bone key point data which meet a preset condition according to the third similarity as the second bone key point data.
The display device in the embodiment of the application may be an electronic device, for example an electronic device with an operating system, or may be a component in an electronic device, for example an integrated circuit or a chip. The electronic device may be a terminal, or may be a device other than a terminal. By way of example, the terminal may include, but is not limited to, the types of terminals listed above; the other devices may be servers, network attached storage (NAS), etc. This is not specifically limited in the embodiments of the present application.
The display device provided by the embodiment of the application can realize each process realized by the method embodiments of fig. 1 to 2 and achieve the same technical effects, and in order to avoid repetition, the description is omitted here.
As shown in fig. 4, the embodiment of the present application further provides an apparatus 400, which includes a processor 401 and a memory 402, where the memory 402 stores a program or an instruction that can be executed on the processor 401, and the program or the instruction implement each step of the embodiment of the method when executed by the processor 401, and achieve the same technical effect, and are not repeated herein.
The embodiment of the application also provides a device, comprising a processor and a communication interface, wherein the communication interface is coupled with the processor, and the processor is configured to run programs or instructions to implement the steps of the above method embodiment. This device embodiment corresponds to the above method embodiment; each implementation process and implementation manner of the method embodiment can be applied to this device embodiment, and the same technical effect can be achieved.
Specifically, an embodiment of the present application further provides a device. As shown in Fig. 5, the device 500 includes an antenna 51, a radio frequency device 52, a baseband device 53, a processor 54, and a memory 55. The antenna 51 is connected to the radio frequency device 52. In the uplink direction, the radio frequency device 52 receives information via the antenna 51 and sends the received information to the baseband device 53 for processing. In the downlink direction, the baseband device 53 processes the information to be transmitted and sends it to the radio frequency device 52, and the radio frequency device 52 processes the received information and transmits it via the antenna 51.
The method performed by the device in the foregoing embodiments may be implemented in the baseband device 53, which includes a baseband processor.
The baseband device 53 may, for example, include at least one baseband board on which a plurality of chips are disposed. As shown in Fig. 5, one of the chips, for example a baseband processor, is connected to the memory 55 through a bus interface, so as to invoke the program in the memory 55 and perform the device operations shown in the foregoing method embodiments.
The device may further include a network interface 56, for example a Common Public Radio Interface (CPRI).
Specifically, the device 500 according to this embodiment of the present application further includes instructions or a program stored in the memory 55 and executable on the processor 54. The processor 54 invokes the instructions or program in the memory 55 to perform the method performed by the modules shown in Fig. 2 and achieve the same technical effects; details are not repeated here.
An embodiment of the present application further provides a readable storage medium storing a program or instructions which, when executed by a processor, implement each process of the foregoing method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
The processor is the processor in the terminal described in the foregoing embodiments. The readable storage medium includes a computer-readable storage medium such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. In some examples, the readable storage medium may be a non-transitory readable storage medium.
An embodiment of the present application further provides a chip, which includes a processor and a communication interface, where the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement each process of the foregoing method embodiments and achieve the same technical effects; to avoid repetition, details are not described here again.
It should be understood that the chip referred to in the embodiments of the present application may also be referred to as a system-on-chip, a chip system, or the like.
The embodiments of the present application further provide a computer program/program product stored in a storage medium. The computer program/program product is executed by at least one processor to implement each process of the foregoing method embodiments and achieve the same technical effects; details are not repeated here.
It should be noted that, in this document, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of the present application is not limited to performing the functions in the order shown or discussed; the functions may also be performed in a substantially simultaneous manner or in a reverse order depending on the functions involved. For example, the described method may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the description of the foregoing embodiments, it will be apparent to those skilled in the art that the methods of the above embodiments may be implemented by means of software plus a necessary general-purpose hardware platform, or by hardware. The computer software product is stored on a storage medium (such as a ROM, a RAM, a magnetic disk, or an optical disk) and includes instructions for causing a terminal or a network-side device to perform the methods according to the embodiments of the present application.
The embodiments of the present application have been described above with reference to the accompanying drawings, but the present application is not limited to the above embodiments, which are merely illustrative and not restrictive. Many other forms may be made by those of ordinary skill in the art without departing from the spirit of the application and the scope of the claims, and these all fall within the protection of the present application.

Claims (11)

1. A display method, comprising:
acquiring dialogue content between a user and a virtual object, and video data of the dialogue process between the user and the virtual object;
recognizing the dialogue content to obtain a dialogue recognition result, and recognizing the video data to obtain first bone key point data of the user;
determining second bone key point data according to the dialogue recognition result and the first bone key point data;
and controlling display of the virtual object according to the second bone key point data.
2. The method of claim 1, wherein the recognizing the video data to obtain the first bone key point data comprises:
determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first bone key point data.
3. The method of claim 2, wherein the performing pose estimation on the image frame sequence to obtain the two-dimensional key point data comprises:
determining N image frames according to the image frame sequence;
sequentially inputting the N image frames into a pose estimation model to obtain N pieces of two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
4. The method of claim 3, wherein the filtering the two-dimensional key point data to obtain the first bone key point data comprises:
performing adjacent-frame subtraction on the N pieces of two-dimensional key point data to obtain N-1 pieces of key point differential data;
determining adaptive stable time-sequence differential data according to the N pieces of two-dimensional key point data and the N-1 pieces of key point differential data;
inputting the adaptive stable time-sequence differential data and the N pieces of two-dimensional key point data into a filtering model to obtain the first bone key point data;
wherein a network structure of the filtering model comprises a fully connected residual layer.
5. The method of claim 4, wherein the determining the adaptive stable time-sequence differential data according to the N pieces of two-dimensional key point data and the N-1 pieces of key point differential data comprises:
determining a change rate of the key point data between two adjacent frames according to the N pieces of two-dimensional key point data and the N-1 pieces of key point differential data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data between the two adjacent frames;
and determining the adaptive stable time-sequence differential data according to the change rates of the key point data between the two adjacent frames, the mean value of the change rates of the key point data, and the N-1 pieces of key point differential data.
6. The method of claim 1, wherein the determining second bone key point data according to the dialogue recognition result and the first bone key point data comprises:
determining a plurality of first matching results in a virtual object corpus of a virtual object database according to the dialogue recognition result, wherein the dialogue recognition result comprises an emotion value and a dialogue scene, the virtual object database comprises the virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pieces of pre-stored text content, the virtual object action library comprises a plurality of pieces of pre-stored bone key point data, and each piece of pre-stored bone key point data has a corresponding emotion value and a corresponding dialogue scene;
calculating a similarity between each first matching result and the dialogue recognition result;
in a case where the similarities between all of the first matching results and the dialogue recognition result fall outside a preset similarity range, determining, in the virtual object action library, first pre-stored bone key point data corresponding to the emotion value in the dialogue recognition result as the second bone key point data;
in a case where the plurality of first matching results include a first matching result whose similarity is within the preset similarity range, determining at least one piece of second pre-stored bone key point data in the virtual object action library according to the dialogue scene in the dialogue recognition result; and determining the second bone key point data from the at least one piece of second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result and the first bone key point data.
7. The method of claim 6, wherein the determining the second bone key point data from the at least one piece of second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result and the first bone key point data comprises:
calculating a first similarity of each piece of second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result, a first weight, and the piece of second pre-stored bone key point data;
calculating a second similarity of each piece of second pre-stored bone key point data according to the first bone key point data, a second weight, and the piece of second pre-stored bone key point data;
determining a third similarity of each piece of second pre-stored bone key point data according to the first similarity and the second similarity;
and determining, as the second bone key point data, the piece of second pre-stored bone key point data whose third similarity satisfies a preset condition.
8. A display device, comprising:
an acquisition module, configured to acquire dialogue content between a user and a virtual object, and video data of the dialogue process between the user and the virtual object;
a recognition module, configured to recognize the dialogue content to obtain a dialogue recognition result, and recognize the video data to obtain first bone key point data of the user;
a determining module, configured to determine second bone key point data according to the dialogue recognition result and the first bone key point data;
and a display module, configured to control display of the virtual object according to the second bone key point data.
9. An apparatus, comprising a processor and a memory, the memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the display method according to any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the display method according to any one of claims 1 to 7.
11. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the display method of any one of claims 1 to 7.
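Editorial note, for illustration only and not part of the claims: the adaptive stable time-sequence differential data recited in claims 4 and 5 can be read roughly as in the following Python sketch. The rate-of-change definition and the adaptive weighting below are assumptions, since the claims do not fix the formulas.

import numpy as np

def adaptive_stable_differential(keypoints: np.ndarray) -> np.ndarray:
    # keypoints: array of shape (N, K, 2) -- N frames of K two-dimensional key points.
    # Adjacent-frame subtraction: N-1 pieces of key point differential data.
    diffs = keypoints[1:] - keypoints[:-1]                   # (N-1, K, 2)

    # Assumed rate of change between adjacent frames: mean key point displacement.
    rates = np.linalg.norm(diffs, axis=-1).mean(axis=-1)     # (N-1,)
    mean_rate = rates.mean()

    # Assumed adaptive stabilisation: damp the differentials of frame pairs
    # that move much faster than the average rate.
    weights = np.clip(mean_rate / (rates + 1e-8), 0.0, 1.0)
    return diffs * weights[:, None, None]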
CN202410404201.1A 2024-04-03 2024-04-03 Display method, apparatus, device, readable storage medium, and computer program product Pending CN118314255A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410404201.1A CN118314255A (en) 2024-04-03 2024-04-03 Display method, apparatus, device, readable storage medium, and computer program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202410404201.1A CN118314255A (en) 2024-04-03 2024-04-03 Display method, apparatus, device, readable storage medium, and computer program product

Publications (1)

Publication Number Publication Date
CN118314255A true CN118314255A (en) 2024-07-09

Family

ID=91731310

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410404201.1A Pending CN118314255A (en) 2024-04-03 2024-04-03 Display method, apparatus, device, readable storage medium, and computer program product

Country Status (1)

Country Link
CN (1) CN118314255A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination