CN118314255A - Display method, apparatus, device, readable storage medium, and computer program product - Google Patents
- Publication number
- CN118314255A (application number CN202410404201.1A)
- Authority
- CN
- China
- Prior art keywords: key point, point data, data, dialogue, bone
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The application discloses a display method, apparatus, device, readable storage medium and computer program product. The method of the embodiments of the application comprises the following steps: acquiring the dialogue content between a user and a virtual object and the video data captured during the dialogue between the user and the virtual object; recognizing the dialogue content to obtain a dialogue recognition result, and recognizing the video data to obtain first skeletal key point data of the user; determining second skeletal key point data according to the dialogue recognition result and the first skeletal key point data; and controlling the display of the virtual object according to the second skeletal key point data.
Description
Technical Field
The present application belongs to the technical field of communication, and in particular relates to a display method, apparatus, device, readable storage medium and computer program product.
Background
At present, in metaverse scene experiences, most virtual objects show no limb movement, or only a single repeated action, when conversing with a user. This gives the user the feeling of not chatting in a real scene, the interactive immersion is poor, and the user experience is therefore poor.
Disclosure of Invention
The embodiments of the application provide a display method, apparatus, device, readable storage medium and computer program product, which can solve the problems of poor interaction immersion and poor user experience between existing virtual objects and users.
In a first aspect, a display method is provided, including:
acquiring dialogue content of a user and a virtual object and video data in the dialogue process of the user and the virtual object;
Identifying the dialogue content to obtain a dialogue identification result, and identifying the video data to obtain first skeleton key point data of the user;
determining second bone key point data according to the dialogue identification result and the first bone key point data;
and controlling the display of the virtual object according to the second skeleton key point data.
In some embodiments, the identifying the video data to obtain first bone keypoint data includes:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the performing pose estimation on the image frame sequence to obtain two-dimensional keypoint data includes:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the filtering the two-dimensional keypoint data to obtain the first bone keypoint data includes:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a full-connection residual layer.
In some embodiments, the determining adaptive stable timing differential data from the N two-dimensional keypoint data and the N-1 keypoint differential data includes:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining second bone keypoint data from the dialog recognition result and the first bone keypoint data includes:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining the second bone keypoint data from the at least one second pre-stored bone keypoint data according to the dialogue scene and the first bone keypoint data in the dialogue recognition result includes:
calculating a first similarity of each second pre-stored bone key point data according to a dialogue scene, a first weight and each second pre-stored bone key point data in the dialogue identification result;
Calculating a second similarity of each second pre-stored bone keypoint data according to the first bone keypoint data, the second weight and each second pre-stored bone keypoint data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
And determining second pre-stored bone key point data which meet a preset condition according to the third similarity as the second bone key point data.
In a second aspect, there is provided a display device including:
The acquisition module is used for acquiring dialogue contents of a user and a virtual object and video data in the dialogue process of the user and the virtual object;
The identification module is used for identifying the dialogue content to obtain a dialogue identification result, and identifying the video data to obtain first skeleton key point data of the user;
The determining module is used for determining second bone key point data according to the dialogue identification result and the first bone key point data;
And the display module is used for controlling the display of the virtual object according to the second skeleton key point data.
In some embodiments, the identification module is configured to:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the identification module is configured to:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the identification module is configured to:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a full-connection residual layer.
In some embodiments, the identification module is configured to:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining module is configured to:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining module is configured to:
calculating a first similarity of each second pre-stored bone key point data according to a dialogue scene, a first weight and each second pre-stored bone key point data in the dialogue identification result;
Calculating a second similarity of each second pre-stored bone keypoint data according to the first bone keypoint data, the second weight and each second pre-stored bone keypoint data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
And determining second pre-stored bone key point data which meet a preset condition according to the third similarity as the second bone key point data.
In a third aspect, there is provided an apparatus comprising a processor and a memory storing a program or instructions executable on the processor, which when executed by the processor, performs the steps of the method according to the first aspect.
In a fourth aspect, there is provided a readable storage medium having stored thereon a program or instructions which when executed by a processor perform the steps of the method according to the first aspect.
In a fifth aspect, a chip is provided, the chip comprising a processor and a communication interface, the communication interface being coupled to the processor, the processor being configured to execute programs or instructions for implementing the method according to the first aspect.
In a sixth aspect, there is provided a computer program/program product stored in a storage medium, the program/program product being executed by at least one processor to implement the method according to the first aspect.
In the embodiment of the application, a dialogue recognition result is obtained by recognizing the dialogue content between the user and the virtual object, first skeletal key point data of the user are obtained by recognizing the video data captured during the dialogue between the user and the virtual object, second skeletal key point data are determined based on the dialogue recognition result and the first skeletal key point data, and the display of the virtual object is controlled according to the second skeletal key point data. During the interaction between the user and the virtual object, the skeletal key point data of the virtual object can be matched in real time to the user's behaviour, and the virtual object is driven to perform the corresponding action based on the matched skeletal key point data, so that the virtual object performs corresponding feedback behaviour based on the user's behaviour; this enhances the immersion of the interaction between the user and the virtual object and improves the user experience.
Drawings
FIG. 1 is a schematic flow chart of a display method according to an embodiment of the present application;
FIG. 2a is a first schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2b is a second schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2c is a third schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2d is a fourth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2e is a fifth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2f is a sixth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2g is a seventh schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2h is an eighth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2i is a ninth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2j is a tenth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2k is an eleventh schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2l is a twelfth schematic diagram of an application of the display method according to an embodiment of the present application;
FIG. 2m is a thirteenth schematic diagram of an application of the display method according to an embodiment of the present application;
Fig. 3 is a schematic structural diagram of a display device according to an embodiment of the present application;
FIG. 4 is a schematic diagram of an apparatus according to an embodiment of the present application;
FIG. 5 is a second schematic diagram of an apparatus according to an embodiment of the present application.
Detailed Description
The technical solutions of the embodiments of the present application will be clearly described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which are derived by a person skilled in the art based on the embodiments of the application, fall within the scope of protection of the application.
The terms "first," "second," and the like, herein, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the terms so used are interchangeable under appropriate circumstances such that the embodiments of the application are capable of operation in sequences other than those illustrated or otherwise described herein, and that the "first" and "second" distinguishing between objects generally are not limited in number to the extent that the first object may, for example, be one or more. Furthermore, "and/or" in the present application means at least one of the connected objects. For example, "a or B" encompasses three schemes, scheme one: including a and excluding B; scheme II: including B and excluding a; scheme III: both a and B. The character "/" generally indicates that the context-dependent object is an "or" relationship.
The term "indication" according to the application may be either a direct indication (or an explicit indication) or an indirect indication (or an implicit indication). The direct indication may be understood that the sender explicitly informs the specific information of the receiver, the operation to be executed, the request result, and other contents in the sent indication; the indirect indication may be understood as that the receiving side determines corresponding information according to the indication sent by the sending side, or determines and determines an operation or a request result to be executed according to a determination result.
The display method provided by the embodiment of the application is described in detail below through some embodiments and application scenes thereof with reference to the accompanying drawings.
Referring to fig. 1, an embodiment of the present application provides a display method, including:
step 101: acquiring dialogue content between a user and a virtual object and video data in the dialogue process between the user and the virtual object;
Step 102: identifying dialogue content to obtain dialogue identification results, and identifying video data to obtain first bone key point data of a user;
step 103: determining second bone key point data according to the dialogue recognition result and the first bone key point data;
step 104: and controlling the display of the virtual object according to the second bone key point data.
The application scenario targeted by the embodiments of the application may be a scenario in which a user interacts with a digital human in the metaverse, or a scenario in which a user interacts with another type of virtual object, such as a game non-player character (NPC) in a virtual game scene. The embodiments of the application do not limit the specific types of application scenario and virtual object, as long as the scenario is one in which the user interacts with a virtual object and the related data involved in the scheme of the application can be obtained.
The dialogue content between the user and the virtual object may be determined from input provided by the user: for example, from text information input by the user, from which the dialogue content for the dialogue with the virtual object is extracted, or from speech information input by the user, from which the dialogue content is extracted through a speech recognition function. This may be implemented using existing dialogue-content extraction methods.
The video data during the dialogue between the user and the virtual object refer to video data captured by a video capture device during that dialogue, where the video capture device may be an external camera or a smart device worn by the user, such as a VR device.
In the embodiment of the application, a dialogue recognition result is obtained by recognizing the dialogue content between the user and the virtual object, first skeletal key point data of the user are obtained by recognizing the video data captured during the dialogue between the user and the virtual object, second skeletal key point data are determined based on the dialogue recognition result and the first skeletal key point data, and the display of the virtual object is controlled according to the second skeletal key point data. During the interaction between the user and the virtual object, the skeletal key point data of the virtual object can be matched in real time to the user's behaviour, and the virtual object is driven to perform the corresponding action based on the matched skeletal key point data, so that the virtual object performs corresponding feedback behaviour based on the user's behaviour; this enhances the immersion of the interaction between the user and the virtual object and improves the user experience.
It should be noted that, the above-mentioned recognition of the dialogue content to obtain the dialogue recognition result may be specifically implemented based on the existing semantic recognition technology.
In some embodiments, identifying the video data to obtain first bone keypoint data includes:
(1) Determining a sequence of image frames from the video data;
(2) Performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
(3) And filtering the two-dimensional key point data to obtain first skeleton key point data.
In the embodiment of the application, the recognition processing of the video data is specifically as follows: the video data are divided into a plurality of image sequences, and pose estimation is then performed on those sequences, i.e. human 2D key points are predicted by a human pose estimation network. Considering that key points output directly by a deep learning model tend to jitter, the model output further needs filtering preprocessing, which improves the accuracy of the data processing.
In some embodiments, performing pose estimation on a sequence of image frames to obtain two-dimensional keypoint data includes:
(1) Determining N image frames according to the image frame sequence;
(2) Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
Wherein N is a positive integer greater than 1.
In the embodiment of the application, the number N of image frames to be filtered is set, and the N image frames are sequentially input into the gesture estimation model to obtain N two-dimensional key point data.
It can be understood that N may equal the total number of image frames in the image frame sequence (pose estimation is performed on all frames), or N may be smaller than that total (pose estimation is performed on only part of the frames); this is set flexibly according to the actual technical requirements.
In some embodiments, filtering the two-dimensional keypoint data to obtain first bone keypoint data comprises:
(1) Performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
(2) Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
(3) Inputting the self-adaptive stable time sequence differential data and N two-dimensional key point data into a filtering model to obtain first skeleton key point data;
the network structure of the filtering model comprises a full-connection residual layer.
In the embodiment of the application, self-adaptive stable time sequence differential data are determined based on N two-dimensional key point data and N-1 key point differential data between adjacent frames, and then first skeleton key point data are obtained through prediction of a filtering model with a full-connection residual layer.
The self-adaptive stable time-sequence differential data can effectively stabilize the N two-dimensional key point data: when a given key point jumps sharply, the residual value is reduced adaptively, narrowing the jump range of the prediction result.
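As a rough illustration of such a filtering model, the following PyTorch sketch merges two branches — N frames of 2D key points and N-1 frames of differential data — through a fully-connected residual block. The layer sizes, branch fusion and names are assumptions for illustration only; the actual model structure is the one shown in the figures of this application.

```python
# Illustrative sketch only; layer sizes, branch fusion and residual placement
# are assumptions, not the published architecture of this application.
import torch
import torch.nn as nn

class KeypointFilterNet(nn.Module):
    def __init__(self, num_kpts=14, n_frames=8, hidden=256):
        super().__init__()
        in_main = num_kpts * 2 * n_frames          # N frames of 2D key points
        in_diff = num_kpts * 2 * (n_frames - 1)    # N-1 frames of differential data
        self.main_branch = nn.Sequential(nn.Linear(in_main, hidden), nn.ReLU())
        self.diff_branch = nn.Sequential(nn.Linear(in_diff, hidden), nn.ReLU())
        self.res_block = nn.Sequential(            # fully-connected residual layer
            nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, hidden))
        self.head = nn.Linear(hidden, in_main)     # filtered N frames of key points

    def forward(self, kpts, diffs):
        # kpts: (B, 14*2*N), diffs: (B, 14*2*(N-1))
        x = self.main_branch(kpts) + self.diff_branch(diffs)
        x = x + self.res_block(x)                  # residual connection
        return self.head(x)
```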
In some embodiments, determining adaptive stable timing differential data from the N two-dimensional keypoint data and the N-1 keypoint differential data comprises:
(1) Determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
(2) Determining a mean value of the change rates of the key point data according to the change rates of the key point data of two adjacent frames;
(3) And determining self-adaptive stable time sequence differential data according to the change rate of the key point data of two adjacent frames, the average value of the change rate of the key point data and N-1 key point differential data.
In the embodiment of the application, the mean of the change rates of the key point data is calculated over the N two-dimensional key point data, the change rate of the key point data between two adjacent frames is calculated, the self-adaptive stable time-sequence differential data are then obtained through a conversion, and finally the computed self-adaptive stable time-sequence differential data are used as a branch input of the filtering network.
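A minimal NumPy sketch of this kind of computation is given below. The exact conversion formula for the self-adaptive stable time-sequence differential data is not reproduced in this text, so the scaling rule used here (shrinking the differential of a key point whose local change rate exceeds its mean change rate) is an assumed stand-in, not the formula of this application.

```python
# Hedged sketch: the true conversion formula is not given in this excerpt;
# the scaling rule below is an assumed stand-in for illustration.
import numpy as np

def adaptive_stable_diff(kpts):
    # kpts: (N, 14, 2) two-dimensional key points for N consecutive frames
    diffs = kpts[1:] - kpts[:-1]                        # (N-1, 14, 2) adjacent-frame differences
    rate = np.linalg.norm(diffs, axis=-1)               # assumed "rate of change" per key point
    mean_rate = rate.mean(axis=0, keepdims=True) + 1e-6
    scale = np.minimum(1.0, mean_rate / (rate + 1e-6))  # shrink when a key point jumps hard
    return diffs * scale[..., None]                      # (N-1, 14, 2) stabilized differentials
```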
In some embodiments, determining second bone keypoint data from the dialog recognition result and the first bone keypoint data comprises:
(1) Determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue recognition results;
The dialogue recognition result comprises emotion values (for example, the dialogue emotion of a user can be indicated to be positive, negative or neutral) and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
For example, the NPC database comprises an NPC human-action library and an NPC corpus. The database is built mainly from a large amount of movie dialogue scene data and short-video dialogue scene data: the collected dialogue text content is batch-converted into feature vectors, and the human action videos are converted into human skeleton driving data. The basic action library within the NPC human-action library mainly contains common actions such as waving, bowing and blessing.
It should be noted that the NPC human-action library and the NPC corpus may be built using the prior art, or using the same recognition processing as for the user dialogue recognition result and the first skeletal key point data: that is, the NPC dialogue content material is collected from a large amount of movie dialogue scene data and short-video dialogue scene data in the same manner as the user dialogue recognition, and the NPC skeletal key point data are collected from the same data in the same manner as the user skeletal key point recognition.
It can be understood that when the virtual object database is built, each pre-stored skeletal key point data in the virtual object action library has a corresponding emotion value and a corresponding dialogue scene, so that the corresponding skeletal key point data can be matched in the virtual object action library based on the dialogue recognition result of the user.
(2) Calculating the similarity between each first matching result and the dialogue recognition result; executing (3) or executing (4) and (5) according to the result;
Firstly, the dialogue recognition result obtained from the user's dialogue content is matched against the virtual object corpus to obtain several groups of similar text matching results, and the similarity between each first matching result and the dialogue recognition result is then calculated separately. The similarity calculation combines the emotion value and the dialogue scene, and an existing similarity calculation method, such as the Euclidean-distance similarity method, can be adopted.
(3) Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in a virtual object action library as second bone key point data;
The preset similarity range may be a manually set similarity threshold used to filter the first matching results. If the Euclidean-distance similarity is calculated and the results for all first matching results are greater than the threshold, the overall similarity of the first matching results is poor; in this case, first pre-stored skeletal key point data corresponding to the emotion value are matched in the virtual object action library and used as the second skeletal key point data.
(4) Under the condition that a plurality of first matching results exist, the similarity of which is within a preset similarity range, determining at least one second pre-stored bone key point data in a virtual object action library according to a dialogue scene in a dialogue identification result;
(5) And determining second bone key point data in at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
If the Euclidean-distance similarity is calculated and the results for several first matching results are smaller than the threshold, at least some of the first matching results have a high similarity; the dialogue scene and the first skeletal key point data are then further combined to select the second skeletal key point data.
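A hedged sketch of the branching logic in (2)–(5) is shown below: corpus matches are filtered by Euclidean-distance similarity, and if every match falls outside the preset range the method falls back to the action pre-stored for the recognised emotion value. The data structures and field names are hypothetical.

```python
# Hypothetical data structures; only the branching logic mirrors the text above.
import numpy as np

def select_candidate_actions(query_vec, emotion, scene, corpus, action_lib, max_dist=0.5):
    # corpus: list of dicts with a feature vector "vec"; action_lib: dict of pre-stored actions
    dists = [np.linalg.norm(query_vec - m["vec"]) for m in corpus]
    kept = [m for m, d in zip(corpus, dists) if d <= max_dist]
    if not kept:
        # all first matching results fall outside the preset similarity range:
        # fall back to the pre-stored action matching the recognised emotion value
        return [action_lib["by_emotion"][emotion]]
    # otherwise pre-select second pre-stored skeletal key point data by dialogue scene
    return action_lib["by_scene"][scene]
```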
In some embodiments, determining second bone keypoint data from at least one second pre-stored bone keypoint data based on the dialog scene and the first bone keypoint data in the dialog recognition result comprises:
(1) According to the dialogue scene, the first weight and each second pre-stored bone key point data in the dialogue identification result, calculating the first similarity of each second pre-stored bone key point data;
Calculating the similarity of the dialogue scene and the second pre-stored bone key point data, and adding corresponding weight; it is understood that the specific weights may be set based on empirical values.
(2) Calculating a second similarity of each second pre-stored bone key point data according to the first bone key point data, the second weight and each second pre-stored bone key point data;
Calculating the similarity between the first bone key point data and the second pre-stored bone key point data, and adding corresponding weight; it is understood that the specific weights may be set based on empirical values.
(3) Determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
After the similarity of the dialogue scene and the second pre-stored bone key point data and the similarity of the first bone key point data and the second pre-stored bone key point data are calculated, further weighting calculation is carried out based on the weights of the two data, and a final similarity result is obtained.
(4) And determining the second pre-stored bone key point data which meet the preset condition according to the third similarity as second bone key point data.
The second skeletal key point data are determined using the final similarity result; the preset condition may be a similarity threshold, or the candidate with the best similarity may be selected directly.
The following further describes the technical scheme of the present application:
the general flow chart of the scheme is shown in fig. 2 a:
The main technical steps are as follows:
The user terminal:
step one, acquiring dialogue content of a user and NPC and identifying dialogue scenes according to semantic analysis;
step two, acquiring video stream data of a user, and identifying skeleton key points of human body actions through skeleton driving of the scheme;
thirdly, carrying out quantization weighting calculation on the results of the first step and the second step, and subsequently using the calculation result for NPC action matching;
NPC end:
Step one, constructing an NPC database (comprising an NPC human body action library and an NPC corpus);
Step two, performing action matching on the user terminal identification result and the NPC action library;
step three, obtaining a matched result to perform action driving on the NPC;
The detailed steps are set forth below:
User terminal
Step one, semantic analysis of user dialog content
The flow of the semantic analysis module is mainly as shown in fig. 2 b:
(1) Firstly, establishing a corpus by collected batch dialogue scene data (text and video);
(2) Performing data preprocessing on the training text data, including word segmentation, stop-word removal and the like;
(3) Performing feature engineering processing on the preprocessed data, for example with TF-IDF, Word2Vec and BERT models;
(4) Training models (emotion classification model, word2Vec feature vector model);
(5) Inputting a user dialogue text, performing the processing of the steps (2) - (3), and inputting the processed result into a trained model;
(6) Outputting a result, wherein the result comprises user emotion analysis (the scheme is mainly divided into 3 types, namely positive, negative and neutral), and n groups of data (text and video) which are most similar in a matched dialogue scene library;
This module may adopt the prior art; in this step the text corpus is encoded into feature vectors so that the user's dialogue content can be matched against the text corpus. Some of the techniques are described below, taking Word2Vec as an example (a brief library-based sketch follows the training steps):
The Word2Vec model contains two methods for training word vectors: continuous bag-of-words (CBOW) and continuous skip-gram. The basic idea of CBOW is to predict the centre word from its context words, whereas skip-gram does the opposite and predicts the context words from the centre word. As shown in fig. 2c:
The Word2Vec training procedure is as follows:
(1) CBOW:
Step 1. First, one-hot encode the context words of the centre word. The length of the resulting vector is the vocabulary size V (the total number of distinct words in the training corpus); each vector has shape 1×V, with a 1 at the position of the word and 0 elsewhere.
Step 2. Multiply the one-hot code of each context word by a V×N weight matrix W (every word is multiplied by the same weight matrix W), obtaining several 1×N vectors.
Step 3. Integrate these 1×N vectors into a single 1×N vector by element-wise addition and averaging.
Step 4. Multiply this 1×N vector by the N×V matrix W′ corresponding to the centre word to obtain a 1×V vector.
Step 5. Normalize this 1×V vector with a softmax layer to obtain the prediction vector of the centre word; it is not one-hot encoded but consists of many floating-point probabilities. The position with the highest probability corresponds to the position where the centre word's one-hot code is 1.
Step 6. Compute the error between the prediction vector of the centre word and the label vector (its one-hot vector), usually using cross-entropy.
Step 7. Back-propagate the error to the neurons; after each forward calculation the error is propagated backwards to adjust the weight matrices W and W′, similar to a BP neural network.
When the loss reaches its optimum, training ends, the required weight matrix is obtained, and word vectors can be produced from the input one-hot vectors through this weight matrix.
(2) Skip-gram:
Step 1. First, one-hot encode the centre word.
Step 2. Multiply the one-hot code of the centre word by a V×N weight matrix W.
Step 3. Multiply the resulting 1×N vector by the N×V matrix W′ of the context word vectors (a shared matrix), obtaining several 1×V vectors.
Step 4. Normalize these 1×V vectors with a softmax layer to obtain the prediction vectors of the centre word's context words.
Step 5. Compute the error between each context-word prediction vector and its label vector (the one-hot vector) using cross-entropy, and sum the resulting cross-entropy losses.
Step 6. Back-propagate the error to the neurons; after each forward calculation the error is propagated backwards to adjust the weight matrices W and W′.
Step 7. The weight matrix that finally forms the word vectors is W.
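For reference, CBOW and skip-gram word vectors as described above can be trained with the gensim library; the toy corpus and parameters below are placeholder assumptions rather than the corpus of this application.

```python
# Placeholder corpus and parameters; gensim's Word2Vec supports both training modes.
from gensim.models import Word2Vec

sentences = [["hello", "how", "are", "you"], ["nice", "to", "meet", "you"]]
cbow = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=100, window=2, min_count=1, sg=1)  # skip-gram
vec = cbow.wv["hello"]   # 100-dimensional word vector for "hello"
```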
Step two, acquiring video stream data and outputting human skeleton key points via the skeleton-driving module
This step mainly describes the skeleton-driving module. The module provides an innovative key point filtering method with an added adaptive residual mechanism, which makes the human action driving more stable. The technical implementation flow is shown in fig. 2d, and the correspondingly obtained human 2D key points are shown in fig. 2e. The main implementation steps are as follows:
Step 1. Read the video stream and divide it into image frame sequences
Step 2. Image preprocessing
The human body region is obtained by localization, the detection box is expanded outwards by a factor of 1.2 about the center point of the region, the image is cropped at a 16:9 aspect ratio, and the cropped image is resized to the input size specified by the model for training and prediction.
An example of the image preprocessing is shown in the lower part of fig. 2f. The detection box of the human body region obtained by localization is expanded outwards by a factor of 1.2 about its center point (shown as the large dashed box), and the region is then cropped at a 16:9 aspect ratio and fed into the model for training. If the human detection box extends beyond the image, as shown in the upper-right corner of fig. 2f, the box is still expanded to the 16:9 ratio and the region outside the image is zero-padded. The 16:9 cropping is applied to the human detection box mainly because the input size of the skeleton key point model is 16:9; this ensures that the cropped image is not deformed when resized to the size specified by the model, and that the body proportions remain consistent with the original video.
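A minimal sketch of this preprocessing, assuming the detected person box is given as (x, y, w, h) in pixels and OpenCV is available; the 256×144 input size, rounding and clipping details are illustrative assumptions.

```python
# Illustrative preprocessing sketch; the input size, rounding and clipping are assumptions.
import cv2
import numpy as np

def crop_person(img, box, out_w=256, out_h=144):      # 16:9 model input size (assumed)
    x, y, w, h = box
    cx, cy = x + w / 2, y + h / 2
    w, h = w * 1.2, h * 1.2                            # expand the box 1.2x about its centre
    if w / h < 16 / 9:                                 # widen or heighten to a 16:9 ratio
        w = h * 16 / 9
    else:
        h = w * 9 / 16
    x0, y0 = int(cx - w / 2), int(cy - h / 2)
    canvas = np.zeros((int(h), int(w), 3), img.dtype)  # zero-pad the part outside the frame
    xs, ys = max(0, x0), max(0, y0)
    xe, ye = min(img.shape[1], x0 + int(w)), min(img.shape[0], y0 + int(h))
    canvas[ys - y0:ye - y0, xs - x0:xe - x0] = img[ys:ye, xs:xe]
    return cv2.resize(canvas, (out_w, out_h))          # keeps the body proportions intact
```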
Step 3. The human pose estimation network predicts human 2D key points
In this scheme, the human pose estimation network mainly adopts the MobileNetV2 network structure (a lightweight convolutional neural network using depth-wise separable convolutions). The flow chart of the human 2D key point model is shown in fig. 2g, and the MobileNetV2 backbone structure is shown in figs. 2h and 2i:
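A hedged sketch of a MobileNetV2-based 2D key point network using torchvision is shown below; the deconvolution head and heatmap output are assumptions for illustration and not the exact network of figs. 2g–2i.

```python
# Sketch only: the deconvolution head and output resolution are assumptions.
import torch
import torch.nn as nn
from torchvision.models import mobilenet_v2

class PoseNet(nn.Module):
    def __init__(self, num_kpts=14):
        super().__init__()
        self.backbone = mobilenet_v2().features            # lightweight, depthwise-separable convs
        self.head = nn.Sequential(
            nn.ConvTranspose2d(1280, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(256, 256, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(256, num_kpts, 1))                    # one heatmap per key point

    def forward(self, x):                                   # x: (B, 3, 144, 256)
        return self.head(self.backbone(x))
```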
Step 4. Key point filtering preprocessing
The human 2D key points predicted by the human pose estimation network are obtained from Step 3. Key points output directly by a deep learning model tend to jitter, so the model output needs further filtering preprocessing. The human 2D key point filtering preprocessing of this scheme mainly comprises the following steps:
(1) Setting the number N of image frames to be filtered, and sequentially inputting a human body posture estimation network model to obtain predicted human body 2D key point data;
(2) Performing subtraction processing on adjacent frames on the human body 2D key point data corresponding to the N groups of image frames to obtain N-1 key point differential data; this step is shown in fig. 2j and 2 k.
(3) Input the data processed in step (2) into the human key point filtering model proposed by this scheme for training and prediction; the model structure is shown in fig. 2l. The inputs of the filtering model are N frames of 2D key point data and N-1 frames of adaptively stabilized time-sequence differential data, and the model outputs the N frames of filtered 2D key point data. By taking the N frames of 2D key points together with the N-1 frames of adaptively stabilized differential data as input, the filtering model predicts the filtered N frames of 2D key points and effectively prevents key point jumping and jitter. N frames of 2D key point data: each frame contains 14 skeletal key points, and the coordinates (x, y) of each key point give its position in that image frame; the right branch of the network receives the N consecutive frames at once, with dimension 14 × 2 × N. N-1 frames of adaptively preprocessed time-sequence differential data: to make the prediction of each frame's key points more stable, this scheme calculates the mean rate of change of the key point positions over the N image frames, calculates the rate of change α_{i,n} of the key points between two adjacent frames, obtains the adaptively stabilized time-sequence differential data through a conversion formula resd_{i,n} on the key point rate of change, and finally feeds the computed adaptively stabilized time-sequence differential data into the left branch of the network, with data dimension 14 × 2 × (N-1).
The adaptive stable time-sequence differential data are calculated as follows:
Difference of key points between two adjacent frames: d_{i,n} = p_{i,n} − p_{i,n-1}
Rate of change of key points between two adjacent frames: α_{i,n}
Mean rate of change of the key point positions over the N image frames: the average of α_{i,n} over the adjacent-frame pairs
Adaptive stable time-sequence differential data: resd_{i,n}, obtained from d_{i,n} through a conversion based on α_{i,n} and its mean
During training, the network optimizes the key point error; the error term L_P is defined on the filtered predictions k_{i,n} against the ground-truth coordinates g_{i,n}, and the total loss function is Loss = L_P.
where N is the number of sequence frames (chosen as 8 here), i is the key point index, p_{i,n} and p_{i,n-1} are the coordinates of the ith key point in the nth and (n-1)th frames predicted by the human pose network of fig. 2e, k_{i,n} is the coordinate of the ith key point of the nth frame predicted by the filtering network, and g_{i,n} is the real coordinate of the ith key point of the nth frame. The adaptively stabilized time-sequence differential data designed in this scheme effectively stabilize the coordinate data of the 14 human skeletal key points; when a key point jumps sharply, the residual value is reduced adaptively, narrowing the jump range of the prediction result.
The generation of training data is divided into two parts: one part comes from a 2D human pose network such as OpenPose, and the other part is generated by adding random noise perturbation to the ground-truth key points. The perturbation is as follows:
sp_{i,n} = g_{i,n} · rand(0.8, 1.2)
where sp_{i,n} denotes the generated noisy input and rand(0.8, 1.2) denotes a random number in the [0.8, 1.2] interval. Mixing the two kinds of data makes the training data more diverse and can improve the generalization ability of the model.
When the frame count F < N, no processing is performed; when F ≥ N, the N frames of data in [F−N, F] are fed into the network to obtain the filtered result.
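The sliding-window use of the filter at inference time, together with the random-noise perturbation used to generate part of the training data, might look like the following sketch; it assumes the hypothetical KeypointFilterNet and adaptive_stable_diff helpers from the earlier sketches.

```python
# Assumes the hypothetical KeypointFilterNet and adaptive_stable_diff sketches above.
import numpy as np
import torch

N = 8                                        # number of sequence frames used by the filter
buffer = []                                  # most recent per-frame (14, 2) key-point arrays

def filter_latest(frame_kpts, model):
    buffer.append(frame_kpts)
    if len(buffer) < N:                      # frame count F < N: return the raw key points
        return frame_kpts
    del buffer[:-N]                          # keep only the frames in [F-N, F]
    window = np.stack(buffer).astype(np.float32)
    diffs = adaptive_stable_diff(window).astype(np.float32)
    with torch.no_grad():
        out = model(torch.from_numpy(window.reshape(1, -1)),
                    torch.from_numpy(diffs.reshape(1, -1)))
    return out.numpy().reshape(N, 14, 2)[-1]  # filtered key points of the latest frame

def noisy_training_sample(gt_kpts):
    # sp_{i,n} = g_{i,n} * rand(0.8, 1.2): perturb ground truth to diversify training data
    return gt_kpts * np.random.uniform(0.8, 1.2, size=gt_kpts.shape)
```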
Step three, NPC action matching
Steps one and two obtain, after the user semantic analysis module, data such as the user's emotion analysis, the dialogue scene matching results and the human-action skeleton key points. To effectively match the corresponding NPC limb action, step three mainly matches the optimal NPC limb action.
The NPC human motion matching flow chart is shown in fig. 2 m:
(1) The semantic analysis module of step one processes the user's dialogue content and matches it against the NPC corpus, outputting n groups of similar text matching results;
(2) A similarity threshold is set and used to filter the matching results of (1);
(3) If none of the filtered results falls within the threshold range, the user's emotion value (positive, negative or neutral) output by the emotion analysis is used; the basic action library within the NPC action library carries emotion labels created in advance, so the user's emotion can be matched and the NPC driven to perform the corresponding action.
(4) Otherwise, it is judged whether more than one group of filtered matching results remains. If so, the similarity between the user's skeleton key points and the skeleton key point vectors of the dialogue scene is calculated as S1, the text similarity is denoted S2, and a weighted value S0 = (1 − p)·S1 + p·S2 is calculated, where p is an empirical value set to 0.6 in this scheme; the results are sorted by S0 in descending order and the best match is output. If only one group remains, it is output directly as the optimal match (a sketch of this weighted selection follows this list).
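A small sketch of the weighted selection in (4), with p = 0.6 as stated above; the similarity helpers and the candidate data structure are placeholders.

```python
# The cosine-style similarity helper is a placeholder; only the weighting
# S0 = (1 - p) * S1 + p * S2 with p = 0.6 comes from the text above.
import numpy as np

def cosine(a, b):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def best_npc_action(user_kpt_vec, user_text_vec, candidates, p=0.6):
    # candidates: list of dicts with "kpt_vec", "text_vec" and "action" fields (hypothetical)
    scored = []
    for c in candidates:
        s1 = cosine(user_kpt_vec, c["kpt_vec"])    # skeleton key-point similarity S1
        s2 = cosine(user_text_vec, c["text_vec"])  # dialogue text similarity S2
        scored.append(((1 - p) * s1 + p * s2, c["action"]))
    return max(scored, key=lambda t: t[0])[1]      # highest S0 wins
```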
NPC end
Step one, constructing an NPC database
The NPC database comprises an NPC human-action library and an NPC corpus. The database is built mainly from a large amount of movie dialogue scene data and short-video dialogue scene data: the collected data are batch-converted into dialogue text content feature vectors (the implementation steps are the same as step one on the user side), and the human action videos are converted into human skeleton driving data (the implementation steps are the same as step two on the user side). The basic action library within the NPC human-action library mainly contains common actions such as waving, bowing and blessing.
Step two, performing action matching on the user terminal identification result and the NPC action library
The matching mainly uses the Euclidean distance similarity calculation method, and the detailed steps are already described in the step three of the user side.
Step three, obtaining a matched result to drive the NPC to act
The NPC human motion driving mainly reuses the human skeleton-driving module of step two on the user side; the corresponding human skeleton motion data are output so that the NPC's limbs are driven accordingly.
Referring to fig. 3, an embodiment of the present application provides a display apparatus, including:
the acquiring module 301 is configured to acquire a session content of a user and a virtual object and video data in a session process of the user and the virtual object;
The identifying module 302 is configured to identify the dialogue content to obtain a dialogue identification result, and identify the video data to obtain first skeleton key point data of the user;
A determining module 303, configured to determine second bone key point data according to the dialogue identification result and the first bone key point data;
And the display module 304 is configured to control display of the virtual object according to the second skeletal key point data.
In some embodiments, the identification module is configured to:
Determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first skeleton key point data.
In some embodiments, the identification module is configured to:
determining N image frames according to the image frame sequence;
Sequentially inputting the N image frames into a pose estimation model to obtain N two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
In some embodiments, the identification module is configured to:
performing adjacent frame subtraction processing on the N two-dimensional key point data to obtain N-1 key point differential data;
Determining self-adaptive stable time sequence differential data according to the N two-dimensional key point data and the N-1 key point differential data;
Inputting the self-adaptive stable time sequence differential data and the N two-dimensional key point data into a filtering model to obtain the first skeleton key point data;
the network structure of the filtering model comprises a full-connection residual layer.
In some embodiments, the identification module is configured to:
determining the change rate of the key point data of two adjacent frames according to the N two-dimensional key point data and the N-1 key point difference data;
determining a mean value of the change rates of the key point data according to the change rates of the key point data of the two adjacent frames;
and determining the self-adaptive stable time sequence differential data according to the change rate of the key point data of the two adjacent frames, the average value of the change rate of the key point data and the N-1 key point differential data.
In some embodiments, the determining module is configured to:
determining a plurality of first matching results in a virtual object corpus in a virtual object database according to the dialogue identification result; the dialogue recognition result comprises emotion values and dialogue scenes, the virtual object database comprises a virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored skeleton key point data, and each pre-stored skeleton key point data respectively has a corresponding emotion value and dialogue scene;
Calculating the similarity between each first matching result and the dialogue recognition result;
Under the condition that the similarity between all the first matching results and the dialogue identification results exceeds a preset similarity range, determining first pre-stored bone key point data corresponding to emotion values in the dialogue identification results in the virtual object action library as the second bone key point data;
Under the condition that first matching results with similarity within a preset similarity range exist in the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to dialogue scenes in the dialogue identification results;
And determining second bone key point data in the at least one second pre-stored bone key point data according to the dialogue scene and the first bone key point data in the dialogue identification result.
In some embodiments, the determining module is configured to:
calculate a first similarity of each second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result, a first weight, and each second pre-stored bone key point data;
calculate a second similarity of each second pre-stored bone key point data according to the first bone key point data, a second weight, and each second pre-stored bone key point data;
determine a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
and determine the second pre-stored bone key point data whose third similarity meets a preset condition as the second bone key point data (see the weighted-similarity sketch below).
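A minimal sketch of the weighted refinement follows. The scene-equality measure, the exponential of the key point distance, the additive combination into the third similarity, and the weight and threshold values are assumptions; the description only fixes that a first and second similarity are computed with a first and second weight and combined into a third similarity checked against a preset condition.

```python
# Sketch of the weighted refinement: first similarity from the dialogue scene (first
# weight), second similarity from the first bone key point data (second weight),
# third similarity as their sum, and a preset threshold as the selection condition.
import numpy as np

def refine_by_scene_and_keypoints(scene: str, first_keypoints, candidates: list[dict],
                                  w1: float = 0.4, w2: float = 0.6,
                                  threshold: float = 0.5):
    best, best_s3 = None, -1.0
    for cand in candidates:
        s1 = w1 * (1.0 if cand["scene"] == scene else 0.0)              # first similarity
        dist = np.linalg.norm(np.asarray(cand["keypoints"], dtype=float)
                              - np.asarray(first_keypoints, dtype=float))
        s2 = w2 * float(np.exp(-dist))                                   # second similarity
        s3 = s1 + s2                                                     # third similarity
        if s3 >= threshold and s3 > best_s3:                             # preset condition
            best, best_s3 = cand["keypoints"], s3
    return best   # second bone key point data, or None if nothing meets the condition

# Usage with two candidate pre-stored actions and 17 key points
user_kp = np.random.rand(17, 2)
candidates = [{"scene": "greeting", "keypoints": user_kp + 0.01},
              {"scene": "farewell", "keypoints": np.random.rand(17, 2)}]
selected = refine_by_scene_and_keypoints("greeting", user_kp, candidates)
print(selected is not None)   # True: the near-match greeting action is selected
```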
The display device in this embodiment of the application may be an electronic device, for example an electronic device with an operating system, or a component in an electronic device, for example an integrated circuit or a chip. The electronic device may be a terminal or a device other than a terminal. By way of example, the terminal may include, but is not limited to, the types of the terminal 11 listed above, and the other devices may be a server, network attached storage (Network Attached Storage, NAS), and the like; the embodiments of the application do not limit this specifically.
The display device provided by the embodiments of the application can implement each process of the method embodiments of Figs. 1 to 2 and achieve the same technical effects; to avoid repetition, the details are not repeated here.
As shown in Fig. 4, an embodiment of the application further provides an apparatus 400, which includes a processor 401 and a memory 402, where the memory 402 stores a program or instructions executable on the processor 401. When executed by the processor 401, the program or instructions implement each step of the above method embodiments and achieve the same technical effects, which are not repeated here.
An embodiment of the application further provides a device, which includes a processor and a communication interface, where the communication interface is coupled with the processor, and the processor is configured to run programs or instructions to implement the steps of the above method embodiments. This device embodiment corresponds to the above method embodiments; each implementation process and implementation manner of the method embodiments can be applied to this device embodiment and achieve the same technical effects.
Specifically, an embodiment of the application further provides a device. As shown in Fig. 5, the device 500 includes: an antenna 51, a radio frequency device 52, a baseband device 53, a processor 54, and a memory 55. The antenna 51 is connected to the radio frequency device 52. In the uplink direction, the radio frequency device 52 receives information via the antenna 51 and sends the received information to the baseband device 53 for processing. In the downlink direction, the baseband device 53 processes the information to be transmitted and sends it to the radio frequency device 52, and the radio frequency device 52 processes the information received from the baseband device 53 and transmits it via the antenna 51.
The methods performed by the device in the above embodiments may be implemented in the baseband device 53, which comprises a baseband processor.
The baseband device 53 may, for example, include at least one baseband board on which a plurality of chips are disposed. As shown in Fig. 5, one of the chips, for example a baseband processor, is connected to the memory 55 through a bus interface so as to invoke a program in the memory 55 and perform the network device operations shown in the above method embodiments.
The device may also include a network interface 56, such as a common public radio interface (Common Public Radio Interface, CPRI).
Specifically, the device 500 of this embodiment of the application further includes instructions or a program stored in the memory 55 and executable on the processor 54. The processor 54 invokes the instructions or program in the memory 55 to perform the methods performed by the modules shown in Fig. 2 and achieve the same technical effects, which are not repeated here.
An embodiment of the application further provides a readable storage medium on which a program or instructions are stored. When executed by a processor, the program or instructions implement each process of the above method embodiments and can achieve the same technical effects; to avoid repetition, the details are not repeated here.
The processor is the processor in the terminal described in the above embodiments. The readable storage medium includes a computer readable storage medium, such as a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk. In some examples, the readable storage medium may be a non-transitory readable storage medium.
An embodiment of the application further provides a chip, which includes a processor and a communication interface, where the communication interface is coupled with the processor, and the processor is configured to run programs or instructions to implement each process of the above method embodiments and achieve the same technical effects; to avoid repetition, the details are not repeated here.
It should be understood that the chips referred to in the embodiments of the application may also be referred to as a system-on-a-chip, a system chip, a chip system, or the like.
An embodiment of the application further provides a computer program/program product stored in a storage medium. The computer program/program product is executed by at least one processor to implement each process of the above method embodiments and achieve the same technical effects, which are not repeated here.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, so that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element preceded by "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element. Furthermore, it should be noted that the scope of the methods and apparatus in the embodiments of the application is not limited to performing the functions in the order shown or discussed; depending on the functions involved, the functions may also be performed in a substantially simultaneous manner or in the reverse order. For example, the described methods may be performed in an order different from that described, and various steps may be added, omitted, or combined. In addition, features described with reference to certain examples may be combined in other examples.
From the description of the above embodiments, those skilled in the art will clearly understand that the example methods described above may be implemented by means of a computer software product plus a necessary general-purpose hardware platform, or by hardware. The computer software product is stored on a storage medium (such as a ROM, a RAM, a magnetic disk, or an optical disk) and includes instructions for causing a terminal or a network-side device to perform the methods according to the embodiments of the application.
The embodiments of the application have been described above with reference to the accompanying drawings, but the application is not limited to the above-described embodiments, which are merely illustrative and not restrictive. Inspired by the application, those of ordinary skill in the art may devise many other forms of embodiments without departing from the spirit of the application and the scope of the claims, all of which fall within the protection of the application.
Claims (11)
1. A display method, comprising:
acquiring dialogue content between a user and a virtual object, and video data of a dialogue process between the user and the virtual object;
recognizing the dialogue content to obtain a dialogue recognition result, and recognizing the video data to obtain first bone key point data of the user;
determining second bone key point data according to the dialogue recognition result and the first bone key point data;
and controlling display of the virtual object according to the second bone key point data.
2. The method of claim 1, wherein the recognizing the video data to obtain the first bone key point data comprises:
determining an image frame sequence according to the video data;
performing pose estimation on the image frame sequence to obtain two-dimensional key point data;
and filtering the two-dimensional key point data to obtain the first bone key point data.
3. The method of claim 2, wherein the performing pose estimation on the image frame sequence to obtain the two-dimensional key point data comprises:
determining N image frames according to the image frame sequence;
sequentially inputting the N image frames into a pose estimation model to obtain N sets of two-dimensional key point data, wherein the pose estimation model is a convolutional neural network model;
wherein N is a positive integer greater than 1.
4. The method of claim 3, wherein the filtering the two-dimensional key point data to obtain the first bone key point data comprises:
performing adjacent-frame subtraction on the N sets of two-dimensional key point data to obtain N-1 sets of key point differential data;
determining adaptive stable time-series differential data according to the N sets of two-dimensional key point data and the N-1 sets of key point differential data;
and inputting the adaptive stable time-series differential data and the N sets of two-dimensional key point data into a filtering model to obtain the first bone key point data;
wherein a network structure of the filtering model comprises a fully connected residual layer.
5. The method of claim 4, wherein the determining the adaptive stable time-series differential data according to the N sets of two-dimensional key point data and the N-1 sets of key point differential data comprises:
determining a change rate of the key point data between two adjacent frames according to the N sets of two-dimensional key point data and the N-1 sets of key point differential data;
determining a mean value of the key point data change rates according to the change rates between adjacent frames;
and determining the adaptive stable time-series differential data according to the change rates between adjacent frames, the mean value of the change rates, and the N-1 sets of key point differential data.
6. The method of claim 1, wherein the determining the second bone key point data according to the dialogue recognition result and the first bone key point data comprises:
determining a plurality of first matching results in a virtual object corpus of a virtual object database according to the dialogue recognition result; wherein the dialogue recognition result comprises an emotion value and a dialogue scene, the virtual object database comprises the virtual object corpus and a virtual object action library, the virtual object corpus comprises a plurality of pre-stored text contents, the virtual object action library comprises a plurality of pre-stored bone key point data, and each pre-stored bone key point data has a corresponding emotion value and dialogue scene;
calculating a similarity between each first matching result and the dialogue recognition result;
in a case that the similarities between all the first matching results and the dialogue recognition result are outside a preset similarity range, determining, in the virtual object action library, first pre-stored bone key point data corresponding to the emotion value in the dialogue recognition result as the second bone key point data;
in a case that a first matching result whose similarity is within the preset similarity range exists among the plurality of first matching results, determining at least one second pre-stored bone key point data in the virtual object action library according to the dialogue scene in the dialogue recognition result; and determining the second bone key point data from the at least one second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result and the first bone key point data.
7. The method of claim 6, wherein the determining the second bone key point data from the at least one second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result and the first bone key point data comprises:
calculating a first similarity of each second pre-stored bone key point data according to the dialogue scene in the dialogue recognition result, a first weight, and each second pre-stored bone key point data;
calculating a second similarity of each second pre-stored bone key point data according to the first bone key point data, a second weight, and each second pre-stored bone key point data;
determining a third similarity of each second pre-stored bone key point data according to the first similarity and the second similarity;
and determining second pre-stored bone key point data whose third similarity meets a preset condition as the second bone key point data.
8. A display device, comprising:
an acquisition module, configured to acquire dialogue content between a user and a virtual object, and video data of a dialogue process between the user and the virtual object;
a recognition module, configured to recognize the dialogue content to obtain a dialogue recognition result, and recognize the video data to obtain first bone key point data of the user;
a determining module, configured to determine second bone key point data according to the dialogue recognition result and the first bone key point data;
and a display module, configured to control display of the virtual object according to the second bone key point data.
9. An apparatus, comprising a processor and a memory storing a program or instructions executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the display method of any one of claims 1 to 7.
10. A readable storage medium, characterized in that the readable storage medium has stored thereon a program or instructions which, when executed by a processor, implement the steps of the display method according to any of claims 1 to 7.
11. A computer program product comprising computer instructions which, when executed by a processor, implement the steps of the display method of any one of claims 1 to 7.
Priority Applications (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410404201.1A (CN118314255A) | 2024-04-03 | 2024-04-03 | Display method, apparatus, device, readable storage medium, and computer program product |

Applications Claiming Priority (1)

| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410404201.1A (CN118314255A) | 2024-04-03 | 2024-04-03 | Display method, apparatus, device, readable storage medium, and computer program product |

Publications (1)

| Publication Number | Publication Date |
|---|---|
| CN118314255A | 2024-07-09 |

Family

ID=91731310

Family Applications (1)

| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410404201.1A (CN118314255A, Pending) | Display method, apparatus, device, readable storage medium, and computer program product | 2024-04-03 | 2024-04-03 |

Country Status (1)

| Country | Link |
|---|---|
| CN (1) | CN118314255A (en) |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |