WO2024055194A1 - 虚拟对象生成方法、编解码器训练方法及其装置 - Google Patents

虚拟对象生成方法、编解码器训练方法及其装置 Download PDF

Info

Publication number
WO2024055194A1
Authority
WO
WIPO (PCT)
Prior art keywords
feature vector
action
target
training
feature
Prior art date
Application number
PCT/CN2022/118712
Other languages
English (en)
French (fr)
Inventor
徐磊
Original Assignee
维沃移动通信有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 维沃移动通信有限公司 filed Critical 维沃移动通信有限公司
Priority to PCT/CN2022/118712 priority Critical patent/WO2024055194A1/zh
Publication of WO2024055194A1 publication Critical patent/WO2024055194A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T19/00 Manipulating 3D models or images for computer graphics

Definitions

  • the present application belongs to the field of virtual reality technology, and specifically relates to a virtual object generation method, a codec training method and a device thereof.
  • the user's action posture is captured through the camera of the electronic device, and the action posture is estimated and analyzed; a virtual object is then generated based on the captured action posture, so that the user is represented by the virtual object in the virtual scene.
  • However, in the above process, a virtual object related to the user's own posture cannot always be generated based on the captured action posture, for example when only part of the user's body is captured.
  • the purpose of the embodiments of the present application is to provide a virtual object generation method, a codec training method and a device thereof, which can solve the problem of being unable to generate virtual objects related to the user's own posture.
  • In a first aspect, embodiments of the present application provide a virtual object generation method, which includes: extracting the action posture of a first human body feature corresponding to a target user to obtain a first action posture; determining a first feature vector and a second feature vector corresponding to the first action posture, where the first feature vector is determined based on the first action posture and the second feature vector is determined based on the first feature vector; decoding the first feature vector and the second feature vector to obtain a second action posture, where the second action posture is used to represent a second human body feature corresponding to the target user; and generating a virtual object based on the second action posture.
  • In a second aspect, embodiments of the present application provide a codec training method, which is applied to the method described in the first aspect. The codec training method includes: inputting training data to an encoder to be trained to generate a target feature vector pair, where the training data includes at least one third action posture; inputting the target feature vector pair to a decoder to be trained to generate a fourth action posture; and iteratively training the encoder to be trained and the decoder to be trained based on the third action posture and the fourth action posture to obtain a target encoder and a target decoder.
  • In a third aspect, embodiments of the present application provide a virtual object generation device, which includes:
  • the extraction module is used to extract the action posture of the first human body feature corresponding to the target user to obtain the first action posture
  • a determining module, used to determine the first feature vector and the second feature vector corresponding to the first action posture, where the first feature vector is determined based on the first action posture, and the second feature vector is determined based on the first feature vector;
  • a processing module configured to decode the first feature vector and the second feature vector to obtain a second motion posture, wherein the second motion posture is used to characterize a second human body feature corresponding to the target user;
  • a generating module configured to generate a virtual object based on the second action gesture.
  • In a fourth aspect, embodiments of the present application provide a codec training device, which is applied to the device described in the third aspect.
  • the codec training device includes:
  • a first generation module configured to input training data to the encoder to be trained and generate a target feature vector pair, where the training data includes at least one third action gesture;
  • a second generation module configured to input the target feature vector pair to the decoder to be trained and generate a fourth action gesture
  • a training module configured to iteratively train the encoder to be trained and the decoder to be trained based on the third action posture and the fourth action posture to obtain a target encoder and a target decoder.
  • In a fifth aspect, embodiments of the present application provide an electronic device.
  • the electronic device includes a processor, a memory, and a program or instructions stored on the memory and executable on the processor.
  • when the program or instructions are executed by the processor, the steps of the method described in the first aspect or the steps of the method described in the second aspect are implemented.
  • In a sixth aspect, embodiments of the present application provide a readable storage medium, which stores programs or instructions; when the programs or instructions are executed by a processor, the steps of the method described in the first aspect or the steps of the method described in the second aspect are implemented.
  • embodiments of the present application provide a chip.
  • the chip includes a processor and a communication interface.
  • the communication interface is coupled to the processor.
  • the processor is used to run programs or instructions to implement the steps of the method described in the first aspect, or to implement the steps of the method described in the second aspect.
  • embodiments of the present application provide a computer program product.
  • the program product is stored in a storage medium.
  • the program product is executed by at least one processor to implement the steps of the method described in the first aspect, or to implement the steps of the method described in the second aspect.
  • In the embodiments of the present application, the action posture corresponding to the first human body feature of the target user is extracted to obtain a first action posture; a first feature vector and a second feature vector corresponding to the first action posture are determined; the first feature vector and the second feature vector are decoded to obtain a second action posture, which is used to represent the second human body feature corresponding to the target user; and a virtual object is generated based on the second action posture.
  • In this way, even when only the action posture corresponding to the first human body feature of the target user is extracted, that is, when the amount of captured action posture data is small, a virtual object corresponding to the target user can still be generated from the first action posture, thereby generating a virtual object related to the user's own posture.
  • Figure 1 is a flow chart of a virtual object generation method provided by an embodiment of the present application.
  • Figure 2 is one of the application scenario diagrams of the virtual object generation method provided by the embodiment of the present application.
  • Figure 3 is the second application scenario diagram of the virtual object generation method provided by the embodiment of the present application.
  • Figure 4 is the third application scenario diagram of the virtual object generation method provided by the embodiment of the present application.
  • Figure 5 is a flow chart of the codec training method provided by the embodiment of the present application.
  • Figure 6 is one of the application scenario diagrams of the codec training method provided by the embodiment of the present application.
  • Figure 7 is the second application scenario diagram of the codec training method provided by the embodiment of the present application.
  • Figure 8 is a structural diagram of a virtual object generation device provided by an embodiment of the present application.
  • Figure 9 is a structural diagram of a codec training device provided by an embodiment of the present application.
  • Figure 10 is a structural diagram of an electronic device provided by an embodiment of the present application.
  • Figure 11 is a hardware structure diagram of an electronic device provided by an embodiment of the present application.
  • The terms "first", "second", etc. in the description and claims of this application are used to distinguish similar objects and are not used to describe a specific order or sequence. It should be understood that data used in this way are interchangeable under appropriate circumstances, so that the embodiments of the present application can be practiced in orders other than those illustrated or described herein. Objects distinguished by "first", "second", etc. are usually of one type, and the number of objects is not limited; for example, the first object can be one or multiple.
  • In addition, "and/or" in the description and claims indicates at least one of the connected objects, and the character "/" generally indicates an "or" relationship between the associated objects.
  • Embodiments of the present application provide a virtual object generation method. The virtual scenes to which the virtual object generation method provided by the embodiments of the present application can be applied include virtual conferences, virtual anchors, and other scenes.
  • For clarity, the virtual object generation method is described below as applied to the virtual meeting scenario as an example.
  • FIG. 1 is a flow chart of a virtual object generation method provided by an embodiment of the present application.
  • the virtual object generation method provided by the embodiment of this application includes the following steps:
  • the Human Pose Estimation (HPE) algorithm can be used to process the target image, extract the action posture of the first human body feature in the target image, and obtain the first action posture.
  • the above-mentioned first human body characteristics are partial human body characteristics corresponding to the target user
  • the above-mentioned first action posture is data describing the positions of specific joints when the target user performs a specific action.
  • the target user holds an electronic device and obtains a target image through the camera of the electronic device.
  • the target image in Figure 2 includes the first human body feature, namely the right side of the target user's body; the HPE algorithm is then used to extract the action posture of this first human body feature from the target image to obtain the first action posture.
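  • As an illustration of this step, the sketch below shows how a partial-body action posture might be extracted from a captured frame. It is only a sketch under stated assumptions: `hpe_model` is a hypothetical pose-estimation callable (any HPE network returning per-joint coordinates with confidences), not something defined by this document.

```python
import numpy as np

def extract_first_action_posture(image: np.ndarray, hpe_model, conf_thresh: float = 0.3):
    """Return the joint data of the visible (partial) body as the first action posture."""
    joints = hpe_model(image)               # assumed shape: (num_joints, 4) -> x, y, z, confidence
    visible = joints[:, 3] >= conf_thresh   # keep only joints actually captured by the camera
    return {
        "joint_ids": np.flatnonzero(visible),  # which joints were observed (e.g. right-side body)
        "positions": joints[visible, :3],      # their coordinates
    }
```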
  • the first feature vector and the second feature vector corresponding to the first action posture are determined, wherein the first feature vector is determined based on the first action posture, and the second feature vector is determined based on the first feature vector. For the specific technical solution of how these two feature vectors are determined, refer to subsequent embodiments.
  • S103 Decode the first feature vector and the second feature vector to obtain a second action gesture.
  • after the first feature vector and the second feature vector are obtained, they are decoded to obtain a second action posture, where the above-mentioned second action posture is used to represent all human body characteristics of the target user. For the specific technical solution of how to decode the first feature vector and the second feature vector to obtain the second action posture, refer to subsequent embodiments.
  • An optional implementation is to use a rendering engine to render the second action posture, thereby generating the virtual object in the virtual scene.
  • In the embodiments of the present application, the action posture corresponding to the first human body feature of the target user is extracted to obtain a first action posture; a first feature vector and a second feature vector corresponding to the first action posture are determined; the first feature vector and the second feature vector are decoded to obtain a second action posture, which is used to represent the second human body feature corresponding to the target user; and a virtual object is generated based on the second action posture.
  • In this way, even when only the action posture corresponding to the first human body feature of the target user is extracted, that is, when the amount of captured action posture data is small, a virtual object corresponding to the target user can still be generated from the first action posture, thereby generating a virtual object related to the user's own posture.
  • determining the first feature vector and the second feature vector corresponding to the first action gesture includes:
  • the first action gesture is encoded by a target encoder to obtain the first feature vector
  • a second feature vector is determined based on the feature vector database and the first feature vector.
  • the device that applies the virtual object generation method is preset with a feature vector database.
  • the feature vector database includes at least one feature vector pair, and each feature vector pair consists of two feature vectors.
  • Optionally, a certain number of human body images can be manually selected, the HPE algorithm is used to determine the action postures in these human body images, the action postures are encoded to obtain feature vector pairs, and the feature vector pairs are stored in a database.
  • the above-mentioned database storing feature vector pairs is also called a feature vector database.
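  • A minimal sketch of this offline database-construction step is shown below, assuming a pose-extraction callable `hpe_model` and a trained `target_encoder` that maps a full-body action posture to a pair of feature vectors; both names and interfaces are assumptions for illustration only.

```python
import numpy as np

def build_feature_vector_database(human_body_images, hpe_model, target_encoder):
    """Build a feature vector database: a list of (vector_a, vector_b) pairs."""
    database = []
    for image in human_body_images:
        action_posture = hpe_model(image)            # full-body action posture from the image
        vector_a, vector_b = target_encoder(action_posture)
        database.append((np.asarray(vector_a), np.asarray(vector_b)))
    return database
```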
  • the first action gesture is used as the input of the target encoder to obtain the first feature vector; after obtaining the first feature vector, the feature vector database is used to perform a query operation on the first feature vector to determine the second feature vector.
  • For the specific technical solution of how the feature vector database is used to perform a query operation on the first feature vector and determine the second feature vector, refer to subsequent embodiments.
  • the above target encoder can be an encoder trained using a generative adversarial network (Generative Adversarial Network, GAN), an encoder trained using a convolutional neural network (Convolutional Neural Networks, CNN), or an encoder trained using another neural network, which is not specifically limited here.
  • the target image includes the human body features of the right side of the target user.
  • the target image is encoded using the target encoder to obtain the first feature vector; further, the preset feature vector database is used to determine the second feature vector based on the first feature vector.
  • In this embodiment, the target encoder is used to encode the first action posture to obtain the first feature vector, and the second feature vector is determined according to the feature vector database and the first feature vector. In subsequent steps, the second action posture representing all human body features of the target user is determined based on the first feature vector and the second feature vector, so that a complete virtual object can be generated.
  • Optionally, determining the second feature vector according to the feature vector database and the first feature vector includes: determining, according to the feature vector database, a third feature vector associated with the first feature vector; determining, according to the feature vector database, a first feature vector pair associated with the third feature vector; and determining the feature vector in the first feature vector pair other than the third feature vector as the second feature vector.
  • In this embodiment, the first feature vector is queried in the feature vector database, and the feature vector in the feature vector database with the smallest vector distance to the first feature vector is determined as the third feature vector.
  • Optionally, the vector distance between the first feature vector and each feature vector in the feature vector database can be calculated using an L1-norm algorithm, an L2-norm algorithm, or another method.
  • As described above, the feature vector database includes at least one feature vector pair, and each feature vector pair consists of two feature vectors. Therefore, after the third feature vector is determined, the first feature vector pair associated with the third feature vector is looked up in the feature vector database, and the feature vector in the first feature vector pair other than the third feature vector is determined as the second feature vector.
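  • The query just described could look roughly like the following sketch (assumptions: the database is a list of vector pairs as built above, and an L1 or L2 norm is used as the vector distance):

```python
import numpy as np

def query_second_feature_vector(first_vec, database, norm: str = "l2"):
    """Find the database vector closest to `first_vec` (the third feature vector)
    and return the other member of its pair as the second feature vector."""
    ord_ = 1 if norm == "l1" else 2
    best_dist, second_vec = float("inf"), None
    for vec_a, vec_b in database:
        for candidate, partner in ((vec_a, vec_b), (vec_b, vec_a)):
            dist = np.linalg.norm(first_vec - candidate, ord=ord_)
            if dist < best_dist:
                best_dist, second_vec = dist, partner
    return second_vec
```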
  • the decoding of the first feature vector and the second feature vector includes: combining the first feature vector and the second feature vector into a second feature vector pair; and decoding the second feature vector pair through a target decoder.
  • the above target decoder may be a decoder trained using a generative adversarial network, a decoder trained using a convolutional neural network, or a decoder trained using other neural networks, which are not specifically limited here.
  • the first feature vector and the second feature vector are combined into a second feature vector pair.
  • the above-mentioned second feature vector pair is used as an input of a target decoder, and the target decoder is used to decode the second feature vector pair.
  • For ease of understanding, referring to Figure 4, the second feature vector pair composed of the first feature vector and the second feature vector is used as the input of the target decoder to obtain the second action posture; further, a rendering engine is used to render the second action posture, generating the virtual object shown in Figure 4.
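  • Putting the pieces together, a hedged end-to-end sketch of this decoding and rendering step might look as follows; `target_decoder` and `render` stand for the trained decoder and the rendering engine, whose concrete interfaces are not prescribed by this document.

```python
def generate_virtual_object(first_vec, second_vec, target_decoder, render):
    feature_vector_pair = (first_vec, second_vec)                 # the second feature vector pair
    second_action_posture = target_decoder(feature_vector_pair)   # full-body action posture
    return render(second_action_posture)                          # virtual object in the virtual scene
```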
  • Optionally, extracting the action posture of the first human body feature corresponding to the target user to obtain the first action posture includes: acquiring a target image; and performing action posture extraction on the first human body feature to obtain the first action posture.
  • the above-mentioned target image includes the first human body feature corresponding to the target user.
  • the above-mentioned first human body characteristics are partial human body characteristics corresponding to the target user.
  • the target user can hold the electronic device.
  • the target image captured by the camera of the electronic device is obtained.
  • In another implementation scenario, the target user may not hold the electronic device; instead, the electronic device is fixed in place and used to take the pictures.
  • the target image may also be obtained through the camera of the electronic device.
  • In the above implementation scenarios, if only part of the target user's body appears in the picture captured by the camera, the acquired target image includes only the first human body feature corresponding to the target user.
  • action gestures are extracted from the first human body features included in the target image to obtain the first action gestures.
  • the specific method of extracting action postures is consistent with the above-mentioned method of extracting action postures, and will not be repeated here.
  • An embodiment of the present application provides a codec training method.
  • the codec training method is applied to the above virtual object generation method. Please refer to Figure 5.
  • Figure 5 is a flow chart of the codec training method provided by an embodiment of the present application.
  • the codec training method provided by the embodiment of this application includes the following steps:
  • S501 Input training data to the encoder to be trained to generate a target feature vector pair.
  • the above-mentioned training data includes at least one third action posture.
  • the above-mentioned training data may be arm action data of the target user.
  • the training data can be input to the encoder to be trained, and the encoder can be used to encode the training data to generate a target feature vector pair.
  • the training data is action posture data
  • the target feature vector pair consists of two target feature vectors.
  • “Feature vector 1” and “feature vector 2” in Figure 6 constitute a target feature vector pair
  • the encoder to be trained can be an encoder in a generative adversarial network.
  • S502 Input the target feature vector pair to the decoder to be trained to generate a fourth action gesture.
  • the target feature vector pair is used as the input of the decoder to be trained to generate the fourth action posture.
  • the decoder to be trained may be a decoder in a generative adversarial network.
  • In this step, the encoder to be trained and the decoder to be trained are iteratively trained based on the difference between the third action posture and the fourth action posture.
  • When training of the encoder and the decoder is completed, the target encoder and the target decoder are obtained. It should be noted that the above target encoder and target decoder can be applied to different virtual scenes, depending on the virtual scene corresponding to the training data.
  • Optionally, when the encoder and decoder are used in a generative adversarial network, the loss function of the generative adversarial network is adjusted; when the difference between the third action posture and the fourth action posture falls below a preset threshold, it is determined that the training of the encoder and decoder included in the generative adversarial network is complete, that is, the target encoder and target decoder are obtained.
  • the loss function value in the generative adversarial network can represent the similarity between the third action posture and the fourth action posture.
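  • A minimal training-loop sketch is given below, under stated assumptions: `encoder` and `decoder` are torch.nn.Module instances, `loader` yields batches of third action postures as tensors, and a plain reconstruction loss stands in for the generative-adversarial loss that the document mentions but does not specify.

```python
import torch
import torch.nn.functional as F

def train_codec(encoder, decoder, loader, threshold=1e-3, max_epochs=100, lr=1e-4):
    params = list(encoder.parameters()) + list(decoder.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for _ in range(max_epochs):
        epoch_loss = 0.0
        for third_posture in loader:
            vector_pair = encoder(third_posture)       # target feature vector pair
            fourth_posture = decoder(vector_pair)      # reconstructed action posture
            loss = F.mse_loss(fourth_posture, third_posture)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        if epoch_loss / max(len(loader), 1) < threshold:  # difference below the preset threshold
            break                                         # target encoder and decoder obtained
    return encoder, decoder
```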
  • Optionally, before the training data is input to the encoder to be trained, the method further includes: acquiring a training image set; and performing action posture extraction on the at least one training image to obtain the training data.
  • the above training image set includes at least one training image, and the above training image is used to characterize the second human body feature.
  • a training image set is obtained, and action gestures are extracted for each training image included in the training image set to obtain training data.
  • Optionally, the HPE algorithm can be used to extract action postures from the training images, or another algorithm can be used to do so, which is not specifically limited here.
  • Figure 7 shows the process of using the HPE algorithm to extract action postures from training images.
  • the training images are used as the input of the HPE algorithm, and the action postures corresponding to each training image are output, that is, training data.
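  • A corresponding sketch for preparing the training data, again assuming a hypothetical `hpe_model` pose-extraction callable, is:

```python
def build_training_data(training_image_set, hpe_model):
    # one third action posture per training image
    return [hpe_model(image) for image in training_image_set]
```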
  • the virtual object generation device 800 includes:
  • the extraction module 801 is used to extract the action posture of the first human body feature corresponding to the target user to obtain the first action posture;
  • The determining module 802 is used to determine the first feature vector and the second feature vector corresponding to the first action posture, where the first feature vector is determined based on the first action posture, and the second feature vector is determined based on the first feature vector;
  • the processing module 803 is configured to decode the first feature vector and the second feature vector to obtain a second action gesture, where the second action gesture is used to characterize the second human body feature corresponding to the target user;
  • Generating module 804 configured to generate a virtual object based on the second action gesture.
  • the determination module 802 is specifically used to:
  • the first action gesture is encoded by a target encoder to obtain the first feature vector
  • a second feature vector is determined based on the feature vector database and the first feature vector.
  • the determination module 802 is also specifically used to:
  • a third feature vector associated with the first feature vector is determined according to the feature vector database, where the third feature vector is the feature vector in the feature vector database with the smallest vector distance to the first feature vector;
  • a first feature vector pair associated with the third feature vector is determined according to the feature vector database, where the feature vector database includes at least one feature vector pair;
  • a feature vector in the first feature vector pair other than the third feature vector is determined as the second feature vector.
  • the processing module 803 is specifically used to: combine the first feature vector and the second feature vector into a second feature vector pair; and decode the second feature vector pair through a target decoder.
  • In the embodiments of the present application, the action posture corresponding to the first human body feature of the target user is extracted to obtain a first action posture; a first feature vector and a second feature vector corresponding to the first action posture are determined; the first feature vector and the second feature vector are decoded to obtain a second action posture, which is used to represent the second human body feature corresponding to the target user; and a virtual object is generated based on the second action posture.
  • In this way, even when only the action posture corresponding to the first human body feature of the target user is extracted, that is, when the amount of captured action posture data is small, a virtual object corresponding to the target user can still be generated from the first action posture, thereby generating a virtual object related to the user's own posture.
  • the codec training device 900 includes:
  • the first generation module 901 is used to input training data to the encoder to be trained and generate a target feature vector pair, where the training data includes at least one third action gesture;
  • the second generation module 902 is used to input the target feature vector pair to the decoder to be trained and generate a fourth action gesture
  • the training module 903 is configured to iteratively train the encoder to be trained and the decoder to be trained based on the third action posture and the fourth action posture to obtain a target encoder and a target decoder.
  • the codec training device 900 also includes:
  • An acquisition module configured to acquire a training image set, where the training image set includes at least one training image, and the training image is used to characterize the second human body feature;
  • An extraction module is used to extract action gestures from the at least one training image to obtain the training data.
  • the virtual object generation device and the codec training device in the embodiment of the present application may be electronic equipment, or may be components in electronic equipment, such as integrated circuits or chips.
  • the electronic device may be a terminal or other devices other than the terminal.
  • the electronic device can be a mobile phone, a tablet computer, a notebook computer, a handheld computer, a vehicle-mounted electronic device, a mobile internet device (Mobile Internet Device, MID), an augmented reality (AR)/virtual reality (VR) device, a robot, a wearable device, an ultra-mobile personal computer (UMPC), a netbook or a personal digital assistant (PDA), and may also be a server, a network attached storage (NAS), a personal computer (PC), a television (TV), a teller machine or a self-service machine, which is not specifically limited in the embodiments of this application.
  • the virtual object generation device and the codec training device in the embodiment of the present application may be devices with an operating system.
  • the operating system can be an Android operating system, an iOS operating system, or another possible operating system, which is not specifically limited in the embodiments of this application.
  • the virtual object generation device provided by the embodiment of the present application can implement each process implemented by the method embodiment in Figure 1. To avoid duplication, the details will not be described here.
  • the codec training device provided by the embodiment of the present application can implement each process implemented by the method embodiment in Figure 5. To avoid repetition, details will not be described here.
  • this embodiment of the present application also provides an electronic device 1000, including a processor 1001, a memory 1002, and programs or instructions stored on the memory 1002 and executable on the processor 1001.
  • when the program or instructions are executed by the processor 1001, each process of the above virtual object generation method embodiment or each process of the above codec training method embodiment is implemented, and the same technical effect can be achieved; to avoid duplication, details are not repeated here.
  • the electronic devices in the embodiments of the present application include the above-mentioned mobile electronic devices and non-mobile electronic devices.
  • Figure 11 is a schematic diagram of the hardware structure of an electronic device that implements an embodiment of the present application.
  • the electronic device 1100 includes but is not limited to: a radio frequency unit 1101, a network module 1102, an audio output unit 1103, an input unit 1104, a sensor 1105, a display unit 1106, a user input unit 1107, an interface unit 1108, a memory 1109, a processor 1110, and other components.
  • the electronic device 1100 may also include a power supply (such as a battery) that supplies power to various components.
  • the power supply may be logically connected to the processor 1110 through a power management system, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management system.
  • the structure of the electronic device shown in Figure 11 does not constitute a limitation on the electronic device.
  • the electronic device may include more or fewer components than shown in the figure, combine certain components, or use a different arrangement of components; details are not described again here.
  • the processor 1110 is also used to: extract the action posture of the first human body feature corresponding to the target user to obtain the first action posture; determine the first feature vector and the second feature vector corresponding to the first action posture; decode the first feature vector and the second feature vector to obtain the second action posture; and generate a virtual object based on the second action posture.
  • the processor 1110 is also configured to encode the first action gesture through a target encoder to obtain the first feature vector;
  • a second feature vector is determined based on the feature vector database and the first feature vector.
  • the processor 1110 is further configured to determine a third feature vector associated with the first feature vector according to the feature vector database;
  • a feature vector other than the third feature vector in the first feature vector pair is determined as the second feature vector.
  • the processor 1110 is further configured to combine the first feature vector and the second feature vector into a second feature vector pair;
  • the second feature vector pair is decoded by a target decoder.
  • the input unit 1104 is used to obtain the target image
  • the processor 1110 is also configured to extract action postures from the first human body features to obtain the first action postures.
  • In the embodiments of the present application, the action posture corresponding to the first human body feature of the target user is extracted to obtain a first action posture; a first feature vector and a second feature vector corresponding to the first action posture are determined; the first feature vector and the second feature vector are decoded to obtain a second action posture, which is used to represent the second human body feature corresponding to the target user; and a virtual object is generated based on the second action posture.
  • In this way, even when only the action posture corresponding to the first human body feature of the target user is extracted, that is, when the amount of captured action posture data is small, a virtual object corresponding to the target user can still be generated from the first action posture, thereby generating a virtual object related to the user's own posture.
  • the input unit 1104 is also used to input training data to the encoder to be trained and generate a target feature vector pair;
  • the processor 1110 is also configured to input the target feature vector pair to the decoder to be trained to generate a fourth action gesture
  • the encoder to be trained and the decoder to be trained are iteratively trained to obtain a target encoder and a target decoder.
  • the input unit 1104 is also used to obtain a training image set
  • the processor 1110 is also configured to extract action gestures from the at least one training image to obtain the training data.
  • the input unit 1104 may include a graphics processor (Graphics Processing Unit, GPU) 11041 and a microphone 11042.
  • the graphics processor 11041 processes image data of still pictures or videos obtained by an image capture device (such as a camera) in a video capture mode or an image capture mode.
  • the display unit 1106 may include a display panel 11061, which may be configured in the form of a liquid crystal display, an organic light emitting diode, or the like.
  • the user input unit 1107 includes at least one of a touch panel 11071 and other input devices 11072 .
  • The touch panel 11071, also called a touch screen, may include two parts: a touch detection device and a touch controller.
  • Other input devices 11072 may include but are not limited to physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, and joysticks, which will not be described again here.
  • Memory 1109 may be used to store software programs as well as various data.
  • the memory 1109 may mainly include a first storage area for storing programs or instructions and a second storage area for storing data, where the first storage area may store an operating system and the application programs or instructions required for at least one function (such as a sound playback function, an image playback function, etc.).
  • memory 1109 may include volatile memory or nonvolatile memory, or memory 1109 may include both volatile and nonvolatile memory.
  • non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), electrically erasable programmable read-only memory (Electrically EPROM, EEPROM), or flash memory.
  • Volatile memory can be random access memory (Random Access Memory, RAM), static random access memory (Static RAM, SRAM), dynamic random access memory (Dynamic RAM, DRAM), synchronous dynamic random access memory (Synchronous DRAM, SDRAM), double data rate synchronous dynamic random access memory (Double Data Rate SDRAM, DDRSDRAM), enhanced synchronous dynamic random access memory (Enhanced SDRAM, ESDRAM), synchronous link dynamic random access memory (Synch link DRAM, SLDRAM), or direct rambus random access memory (Direct Rambus RAM, DRRAM). The memory 1109 in the embodiments of this application includes, but is not limited to, these and any other suitable types of memory.
  • the processor 1110 may include one or more processing units; optionally, the processor 1110 integrates an application processor and a modem processor, where the application processor mainly handles operations related to the operating system, user interface, application programs, etc., Modem processors mainly process wireless communication signals, such as baseband processors. It can be understood that the above modem processor may not be integrated into the processor 1110.
  • Embodiments of the present application also provide a readable storage medium on which programs or instructions are stored. When the programs or instructions are executed by a processor, each process of the above virtual object generation method embodiment or each process of the above codec training method embodiment is implemented, and the same technical effect can be achieved; to avoid repetition, details are not described again here.
  • the processor is the processor in the electronic device described in the above embodiment.
  • the readable storage media includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disks or optical disks.
  • An embodiment of the present application further provides a chip.
  • the chip includes a processor and a communication interface.
  • the communication interface is coupled to the processor.
  • the processor is used to run programs or instructions to implement each process of the above virtual object generation method embodiment, or to implement each process of the above codec training method embodiment, and the same technical effect can be achieved; to avoid repetition, details are not described again here.
  • the chips mentioned in the embodiments of this application may also be called system-level chips, system chips, chip systems, or system-on-chip chips.
  • Embodiments of the present application provide a computer program product.
  • the program product is stored in a storage medium.
  • the program product is executed by at least one processor to implement each process of the above virtual object generation method embodiment, or to implement each process of the above codec training method embodiment, and the same technical effect can be achieved; to avoid repetition, details are not described again here.
  • the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; of course, they can also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a computer software product.
  • the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal (which can be a mobile phone, a computer, a server, a network device, etc.) to execute the methods described in the various embodiments of this application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Graphics (AREA)
  • Computer Hardware Design (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • User Interface Of Digital Computer (AREA)

Abstract

Embodiments of the present application provide a virtual object generation method, a codec training method, and apparatuses therefor, belonging to the field of virtual reality technology. The virtual object generation method includes: extracting an action posture of a first human body feature corresponding to a target user to obtain a first action posture; determining a first feature vector and a second feature vector corresponding to the first action posture, the first feature vector being determined based on the first action posture and the second feature vector being determined based on the first feature vector; decoding the first feature vector and the second feature vector to obtain a second action posture, the second action posture being used to represent a second human body feature corresponding to the target user; and generating a virtual object based on the second action posture.

Description

虚拟对象生成方法、编解码器训练方法及其装置 技术领域
本申请属于虚拟现实技术领域,具体涉及一种虚拟对象生成方法、编解码器训练方法及其装置。
背景技术
随着虚拟现实技术的成熟以及“元宇宙”概念的兴起,在一些虚拟场景中,例如虚拟会议、虚拟主播等场景,通过电子设备的摄像头捕捉用户的动作姿态,对动作姿态进行估计和分析,进而根据捕捉到的动作姿态生成虚拟对象,这样,用户以该虚拟对象的方式在虚拟场景中进行展示。
然而,上述过程中,可以基于捕捉到的动作姿态,生成与用户自身的姿态相关的虚拟对象。
发明内容
本申请实施例的目的是一种虚拟对象生成方法、编解码器训练方法及其装置,能够解决不能生成与用户自身姿态相关的虚拟对象的问题。
第一方面,本申请实施例提供了一种虚拟对象生成方法,该方法包括:
提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;
确定所述第一动作姿态对应的第一特征向量和第二特征向量,所述第一特征向量基于所述第一动作姿态确定,所述第二特征向量基于所述第一特征向量确定;
对所述第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态,所述第二动作姿态用于表征所述目标用户对应的第二人体特征;
基于所述第二动作姿态,生成虚拟对象。
第二方面,本申请实施例提供了一种编解码器训练方法,应用于第一方面所述的方法,该编解码器训练方法包括:
将训练数据输入至待训练的编码器,生成目标特征向量对,所述训练数据包括至少一个第三动作姿态;
将所述目标特征向量对输入至待训练的解码器,生成第四动作姿态;
基于所述第三动作姿态和所述第四动作姿态,对所述待训练的编码器和所述待训练的解码器进行迭代训练,得到目标编码器和目标解码器。
第三方面,本申请实施例提供了一种虚拟对象生成装置,该装置包括:
提取模块,用于提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;
确定模块,用于确定所述第一动作姿态对应的第一特征向量和第二特征向量,所述第一特征向量基于所述第一动作姿态确定,所述第二特征向量基于所述第一特征向量确定;
处理模块,用于对所述第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态,所述第二动作姿态用于表征所述目标用户对应的第二人体特征;
生成模块,用于基于所述第二动作姿态,生成虚拟对象。
第四方面,本申请实施例提供了一种编解码器训练装置,应用于第三方面所述的装置,该编解码器训练装置包括:
第一生成模块,用于将训练数据输入至待训练的编码器,生成目标特征向量对,所述训练数据包括至少一个第三动作姿态;
第二生成模块,用于将所述目标特征向量对输入至待训练的解码器,生成第四动作姿态;
训练模块,用于基于所述第三动作姿态和所述第四动作姿态,对待训练的编码器和所述待训练的解码器进行迭代训练,得到目标编码器和目标解码器。
第五方面,本申请实施例提供了一种电子设备,该电子设备包括处理器、存储器及存储在所述存储器上并可在所述处理器上运行的程序或指令,所述程序或指令被所述处理器执行时实现如第一方面所述的方法的步骤,或者实现如第二方面所述的方法的步骤。
第六方面,本申请实施例提供了一种可读存储介质,所述可读存储介质上存储程序或指令,所述程序或指令被处理器执行时实现如第一方面所述的方法的步骤,或者实现如第二方面所述的方法的步骤。
第七方面,本申请实施例提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现如第一方面所述的方法,或者实现如第二方面所述的方法的步骤。
第八方面,本申请实施例提供一种计算机程序产品,该程序产品被存储在存储介质中,该程序产品被至少一个处理器执行以实现如第一方面所述的方法,或者实现如第二方面所述的方法的步骤。
本申请实施例中,提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;确定第一动作姿态对应的第一特征向量和第二特征向量;对第一特征向量和第二特征向量进行解码处理,得到第二动作姿态,第二动作姿态用于表征目标用户对应的第二人体特征;基于第二动作姿态,生成虚拟对象。本申请实施例中,可以当只提取到目标用户对应的第一人体特征的动作姿态时,即捕捉到的动作姿态的数据量较少的情况下,也能通过该第一动作姿态生成目标用户对应的虚拟对象,以此生成与用户自身的姿态相关的虚拟对象。
附图说明
图1是本申请实施例提供的虚拟对象生成方法的流程图;
图2是本申请实施例提供的虚拟对象生成方法的应用场景图之一;
图3是本申请实施例提供的虚拟对象生成方法的应用场景图之二;
图4是本申请实施例提供的虚拟对象生成方法的应用场景图之三;
图5是本申请实施例提供的编解码器训练方法的流程图;
图6是本申请实施例提供的编解码器训练方法的应用场景图之一;
图7是本申请实施例提供的编解码器训练方法的应用场景图之二;
图8是本申请实施例提供的虚拟对象生成装置的结构图;
图9是本申请实施例提供的编解码器训练装置的结构图;
图10是本申请实施例提供的电子设备的结构图;
图11是本申请实施例提供的电子设备的硬件结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员获得的所有其他实施例,都属于本申请保护的范围。
本申请的说明书和权利要求书中的术语“第一”、“第二”等是用于区别类似的对象,而不用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便本申请的实施例能够以除了在这里图示或描述的那些以外的顺序实施,且“第一”、“第二”等所区分的对象通常为一类,并不限定对象的个数,例如第一对象可以是一个,也可以是多个。此外,说明书以及权利要求中“和/或”表示所连接对象的至少其中之一,字符“/”,一般表示前后关联对象是一种“或”的关系。
下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的虚拟对象生成方法进行详细地说明。
本申请实施例提供了一种虚拟对象生成方法,本申请实施例提供的虚拟对象生成方法应用的虚拟场景可以是虚拟会议、虚拟主播等场景,出于清楚阐述技术方案的需要,下面以该虚拟对象生成方法应用于虚拟会议场景为例进行阐述。
请参阅图1,图1是本申请实施例提供的虚拟对象生成方法的流程图。本申请实施例提供的虚拟对象生成方法包括以下步骤:
S101,提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态。
本步骤中,可以使用人体姿态估计(Human Pose Estimation,HPE)算法对目标图像进行处理,提取目标图像中第一人体特征的动作姿态,获得第一动作姿态。其中,上述第一人体特征为目标用户对应的部分人体特征,上述第一动作姿态为目标用户做出特定动作时对应的特定关节位置的数据信息。
为便于理解,请参阅图2,在图2示出的场景中,目标用户手持电子设备,通过电子设备的摄像头获取到目标图像,图2中的目标图像包括目标用户右侧人体的第一人体特征,进而使用HPE算法提取目标图像中提取上述第一人体特征的动作姿态,获得第一动作姿态。
S102,确定所述第一动作姿态对应的第一特征向量和第二特征向量。
本步骤中,在得到第一动作姿态之后,确定第一动作姿态对应的第一特征向量和第二特征向量,其中,第一特征向量基于第一动作姿态确定,第二特征向量基于第一特征向量确定。具体的如何确定第一动作姿态对应的第一特征向量和第二特征向量的技术方案,请参阅后续实施例。
S103,对所述第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态。
本步骤中,在得到第一特征向量和第二特征向量之后,对第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态,其中,上述第二动作姿态用于表征目标用户的全部人体特征。具体的如何对第一特征向量和第二特征向量进行解码处理,得到第二动作姿态的技术方案,请参阅后续实施例。
S104,基于所述第二动作姿态,生成虚拟对象。
可选地实施方式为,使用渲染引擎(the rendering engine)对第二动作姿态进行渲染(render),以此生成虚拟场景中的虚拟对象。
本申请实施例中,提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;确定第一动作姿态对应的第一特征向量和第二特征向量;对第一特征向量和第二特征向量进行解码处理,得到第二动作姿态,第二动作姿态用于表征目标用户对应的第二人体特征;基于第二动作姿态,生成虚拟对象。本申请实施例中,可以当只提取到目标用户对应的第一人体特征的动作姿态时,即捕捉到的动作姿态的数据量较少的情况下,也能通过该第一动作姿态生成目标用户对应的虚拟对象,以此生成与用户自身的姿态相关的虚拟对象。
可选地,所述确定所述第一动作姿态对应的第一特征向量和第二特征向量包括:
通过目标编码器对所述第一动作姿态进行编码处理,得到所述第一特征向量;
根据特征向量数据库和所述第一特征向量,确定第二特征向量。
本实施例中,应用虚拟对象生成方法的装置预先设置有特征向量数据库, 该特征向量数据库包括至少一个特征向量对,每个特征向量由两个特征向量组成。
可选地,可以人工挑选出一定数量的人体图像,使用HPE算法确定上述人体图像中的动作姿态,并对上述动作姿态进行编码处理,得到特征向量对,将上述特征向量对存储至数据库中,上述存储有特征向量对的数据库又称为特征向量数据库。
本实施例中,将第一动作姿态作为目标编码器的输入,得到第一特征向量;在得到第一特征向量之后,使用特征向量数据库对第一特征向量执行查询操作,确定第二特征向量。具体的如何使用特征向量数据库对第一特征向量执行查询操作,确定第二特征向量的技术方案,请参阅后续实施例。
可选地,上述目标编码器可以是使用生成对抗网络(Generative Adversarial Network,GAN)训练的编码器,也可以是使用卷积神经网络(Convolutional Neural Networks,CNN)训练的编码器,或使用其他神经网络训练的编码器,在此不作具体限定。为便于理解,请参阅图3,如图3所示,目标图像包括目标用户右侧人体的人体特征,使用目标编码器对目标图像进行编码处理,得到第一特征向量;进一步的,使得预设的特征向量数据库,基于第一特征向量,确定第二特征向量。
本实施例中,使用目标编码器对第一动作姿态进行编码处理,得到第一特征向量,根据特征向量数据库和第一特征向量,确定第二特征向量,在后续步骤中,基于上述第一特征向量和第二特征向量确定表征目标用户全部人体特征的第二动作姿态,进而生成完整的虚拟对象。
可选地,所述根据特征向量数据库和所述第一特征向量,确定第二特征向量包括:
根据所述特征向量数据库,确定与所述第一特征向量相关联的第三特征向量;
根据所述特征向量数据库,确定所述第三特征向量相关联的第一特征向量对;
将所述第一特征向量对中除所述第三特征向量之外的一个特征向量,确定为所述第二特征向量。
本实施例中,在特征向量数据库对所述第一特征向量进行查询,将特征向量数据库中与第一特征向量之间的向量距离最小的特征向量,确定为第三特征向量。可选地,可以使用L1范数算法、L2范数算法或者其他方式计算第一特征向量与特征向量数据库中每个特征向量之间的向量距离。
如上所述,特征向量数据库包括至少一个特征向量对,每个特征向量由两个特征向量组成。因此,在确定第三特征向量之后,对第三特征向量进行查询,确定特征向量数据库中与第三特征向量相关联的第一特征向量对,并将第一特征向量对中除第三特征向量之外的一个特征向量,确定为第二特征向量。
可选地,所述对所述第一特征向量和所述第二特征向量进行解码处理包括:
将所述第一特征向量和所述第二特征向量组合成第二特征向量对;
通过目标解码器,对所述第二特征向量对进行解码处理。
上述目标解码器可以是使用生成对抗网络训练的解码器,也可以是使用卷积神经网络训练的解码器,或使用其他神经网络训练的解码器,在此不作具体限定。
本实施例中,在得到第一特征向量和第二特征向量之后,由于目标解码器的输入数据为特征向量对,因此将第一特征向量和第二特征向量组合成第二特征向量对。将上述第二特征向量对作为目标解码器的输入,使用该目标解码器对第二特征向量对进行解码处理。
为便于理解技术方案,请参阅图4,如图4所示,将第一特征向量和第二特征向量组成的第二特征向量对作为目标解码器的输入,得到第二动作姿态,进一步的,使用渲染引擎对第二动作姿态进行渲染,生成图4中的虚拟对象。
可选地,所述提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态包括:
获取目标图像;
对所述第一人体特征进行动作姿态提取,得到所述第一动作姿态。
上述目标图像包括目标用户对应的第一人体特征。可选地,上述第一人 体特征为目标用户对应的部分人体特征。
在一可选地实施场景中,目标用户可以手持电子设备,这种实施场景下,获取电子设备的摄像头拍摄到的目标图像。在另一实施场景中,目标用户也可以不手持电子设备,将电子设备固定并使用电子设备拍照,这种实施场景下,也可以通过电子设备的摄像头获取到目标图像。
在上述实施场景中,若目标用户只有部分人体出现在摄像头拍摄到的画面中,则获取到的目标图像中只包括目标用户对应的第二人体特征。
本实施例中,在获取目标图像之后,对目标图像包括的第一人体特征进行动作姿态提取,得到第一动作姿态。具体的进行动作姿态提取的方式与上述动作姿态提取的方式一致,在此不做重复阐述。
本申请实施例提供了一种编解码器训练方法,该编解码器训练方法应用于上述虚拟对象生成方法,请参阅图5,图5是本申请实施例提供的编解码器训练方法的流程图。本申请实施例提供的编解码器训练方法包括以下步骤:
S501,将训练数据输入至待训练的编码器,生成目标特征向量对。
上述训练数据包括至少一个第三动作姿态,可选地,上述训练数据可以是目标用户的手臂动作数据。
请参阅图6,本步骤中,可选地,可以将训练数据输入至待训练的编码器,使用该编码器对训练数据进行编码处理,生成目标特征向量对。其中,训练数据为动作姿态数据,目标特征向量对由两个目标特征向量组成。图6中的“特征向量1”和“特征向量2”构成一个目标特征向量对,待训练的编码器可以为生成对抗网络中的编码器。
S502,将所述目标特征向量对输入至待训练的解码器,生成第四动作姿态。
请参阅图6,本步骤中,在得到目标特征向量对后,将目标特征向量对作为待训练的解码器的输入,生成第四动作姿态。其中,待训练的解码器可以为生成对抗网络中的解码器。
S503,基于所述第三动作姿态和所述第四动作姿态,对所述待训练的编码器和所述待训练的解码器进行迭代训练,得到目标编码器和目标解码器。
本步骤中,基于第三动作姿态和第四动作姿态之间的差异,对待训练的 编码器和所述待训练的解码器进行迭代训练,在编码器和解码器训练完成的情况下,得到目标编码器和目标解码器。需要说明的是,上述目标编码器和目标解码器,根据训练数据对应的虚拟场景的不同,可以应用于不同的虚拟场景。
可选地,在编码器和解码器应用于生成对抗网络的情况下,调整生成对抗网络的损失函数,在第三动作姿态和第四动作姿态之间的差异达到低于预设阈值的情况下,确定生成对抗网络包括的编码器和解码器训练完成,即得到目标编码器和目标解码器。其中,生成对抗网络中的损失函数值可以表征第三动作姿态和第四动作姿态之间的相似度。
可选地,所述将训练数据输入至待训练的编码器之前,所述方法还包括:
获取训练图像集;
对所述至少一个训练图像进行动作姿态提取,得到所述训练数据。
上述训练图像集包括至少一个训练图像,上述训练图像用于表征第二人体特征。
本实施例中,获取训练图像集,并对训练图像集包括的每个训练图像进行动作姿态提取,获得训练数据,可选地,可以使用HPE算法对训练图像进行动作姿态提取,也可以使用其他算法对训练图像进行动作姿态提取,在此不作具体限定。
请一并参阅图7,图7示出的是使用HPE算法对训练图像进行动作姿态提取的过程,将训练图像作为HPE算法的输入,输出得到每个训练图像对应的动作姿态,即训练数据。
下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的虚拟对象生成装置进行详细地说明。
如图8所示,虚拟对象生成装置800包括:
提取模块801,用于提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;
确定模块802,用于确定所述第一动作姿态对应的第一特征向量和第二特征向量,所述第一特征向量基于所述第一动作姿态确定,所述第二特征向量基于所述第一特征向量确定;
处理模块803,用于对所述第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态,所述第二动作姿态用于表征所述目标用户对应的第二人体特征;
生成模块804,用于基于所述第二动作姿态,生成虚拟对象。
可选地,所述确定模块802,具体用于:
通过目标编码器对所述第一动作姿态进行编码处理,得到所述第一特征向量;
根据特征向量数据库和所述第一特征向量,确定第二特征向量。
可选地,所述确定模块802,还具体用于:
根据所述特征向量数据库,确定与所述第一特征向量相关联的第三特征向量,所述第三特征向量为所述特征向量数据库中与所述第一特征向量之间的向量距离最小的特征向量;
根据所述特征向量数据库,确定所述第三特征向量相关联的第一特征向量对,所述特征向量数据库包括至少一个特征向量对;
将所述第一特征向量对中除所述第三特征向量之外的一个特征向量,确定为所述第二特征向量。
可选地,所述处理模块803,具体用于:
将所述第一特征向量和所述第二特征向量组合成第二特征向量对;
通过目标解码器,对所述第二特征向量对进行解码处理。
本申请实施例中,提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;确定第一动作姿态对应的第一特征向量和第二特征向量;对第一特征向量和第二特征向量进行解码处理,得到第二动作姿态,第二动作姿态用于表征目标用户对应的第二人体特征;基于第二动作姿态,生成虚拟对象。本申请实施例中,可以当只提取到目标用户对应的第一人体特征的动作姿态时,即捕捉到的动作姿态的数据量较少的情况下,也能通过该第一动作姿态生成目标用户对应的虚拟对象,以此生成与用户自身的姿态相关的虚拟对象。
下面结合附图,通过具体的实施例及其应用场景对本申请实施例提供的编解码器训练装置进行详细地说明。
如图9所示,编解码器训练装置900包括:
第一生成模块901,用于将训练数据输入至待训练的编码器,生成目标特征向量对,所述训练数据包括至少一个第三动作姿态;
第二生成模块902,用于将所述目标特征向量对输入至待训练的解码器,生成第四动作姿态;
训练模块903,用于基于所述第三动作姿态和所述第四动作姿态,对所述待训练的编码器和所述待训练的解码器进行迭代训练,得到目标编码器和目标解码器。
可选地,所述编解码器训练装置900还包括:
获取模块,用于获取训练图像集,所述训练图像集包括至少一个训练图像,所述训练图像用于表征第二人体特征;
提取模块,用于对所述至少一个训练图像进行动作姿态提取,得到所述训练数据。
本申请实施例中的虚拟对象生成装置和编解码器训练装置可以是电子设备,也可以是电子设备中的部件、例如集成电路或芯片。该电子设备可以是终端,也可以为除终端之外的其他设备。示例性的,电子设备可以为手机、平板电脑、笔记本电脑、掌上电脑、车载电子设备、移动上网装置(Mobile Internet Device,MID)、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、机器人、可穿戴设备、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本或者个人数字助理(personal digital assistant,PDA)等,还可以为服务器、网络附属存储器(Network Attached Storage,NAS)、个人计算机(personal computer,PC)、电视机(television,TV)、柜员机或者自助机等,本申请实施例不作具体限定。
本申请实施例中的虚拟对象生成装置和编解码器训练装置可以为具有操作系统的装置。该操作系统可以为安卓(Android)操作系统,可以为ios操作系统,还可以为其他可能的操作系统,本申请实施例不作具体限定。
本申请实施例提供的虚拟对象生成装置能够实现图1的方法实施例实现的各个过程,为避免重复,这里不再赘述。
本申请实施例提供的编解码器训练装置能够实现图5的方法实施例实现 的各个过程,为避免重复,这里不再赘述。
可选地,如图10所示,本申请实施例还提供一种电子设备1000,包括处理器1001,存储器1002,存储在存储器1002上并可在所述处理器1001上运行的程序或指令,该程序或指令被处理器1001执行时实现上述虚拟对象生成方法实施例的各个过程,或者实现上述编解码器训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
需要说明的是,本申请实施例中的电子设备包括上述所述的移动电子设备和非移动电子设备。
图11为实现本申请实施例的一种电子设备的硬件结构示意图。
该电子设备1100包括但不限于:射频单元1101、网络模块1102、音频输出单元1103、输入单元1104、传感器1105、显示单元1106、用户输入单元1107、接口单元1108、存储器1109、以及处理器1110等部件。
本领域技术人员可以理解,电子设备1100还可以包括给各个部件供电的电源(比如电池),电源可以通过电源管理系统与处理器1110逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。图11中示出的电子设备结构并不构成对电子设备的限定,电子设备可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置,在此不再赘述。
处理器1110,还用于提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;
确定所述第一动作姿态对应的第一特征向量和第二特征向量;
对所述第一特征向量和所述第二特征向量进行解码处理,得到第二动作姿态;
基于所述第二动作姿态,生成虚拟对象。
其中,处理器1110,还用于通过目标编码器对所述第一动作姿态进行编码处理,得到所述第一特征向量;
根据特征向量数据库和所述第一特征向量,确定第二特征向量。
其中,处理器1110,还用于根据所述特征向量数据库,确定与所述第一特征向量相关联的第三特征向量;
根据所述特征向量数据库中,确定所述第三特征向量相关联的第一特征 向量对;
将所述第一特征向量对中除所述第三特征向量之外的一个特征向量,确定为所述第二特征向量。
其中,处理器1110,还用于将所述第一特征向量和所述第二特征向量组合成第二特征向量对;
通过目标解码器,对所述第二特征向量对进行解码处理。
其中,输入单元1104,用于获取目标图像;
处理器1110,还用于对所述第一人体特征进行动作姿态提取,得到所述第一动作姿态。
本申请实施例中,提取目标用户对应的第一人体特征的动作姿态,得到第一动作姿态;确定第一动作姿态对应的第一特征向量和第二特征向量;对第一特征向量和第二特征向量进行解码处理,得到第二动作姿态,第二动作姿态用于表征目标用户对应的第二人体特征;基于第二动作姿态,生成虚拟对象。本申请实施例中,可以当只提取到目标用户对应的第一人体特征的动作姿态时,即捕捉到的动作姿态的数据量较少的情况下,也能通过该第一动作姿态生成目标用户对应的虚拟对象,以此生成与用户自身的姿态相关的虚拟对象。
其中,输入单元1104,还用于将训练数据输入至待训练的编码器,生成目标特征向量对;
处理器1110,还用于将所述目标特征向量对输入至待训练的解码器,生成第四动作姿态;
基于所述第三动作姿态和所述第四动作姿态,对所述待训练的编码器和所述待训练的解码器进行迭代训练,得到目标编码器和目标解码器。
其中,输入单元1104,还用于获取训练图像集;
处理器1110,还用于对所述至少一个训练图像进行动作姿态提取,得到所述训练数据。
应理解的是,本申请实施例中,输入单元1104可以包括图形处理器(Graphics Processing Unit,GPU)11041和麦克风11042,图形处理器11041对在视频捕获模式或图像捕获模式中由图像捕获装置(如摄像头)获得的静 态图片或视频的图像数据进行处理。显示单元1106可包括显示面板11061,可以采用液晶显示器、有机发光二极管等形式来配置显示面板11061。用户输入单元1107包括触控面板11071以及其他输入设备11072中的至少一种。触控面板11061,也称为触摸屏。触控面板11061可包括触摸检测装置和触摸控制器两个部分。其他输入设备11072可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆,在此不再赘述。
存储器1109可用于存储软件程序以及各种数据。存储器1109可主要包括存储程序或指令的第一存储区和存储数据的第二存储区,其中,第一存储区可存储操作系统、至少一个功能所需的应用程序或指令(比如声音播放功能、图像播放功能等)等。此外,存储器1109可以包括易失性存储器或非易失性存储器,或者,存储器1109可以包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDRSDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synch link DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DRRAM)。本申请实施例中的存储器1109包括但不限于这些和任意其它适合类型的存储器。
处理器1110可包括一个或多个处理单元;可选的,处理器1110集成应用处理器和调制解调处理器,其中,应用处理器主要处理涉及操作系统、用户界面和应用程序等的操作,调制解调处理器主要处理无线通信信号,如基带处理器。可以理解的是,上述调制解调处理器也可以不集成到处理器1110中。
本申请实施例还提供一种可读存储介质,所述可读存储介质上存储有程序或指令,该程序或指令被处理器执行时实现上述虚拟对象生成方法实施例 的各个过程,或者实现上述编解码器训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
其中,所述处理器为上述实施例中所述的电子设备中的处理器。所述可读存储介质,包括计算机可读存储介质,如计算机只读存储器(ROM)、随机存取存储器(RAM)、磁碟或者光盘等。
本申请实施例另提供了一种芯片,所述芯片包括处理器和通信接口,所述通信接口和所述处理器耦合,所述处理器用于运行程序或指令,实现上述虚拟对象生成方法实施例的各个过程,实现上述编解码器训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
应理解,本申请实施例提到的芯片还可以称为系统级芯片、系统芯片、芯片系统或片上系统芯片等。
本申请实施例提供一种计算机程序产品,该程序产品被存储在存储介质中,该程序产品被至少一个处理器执行以实现上述虚拟对象生成方法实施例的各个过程,或者实现上述编解码器训练方法实施例的各个过程,且能达到相同的技术效果,为避免重复,这里不再赘述。
需要说明的是,在本文中,术语“包括”、“包含”或者其任何其他变体意在涵盖非排他性的包含,从而使得包括一系列要素的过程、方法、物品或者装置不仅包括那些要素,而且还包括没有明确列出的其他要素,或者是还包括为这种过程、方法、物品或者装置所固有的要素。在没有更多限制的情况下,由语句“包括一个……”限定的要素,并不排除在包括该要素的过程、方法、物品或者装置中还存在另外的相同要素。此外,需要指出的是,本申请实施方式中的方法和装置的范围不限按示出或讨论的顺序来执行功能,还可包括根据所涉及的功能按基本同时的方式或按相反的顺序来执行功能,例如,可以按不同于所描述的次序来执行所描述的方法,并且还可以添加、省去、或组合各种步骤。另外,参照某些示例所描述的特征可在其他示例中被组合。
通过以上的实施方式的描述,本领域的技术人员可以清楚地了解到上述实施例方法可借助软件加必需的通用硬件平台的方式来实现,当然也可以通过硬件,但很多情况下前者是更佳的实施方式。基于这样的理解,本申请的 技术方案本质上或者说对现有技术做出贡献的部分可以以计算机软件产品的形式体现出来,该计算机软件产品存储在一个存储介质(如ROM/RAM、磁碟、光盘)中,包括若干指令用以使得一台终端(可以是手机,计算机,服务器,或者网络设备等)执行本申请各个实施例所述的方法。
上面结合附图对本申请的实施例进行了描述,但是本申请并不局限于上述的具体实施方式,上述的具体实施方式仅仅是示意性的,而不是限制性的,本领域的普通技术人员在本申请的启示下,在不脱离本申请宗旨和权利要求所保护的范围情况下,还可做出很多形式,均属于本申请的保护之内。

Claims (18)

  1. A virtual object generation method, comprising:
    extracting an action posture of a first human body feature corresponding to a target user to obtain a first action posture;
    determining a first feature vector and a second feature vector corresponding to the first action posture, the first feature vector being determined based on the first action posture, and the second feature vector being determined based on the first feature vector;
    decoding the first feature vector and the second feature vector to obtain a second action posture, the second action posture being used to represent a second human body feature corresponding to the target user; and
    generating a virtual object based on the second action posture.
  2. The method according to claim 1, wherein determining the first feature vector and the second feature vector corresponding to the first action posture comprises:
    encoding the first action posture through a target encoder to obtain the first feature vector; and
    determining the second feature vector according to a feature vector database and the first feature vector.
  3. The method according to claim 2, wherein determining the second feature vector according to the feature vector database and the first feature vector comprises:
    determining, according to the feature vector database, a third feature vector associated with the first feature vector, the third feature vector being the feature vector in the feature vector database with the smallest vector distance to the first feature vector;
    determining, according to the feature vector database, a first feature vector pair associated with the third feature vector, the feature vector database comprising at least one feature vector pair; and
    determining a feature vector in the first feature vector pair other than the third feature vector as the second feature vector.
  4. The method according to claim 1, wherein decoding the first feature vector and the second feature vector comprises:
    combining the first feature vector and the second feature vector into a second feature vector pair; and
    decoding the second feature vector pair through a target decoder.
  5. The method according to claim 1, wherein extracting the action posture of the first human body feature corresponding to the target user to obtain the first action posture comprises:
    acquiring a target image, the target image comprising the first human body feature corresponding to the target user; and
    performing action posture extraction on the first human body feature to obtain the first action posture.
  6. A codec training method, applied to the method according to any one of claims 1-5, wherein the codec training method comprises:
    inputting training data to an encoder to be trained to generate a target feature vector pair, the training data comprising at least one third action posture;
    inputting the target feature vector pair to a decoder to be trained to generate a fourth action posture; and
    iteratively training the encoder to be trained and the decoder to be trained based on the third action posture and the fourth action posture to obtain a target encoder and a target decoder.
  7. The method according to claim 6, wherein before inputting the training data to the encoder to be trained, the method further comprises:
    acquiring a training image set, the training image set comprising at least one training image, the training image being used to represent a second human body feature; and
    performing action posture extraction on the at least one training image to obtain the training data.
  8. A virtual object generation device, comprising:
    an extraction module, configured to extract an action posture of a first human body feature corresponding to a target user to obtain a first action posture;
    a determining module, configured to determine a first feature vector and a second feature vector corresponding to the first action posture, the first feature vector being determined based on the first action posture, and the second feature vector being determined based on the first feature vector;
    a processing module, configured to decode the first feature vector and the second feature vector to obtain a second action posture, the second action posture being used to represent a second human body feature corresponding to the target user; and
    a generating module, configured to generate a virtual object based on the second action posture.
  9. The device according to claim 8, wherein the determining module is specifically configured to:
    encode the first action posture through a target encoder to obtain the first feature vector; and
    determine the second feature vector according to a feature vector database and the first feature vector.
  10. The device according to claim 9, wherein the determining module is further specifically configured to:
    determine, according to the feature vector database, a third feature vector associated with the first feature vector, the third feature vector being the feature vector in the feature vector database with the smallest vector distance to the first feature vector;
    determine, according to the feature vector database, a first feature vector pair associated with the third feature vector, the feature vector database comprising at least one feature vector pair; and
    determine a feature vector in the first feature vector pair other than the third feature vector as the second feature vector.
  11. The device according to claim 8, wherein the processing module is specifically configured to:
    combine the first feature vector and the second feature vector into a second feature vector pair; and
    decode the second feature vector pair through a target decoder.
  12. The device according to claim 8, wherein the extraction module is specifically configured to:
    acquire a target image, the target image comprising the first human body feature corresponding to the target user; and
    perform action posture extraction on the first human body feature to obtain the first action posture.
  13. A codec training device, applied to the device according to any one of claims 8-12, wherein the codec training device comprises:
    a first generation module, configured to input training data to an encoder to be trained to generate a target feature vector pair, the training data comprising at least one third action posture;
    a second generation module, configured to input the target feature vector pair to a decoder to be trained to generate a fourth action posture; and
    a training module, configured to iteratively train the encoder to be trained and the decoder to be trained based on the third action posture and the fourth action posture to obtain a target encoder and a target decoder.
  14. The device according to claim 13, wherein the device further comprises:
    an acquisition module, configured to acquire a training image set, the training image set comprising at least one training image, the training image being used to represent a second human body feature; and
    an extraction module, configured to perform action posture extraction on the at least one training image to obtain the training data.
  15. An electronic device, comprising a processor, a memory, and a program or instructions stored on the memory and executable on the processor, wherein when the program or instructions are executed by the processor, the steps of the virtual object generation method according to any one of claims 1-5 or the steps of the codec training method according to any one of claims 6-7 are implemented.
  16. A readable storage medium storing a program or instructions, wherein when the program or instructions are executed by a processor, the steps of the virtual object generation method according to any one of claims 1-5 or the steps of the codec training method according to any one of claims 6-7 are implemented.
  17. A chip, comprising a processor and a communication interface, wherein the communication interface is coupled to the processor, and the processor is configured to run a program or instructions to implement the steps of the virtual object generation method according to any one of claims 1-5 or the steps of the codec training method according to any one of claims 6-7.
  18. A computer program product, wherein the computer program product is stored in a non-volatile storage medium, and when the computer program product is executed by at least one processor, the steps of the virtual object generation method according to any one of claims 1-5 or the steps of the codec training method according to any one of claims 6-7 are implemented.
PCT/CN2022/118712 2022-09-14 2022-09-14 虚拟对象生成方法、编解码器训练方法及其装置 WO2024055194A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/118712 WO2024055194A1 (zh) 2022-09-14 2022-09-14 虚拟对象生成方法、编解码器训练方法及其装置

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/118712 WO2024055194A1 (zh) 2022-09-14 2022-09-14 虚拟对象生成方法、编解码器训练方法及其装置

Publications (1)

Publication Number Publication Date
WO2024055194A1 true WO2024055194A1 (zh) 2024-03-21

Family

ID=90274068

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/118712 WO2024055194A1 (zh) 2022-09-14 2022-09-14 虚拟对象生成方法、编解码器训练方法及其装置

Country Status (1)

Country Link
WO (1) WO2024055194A1 (zh)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180181802A1 (en) * 2016-12-28 2018-06-28 Adobe Systems Incorporated Recognizing combinations of body shape, pose, and clothing in three-dimensional input images
CN111339870A (zh) * 2020-02-18 2020-06-26 东南大学 一种针对物体遮挡场景的人体形状和姿态估计方法
WO2021219835A1 (en) * 2020-04-30 2021-11-04 Siemens Aktiengesellschaft Pose estimation method and apparatus
CN112232221A (zh) * 2020-10-19 2021-01-15 戴姆勒股份公司 用于人物图像处理的方法、系统和程序载体
CN114937115A (zh) * 2021-07-29 2022-08-23 腾讯科技(深圳)有限公司 图像处理方法、人脸更换模型处理方法、装置和电子设备
CN114782661A (zh) * 2022-06-22 2022-07-22 阿里巴巴达摩院(杭州)科技有限公司 下半身姿态预测模型的训练方法及装置

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118097082A (zh) * 2024-04-26 2024-05-28 腾讯科技(深圳)有限公司 虚拟对象图像生成方法、装置、计算机设备和存储介质

Similar Documents

Publication Publication Date Title
WO2020140798A1 (zh) 手势识别方法、装置、电子设备及存储介质
EP3129871B1 (en) Generating a screenshot
JP6986187B2 (ja) 人物識別方法、装置、電子デバイス、記憶媒体、及びプログラム
US11853895B2 (en) Mirror loss neural networks
CN113420719A (zh) 生成动作捕捉数据的方法、装置、电子设备以及存储介质
US10748579B2 (en) Employing live camera feeds to edit facial expressions
US11562734B2 (en) Systems and methods for automatic speech recognition based on graphics processing units
WO2022142298A1 (zh) 关键点检测方法及装置、电子设备和存储介质
WO2022100690A1 (zh) 动物脸风格图像生成方法、模型训练方法、装置和设备
CN107277643A (zh) 弹幕内容的发送方法及客户端
WO2023202570A1 (zh) 图像处理方法和处理装置、电子设备和可读存储介质
WO2024055194A1 (zh) 虚拟对象生成方法、编解码器训练方法及其装置
JP2023538687A (ja) 仮想キーボードに基づくテキスト入力方法及び装置
CN112528978B (zh) 人脸关键点的检测方法、装置、电子设备及存储介质
WO2024067512A1 (zh) 视频密集预测方法及其装置
WO2023246715A1 (zh) 目标应用的网络连接控制方法、装置和电子设备
CN112714337A (zh) 视频处理方法、装置、电子设备和存储介质
WO2023093669A1 (zh) 视频拍摄方法、装置、电子设备及存储介质
CN115665361A (zh) 虚拟环境中的视频融合方法和在线视频会议通信方法
CN113542257A (zh) 视频处理方法、视频处理装置、电子设备和存储介质
US20160315988A1 (en) Method and apparatus for collaborative environment sharing
US11899846B2 (en) Customizable gesture commands
JP7513019B2 (ja) 画像処理装置および方法、並びに、プログラム
CN113658213B (zh) 形象呈现方法、相关装置及计算机程序产品
WO2024007135A1 (zh) 图像处理方法、装置、终端设备、电子设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22958388

Country of ref document: EP

Kind code of ref document: A1