CN116993906A - Method, apparatus, device and computer readable medium for controlling virtual object - Google Patents
Method, apparatus, device and computer readable medium for controlling virtual object
- Publication number
- CN116993906A (application CN202310847426.XA)
- Authority
- CN
- China
- Prior art keywords
- gesture
- parameter
- model
- hand object
- target image
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T17/00—Three dimensional [3D] modelling, e.g. data description of 3D objects
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T13/00—Animation > G06T13/20—3D [Three Dimensional] animation > G06T13/40—3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
- G—PHYSICS > G06—COMPUTING; CALCULATING OR COUNTING > G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL > G06T2200/00—Indexing scheme for image data processing or generation, in general > G06T2200/08—Indexing scheme for image data processing or generation, in general involving all processing steps from image acquisition to 3D model generation
Landscapes
- Physics & Mathematics (AREA)
- Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Computer Graphics (AREA)
- Geometry (AREA)
- Software Systems (AREA)
- User Interface Of Digital Computer (AREA)
Abstract
The application provides a method, an apparatus, a device and a computer readable medium for controlling a virtual object. The method acquires a target image, the target image containing at least a physical hand object; processes the target image with a gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and controls a virtual hand object corresponding to the physical hand object using at least one of the first gesture parameter and the second gesture parameter. In this way, the gesture parameters can be estimated more accurately, and the virtual object can be controlled with the obtained parameters, improving control quality.
Description
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and apparatus for controlling a virtual object, an electronic device, and a computer readable medium.
Background
In recent years, with the rapid development of computer software and hardware and growing demand for animation production, motion capture technology has gradually matured. Motion capture drives a virtual object in a virtual scene according to the motion of a user in the physical scene, offering the user a new kind of interactive experience in the virtual scene.
With the rise of technologies and applications such as virtual reality and the metaverse, demand for motion capture keeps increasing. Three-dimensional motion capture can greatly improve efficiency and allows the real world and the virtual world to blend more seamlessly. How to improve the quality of motion capture, and the quality of control over virtual objects, is therefore a pressing concern.
Disclosure of Invention
Aspects of the present application provide a method, an apparatus, a device and a computer-readable storage medium for controlling a virtual object, which estimate gesture parameters more accurately and control the virtual object with the obtained parameters, thereby improving control quality.
In one aspect of the present application, there is provided a method of controlling a virtual object, including: acquiring a target image, wherein the target image at least comprises a physical hand object; processing the target image by using a gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and controlling a virtual hand object corresponding to the physical hand object by using at least one of the first gesture parameter and the second gesture parameter.
In another aspect of the present application, there is provided an apparatus for controlling a virtual object, including: an acquisition module configured to acquire a target image, wherein the target image at least comprises a physical hand object; a generating module configured to process the target image by using a gesture estimation model and generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and a control module configured to control a virtual hand object corresponding to the physical hand object by using at least one of the first gesture parameter and the second gesture parameter.
In another aspect of the present application, there is provided an electronic apparatus including: at least one processor; and a memory communicatively coupled to the at least one processor; the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of controlling a virtual object described above.
In another aspect of the present application, there is provided a computer readable storage medium storing computer program instructions executable by a processor to implement the method of controlling a virtual object described above.
In the solution provided by the embodiments of the application, during gesture estimation for the physical hand object, the gesture estimation model generates parameterized coefficients and non-parameterized coefficients that satisfy the preset regularization constraint, so that the virtual hand object can be controlled with at least one of them. The gesture parameters can therefore be estimated more accurately, and the virtual object can be controlled with the obtained parameters, improving control quality.
Drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the embodiments or in the description of the prior art are briefly introduced below. The drawings described below show only some embodiments of the present application; a person skilled in the art can derive other drawings from them without inventive effort.
Other features, objects and advantages of the present application will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the accompanying drawings in which:
FIG. 1 is a schematic flowchart of a method of controlling a virtual object according to an embodiment of the present application;
FIG. 2 is a schematic diagram of a gesture estimation model according to an embodiment of the present application;
FIG. 3 is a schematic diagram of an apparatus for controlling a virtual object according to an embodiment of the present application;
FIG. 4 is a schematic diagram of the structure of a device suitable for implementing aspects of embodiments of the application. In the drawings, like or similar reference numerals refer to like or similar parts.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions of the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are obviously only some, not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the application without inventive effort fall within the scope of the application.
In one exemplary configuration of the application, the terminal and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or nonvolatile memory, such as Read-Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and information storage may be implemented by any method or technology. The information may be computer program instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, Phase-Change Memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read-Only Memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape storage or other magnetic storage devices, or any other non-transmission medium which can be used to store information that can be accessed by a computing device.
As described above, with the advent of technologies and applications such as virtual reality and the metaverse, demand for motion capture keeps increasing. How to improve the quality of motion capture and the quality of control over virtual objects is therefore a pressing concern.
In some approaches, motion capture may be achieved by optical motion capture, inertial motion capture, or hybrid motion capture.
Optical motion capture: a plurality of cameras track specific markers attached to an actor, and the motion trajectory of the character is then computed from the captured marker positions. This technique typically has to be carried out at a dedicated shooting site and requires specialized cameras and tracking software.
Inertial motion capture: sensors and gyroscopes worn by the actor capture the actor's movements, and the data are transmitted to a computer for processing. Compared with optical motion capture, inertial motion capture does not require a dedicated shooting site and can be used both indoors and outdoors.
Hybrid motion capture: a combination of optical and inertial motion capture. Sensors and specific markers are mounted on the actor at the same time, and the actor's movements are captured by a plurality of cameras and sensors to generate the animation of a virtual character.
However, these approaches usually require additional hardware, which makes the cost too high for ordinary users, restricts the usage environment, and may introduce performance bottlenecks, all of which degrade the user experience.
The embodiment of the application provides a method for controlling a virtual object, which comprises: acquiring a target image, wherein the target image at least comprises a physical hand object; processing the target image by using a gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and controlling a virtual hand object corresponding to the physical hand object by using at least one of the first gesture parameter and the second gesture parameter. In this way, the gesture parameters can be estimated more accurately, and the virtual object can be controlled with the obtained parameters, improving control quality.
In a practical scenario, the method may be executed by a user device, by a device formed by integrating a user device and a network device over a network, or by an application running on such a device. The user device includes, but is not limited to, terminal devices such as computers, mobile phones, tablet computers, smart watches and wristbands; the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a computer cluster based on cloud computing, and the network device may, for example, implement part of the processing functions (such as when setting an alarm clock). Here, the cloud consists of a large number of hosts or web servers based on cloud computing, a kind of distributed computing in which a group of loosely coupled computers forms one virtual computer.
The procedure of the method of controlling a virtual object provided by the present application is described in detail below. FIG. 1 shows a flow 100 of a method for controlling a virtual object according to an embodiment of the present application. The flow 100 includes at least the following processing steps:
step S101, a target image is acquired.
In an embodiment of the application, a target image that includes at least a physical hand object is acquired; for example, the target image may be captured by a user who wants to control a virtual object (e.g., a virtual hand object), such as an image captured by the camera of the terminal device the user is using.
In some embodiments, the method further comprises: determining a target position of the physical hand object in the target image; and normalizing the target image based on the target position to obtain a normalized image of the target image. Specifically, after the target image is acquired, normalization processing may be performed on it to obtain a normalized image, which reduces irrelevant information in the image so that the physical hand object can be identified more easily. In some embodiments, before the target image is normalized, the target position of the physical hand object in the target image may be determined, for example, by object detection and gesture detection, and the target image is then normalized based on the target position to improve the subsequent processing.
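As a concrete illustration of the detection-then-normalization step, the following sketch crops the image around a hand bounding box supplied by an upstream detector and rescales the pixel values. The function name, the padding ratio, the output size and the use of OpenCV are assumptions for illustration and are not specified by the patent.

```python
import numpy as np
import cv2  # assumed available for resizing; any image library would do


def normalize_hand_crop(image: np.ndarray, box: tuple, out_size: int = 224) -> np.ndarray:
    """Crop the target image around the detected hand position and normalize it.

    `box` is (x_min, y_min, x_max, y_max) from an upstream hand/object detector;
    the output is a fixed-size, zero-centered crop suitable for the gesture model.
    """
    x0, y0, x1, y1 = [int(v) for v in box]
    h, w = image.shape[:2]
    # Pad the box slightly so the whole hand stays inside the crop.
    pad = int(0.1 * max(x1 - x0, y1 - y0))
    x0, y0 = max(0, x0 - pad), max(0, y0 - pad)
    x1, y1 = min(w, x1 + pad), min(h, y1 + pad)

    crop = cv2.resize(image[y0:y1, x0:x1], (out_size, out_size))
    # Scale pixel values to [0, 1] and zero-center them.
    crop = crop.astype(np.float32) / 255.0
    return crop - crop.mean()
```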
Step S102, processing the target image by using the gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object.
In an embodiment of the present application, after the target image is obtained in step S101, it may be processed with the gesture estimation model to generate the first gesture parameter and the second gesture parameter of the physical hand object. The first gesture parameter comprises parameterized coefficients regressed against a preset hand model; for ease of understanding, the way the first gesture parameter is derived may also be called the parameterized method. In the parameterized method, the parameterized coefficients are regressed with respect to a parameterized representation of the preset hand model (e.g., MANO). Parameterized coefficients (also referred to as a parameterized model) provide morphological and kinematic prior constraints on the gesture, which greatly reduce the search space and make gesture pose estimation easier. To enhance the processing effect, the first gesture parameter may be regressed based on gesture keypoint coordinates (e.g., three-dimensional coordinates). In some embodiments, the first gesture parameter is the three-dimensional coordinates of the skeletal points of the physical hand object; the three-dimensional coordinates of the skeletal points can then be used to restore and predict the hand motion in three dimensions, and changes of the skeletal points can be analyzed from changes of the three-dimensional coordinates to restore the hand gesture continuously and dynamically. In some embodiments, for some special keypoints (such as skeletal points), recognition may be difficult due to occlusion, or processing capacity may be insufficient to handle the three-dimensional coordinates well; transforming the three-dimensional coordinates into two-dimensional coordinates can alleviate this and improve the processing effect, for example by verifying the three-dimensional coordinates against the two-dimensional coordinates. Further, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation; similarly, the way the second gesture parameter is derived may also be called the non-parameterized method. The non-parameterized method has a larger degree of freedom and a larger space that the algorithm can fit, and the non-parameterized coefficients can be obtained as the second gesture parameter by directly regressing the gesture's 3D mesh or keypoints. In some embodiments, for the non-parameterized method, the second gesture parameter may also be determined based on keypoint analysis; for example, the second gesture parameter is the three-dimensional and/or two-dimensional coordinates of the skeletal points of the physical hand object, so that the hand gesture can be restored continuously and dynamically.
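The two kinds of output can be illustrated with two minimal regression heads. This is a sketch under assumed dimensions (10 shape coefficients, 24×3 pose coefficients, 21 skeletal points, taken from the training description below), not an implementation prescribed by the patent.

```python
import torch
import torch.nn as nn

NUM_JOINTS = 21                    # assumed number of skeletal points (the description later uses 21)
SHAPE_DIM, POSE_DIM = 10, 24 * 3   # preset coefficient dimensions quoted in the description


class ParametricHead(nn.Module):
    """Parameterized method: regress coefficients of a preset hand model (a MANO-style
    parameterization); the hand model itself then turns them into a mesh/skeleton."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, SHAPE_DIM + POSE_DIM)

    def forward(self, feat: torch.Tensor):
        coeff = self.fc(feat)
        return coeff[:, :SHAPE_DIM], coeff[:, SHAPE_DIM:]   # shape betas, pose thetas


class NonParametricHead(nn.Module):
    """Non-parameterized method: regress the 3D skeletal-point coordinates directly,
    with no hand-model prior constraining the output."""

    def __init__(self, feat_dim: int):
        super().__init__()
        self.fc = nn.Linear(feat_dim, NUM_JOINTS * 3)

    def forward(self, feat: torch.Tensor):
        return self.fc(feat).view(-1, NUM_JOINTS, 3)
```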
Further, the gesture estimation model processes the acquired target image to generate the first gesture parameter and the second gesture parameter. In an embodiment of the present disclosure, the result of a regularization operation between the first and second gesture parameters finally generated by the model (e.g., an absolute error or a squared absolute error) satisfies the preset regularization constraint (e.g., it meets a preset threshold or falls within a preset numerical range).
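A minimal sketch of such a regularization operation, assuming both branches are expressed as skeletal-point coordinates; the threshold-based acceptance shown in the comment is one possible reading of "falls within a preset numerical range".

```python
import torch


def consistency_error(joints_param: torch.Tensor,
                      joints_direct: torch.Tensor,
                      squared: bool = False) -> torch.Tensor:
    """Regularization term between the two branches: mean absolute (or squared) error
    between joints recovered from the parameterized coefficients and the directly
    regressed joints. Both tensors are (batch, num_joints, 3)."""
    diff = joints_param - joints_direct
    err = diff.pow(2) if squared else diff.abs()
    return err.mean()


# At inference time, one might accept the outputs only if the branches agree, e.g.:
# ok = consistency_error(j_param, j_direct) < THRESHOLD   # THRESHOLD is an assumed value
```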
In some embodiments, the gesture estimation model is obtained by prior training. For example, in the initial gesture estimation model, shape parameters of a preset dimension (e.g., 10 dimensions) and pose parameters (e.g., 24×3, characterizing the three-dimensional rotation of 24 joints) may be configured for the parameterized model. A three-dimensional gesture model can then be obtained, and from it the three-dimensional coordinates of a predetermined number (e.g., 21) of skeletal points, i.e., the first gesture parameter. In addition, for the non-parameterized method, the second gesture parameter can be obtained by directly predicting the three-dimensional coordinates of the 21 skeletal points. Further, for the initial gesture estimation model, a regularization constraint can be applied between the three-dimensional skeletal point coordinates obtained from the parameterized model and those obtained by direct regression, so that the first and second gesture parameters output by the trained model satisfy the regularization constraint.
In some embodiments, during training of the initial gesture estimation model, a ground-truth supervision constraint and the like can be applied in addition to the regularization constraint, so as to improve the quality of the output first and second gesture parameters.
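A possible form of the overall training objective, reusing consistency_error from the sketch above; the use of an L1 supervision term and the weight `lam` are assumptions.

```python
import torch


def training_loss(joints_param: torch.Tensor, joints_direct: torch.Tensor,
                  gt_joints: torch.Tensor, lam: float = 0.1) -> torch.Tensor:
    """Illustrative training objective: ground-truth supervision on both branches plus
    the cross-branch regularization term (consistency_error from the earlier sketch).
    `lam` is an assumed weighting hyperparameter."""
    supervision = (joints_param - gt_joints).abs().mean() \
                + (joints_direct - gt_joints).abs().mean()
    regularization = consistency_error(joints_param, joints_direct)
    return supervision + lam * regularization
```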
In some embodiments, to let the gesture estimation model carry out the parameterized and non-parameterized methods separately, both the initial and the trained gesture estimation model may be structured around at least a first sub-model implementing the parameterized method and a second sub-model implementing the non-parameterized method. For example, a backbone model, such as a convolutional neural network (CNN), first processes the target image into image features, and the image features are then processed by the first sub-model and the second sub-model to obtain the first gesture parameter output by the first sub-model and the second gesture parameter output by the second sub-model. The two processing paths are thus relatively independent, which improves processing quality while reducing model complexity and construction difficulty.
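The split into a backbone and two sub-models might look as follows. This sketch reuses the ParametricHead and NonParametricHead modules from the earlier sketch; the tiny stand-in backbone and all layer sizes are illustrative only.

```python
import torch
import torch.nn as nn


class GestureEstimationModel(nn.Module):
    """Backbone plus the two sub-models described above; any CNN producing an
    image-feature vector could replace the stand-in backbone."""

    def __init__(self, feat_dim: int = 512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim), nn.ReLU(),
        )
        self.first_sub_model = ParametricHead(feat_dim)      # parameterized method
        self.second_sub_model = NonParametricHead(feat_dim)  # non-parameterized method

    def forward(self, image: torch.Tensor):
        feat = self.backbone(image)                 # image features
        betas, thetas = self.first_sub_model(feat)  # first gesture parameter (coefficients)
        joints_3d = self.second_sub_model(feat)     # second gesture parameter (3D skeletal points)
        return (betas, thetas), joints_3d


# Example usage on a dummy batch:
# model = GestureEstimationModel()
# (betas, thetas), joints_3d = model(torch.randn(2, 3, 224, 224))
```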
In this regard, reference may be made to FIG. 2, which is a schematic diagram of an architecture 200 of a gesture estimation model according to an embodiment of the present application. The architecture 200 includes at least a first sub-model 240 and a second sub-model 230, both of which may, in some embodiments, also be implemented with CNNs. The target image 210 is processed by a backbone network (e.g., convolutional neural network 220) to generate image features 221.
In some embodiments, the first sub-model 240 processes the image features into the first gesture parameter (e.g., three-dimensional coordinates 243 of the skeletal points) as follows. The image features 221 are input to a shape estimation module (e.g., shape estimation module 241) of the first sub-model 240 to generate a shape estimation result of the physical hand object, and to a pose estimation module (e.g., pose estimation module 242) of the first sub-model 240 to generate a pose estimation result of the physical hand object; the first gesture parameter (e.g., three-dimensional coordinates 243 of the skeletal points) is then generated based on the shape estimation result and the pose estimation result. For example, after the shape of the hand object is determined from the shape estimation result, the spatial position (i.e., three-dimensional coordinates) of each skeletal point can be determined from the pose given by the pose estimation result, yielding the first gesture parameter. In some embodiments, the first sub-model 240 may also convert the three-dimensional coordinates 243 into two-dimensional coordinates 244 when two-dimensional coordinates are needed. Based on this architecture, the target image 210 can thus be processed by at least the first sub-model 240 to generate the first gesture parameter (e.g., three-dimensional coordinates 243).
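One simple way to obtain the two-dimensional coordinates 244 from the three-dimensional coordinates 243 is a camera projection; the weak-perspective model and the (scale, trans) parameters below are assumptions, since the patent does not specify how the conversion is performed.

```python
import torch


def project_to_2d(joints_3d: torch.Tensor, scale: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    """Weak-perspective projection of 3D skeletal points (batch, J, 3) into image-plane
    2D coordinates (batch, J, 2). `scale` has shape (batch,) and `trans` shape (batch, 2);
    both the camera model and these parameters are assumptions for illustration."""
    xy = joints_3d[..., :2]                                   # drop the depth component
    return scale.view(-1, 1, 1) * xy + trans.view(-1, 1, 2)
```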
Further, the second sub-model 230 may process the image features into the second gesture parameter as follows: the image features are input to a heatmap prediction module of the second sub-model to generate predicted second gesture parameters of the physical hand object, and the predicted second gesture parameters are then processed by an offset prediction module of the second sub-model to generate the second gesture parameter. Specifically, referring to FIG. 2, the image features 221 are processed by the heatmap prediction module 231 to generate predicted second gesture parameters of the physical hand object; for example, the heatmap prediction module 231 may process the image features 221 with a pre-configured heatmap to predict the three-dimensional coordinates of the keypoints contained in the image features 221. The offset prediction module 232 then predicts the offset between the predicted three-dimensional coordinates and the true three-dimensional coordinates, corrects the predicted coordinates accordingly, and outputs the three-dimensional coordinates 233 as the second gesture parameter. Similarly, two-dimensional coordinates 234 can be derived from the three-dimensional coordinates 233 when needed. In the second sub-model 230, the offset prediction module 232 thus corrects the result of the heatmap prediction module 231. Since a high-resolution heatmap is computationally demanding, the heatmap prediction module 231 may, for efficiency, predict the three-dimensional coordinates of the keypoints (e.g., skeletal points) with a low-resolution heatmap (e.g., 8×8). In this way arbitrary continuous spatial coordinates can still be expressed, the advantages of the heatmap representation and of direct regression are combined, and the efficiency and quality of generating the second gesture parameter are further improved.
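A sketch of a heatmap-plus-offset head in the spirit of modules 231 and 232, assuming the second sub-model receives a spatial feature map rather than a flattened vector; the 8×8 grid matches the low resolution mentioned above, while the channel counts, the soft-argmax readout and the pooling are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class HeatmapOffsetHead(nn.Module):
    """A coarse heatmap locates each skeletal point on a low-resolution grid
    (cf. module 231), and an offset branch refines that coarse estimate and adds
    a depth value (cf. module 232)."""

    def __init__(self, feat_channels: int = 64, num_joints: int = 21, grid: int = 8):
        super().__init__()
        self.grid, self.num_joints = grid, num_joints
        self.heat = nn.Conv2d(feat_channels, num_joints, kernel_size=1)
        self.offset = nn.Conv2d(feat_channels, num_joints * 3, kernel_size=1)

    def forward(self, feat_map: torch.Tensor) -> torch.Tensor:
        # Force the spatial feature map onto the low-resolution grid (e.g., 8x8).
        feat_map = F.adaptive_avg_pool2d(feat_map, self.grid)
        B = feat_map.shape[0]

        heat = self.heat(feat_map).view(B, self.num_joints, -1)
        prob = F.softmax(heat, dim=-1).view(B, self.num_joints, self.grid, self.grid)

        # Soft-argmax: expected (x, y) location on the coarse grid, in [0, 1].
        coords = torch.linspace(0.0, 1.0, self.grid, device=feat_map.device)
        y_coarse = prob.sum(dim=3) @ coords     # marginal over rows    -> expected y
        x_coarse = prob.sum(dim=2) @ coords     # marginal over columns -> expected x

        # Offset branch corrects the coarse estimate and supplies a depth value.
        off = self.offset(feat_map).view(B, self.num_joints, 3, -1).mean(dim=-1)
        x = x_coarse + off[..., 0]
        y = y_coarse + off[..., 1]
        z = off[..., 2]
        return torch.stack([x, y, z], dim=-1)   # (B, num_joints, 3): second gesture parameter
```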
In addition, when an initial gesture estimation model is built on the above architecture and trained (for example, with sample images as input and labels such as the three-dimensional or two-dimensional coordinates of the skeletal points as supervision), the regularization constraint described above may be applied at least to, for example, the three-dimensional coordinates 233 and the three-dimensional coordinates 243. Whether training of the initial gesture estimation model is complete can then be judged by whether the result of the regularization operation on them (for example, the absolute error or squared absolute error between the three-dimensional coordinates 233 and the three-dimensional coordinates 243) satisfies the regularization constraint.
The gesture estimation model obtained by training can thus output the first gesture parameter and the second gesture parameter for the input target image. Typically the two are nearly identical (or their difference falls within a desired threshold range). In this way the parameterized gesture-model branch maintains the morphological constraints of the hand, while the flexibility of the direct three-dimensional prediction branch improves keypoint accuracy. The numerical quality of the first and second gesture parameters is therefore raised at the same time, and at least one of them can be selected according to actual requirements.
In addition, data with three-dimensional labels collected in a motion capture studio (about 500,000 frames) and data annotated with only two-dimensional keypoints (about 200,000 frames) can both be used to train the initial gesture estimation model, improving the training effect.
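Mixing 3D-labelled and 2D-only frames can be handled by supervising the latter only after projection to the image plane. The sketch below reuses project_to_2d from the earlier sketch and assumes a per-frame 0/1 mask indicating whether a 3D label exists; the loss form is an assumption, not taken from the patent.

```python
import torch


def mixed_supervision_loss(pred_3d: torch.Tensor, gt_3d: torch.Tensor,
                           has_3d: torch.Tensor, gt_2d: torch.Tensor,
                           scale: torch.Tensor, trans: torch.Tensor) -> torch.Tensor:
    """3D-labelled frames contribute a 3D term (masked by `has_3d`); every frame
    contributes a 2D reprojection term computed with the weak-perspective sketch."""
    per_frame_3d = (pred_3d - gt_3d).abs().mean(dim=(1, 2))           # (batch,)
    loss_3d = (per_frame_3d * has_3d).sum() / has_3d.sum().clamp(min=1)

    pred_2d = project_to_2d(pred_3d, scale, trans)                    # from the earlier sketch
    loss_2d = (pred_2d - gt_2d).abs().mean()
    return loss_3d + loss_2d
```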
Step S103, controlling the virtual hand object corresponding to the physical hand object by using at least one of the first gesture parameter and the second gesture parameter.
In the embodiment of the present application, once the first and second gesture parameters of the physical hand object have been generated by processing the target image with the gesture estimation model, the virtual hand object can be controlled with the first gesture parameter and/or the second gesture parameter. In general, when there is no explicit requirement or constraint, either parameter may be chosen to control the virtual hand object. In scenes with an explicit requirement, for example when the virtual hand object is driven by a parameterized gesture model, the first gesture parameter may be used; its quality is still significantly higher than that of parameters generated by parameterization alone, because it was constrained by the second gesture parameter when it was generated. It thus becomes possible to select at least one of the first and second gesture parameters according to actual requirements, and the gesture prediction of the parameterized model and that of the neural network are combined and mutually supervise and constrain each other. The first and second gesture parameters obtained in this way control the virtual hand object (for example, a virtual cartoon object) better and restore the gesture more faithfully. By combining the parameterized and non-parameterized methods, the accuracy of gesture pose estimation is improved while the morphological and kinematic constraints of the gesture are respected, and the advantages of both methods are exploited, greatly improving the estimation effect.
In summary, the method for controlling a virtual object acquires a target image containing at least a physical hand object; processes the target image with a gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the two satisfy a preset regularization constraint; and controls the virtual hand object corresponding to the physical hand object using at least one of them. The gesture parameters can therefore be estimated more accurately, and the virtual object can be controlled with the obtained parameters, improving control quality.
The embodiment of the application also provides an apparatus for controlling a virtual object, whose structure is shown as apparatus 300 in FIG. 3. The apparatus 300 includes: an acquisition module 310 configured to acquire a target image, the target image containing at least a physical hand object; a generating module 320 configured to process the target image with the gesture estimation model and generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first and second gesture parameters satisfy a preset regularization constraint; and a control module 330 configured to control the virtual hand object corresponding to the physical hand object using at least one of the first gesture parameter and the second gesture parameter.
In some embodiments, the apparatus 300 further comprises a normalization module configured to determine a target position of the physical hand object in the target image and to normalize the target image based on the target position, obtaining a normalized image of the target image; processing the target image with the gesture estimation model then comprises processing the normalized image with the gesture estimation model.
In some embodiments, the gesture estimation model includes a backbone model for processing the target image into image features, a first sub-model for processing the image features into the first gesture parameter, and a second sub-model for processing the image features into the second gesture parameter.
In some embodiments, the first sub-model processes the image features into the first gesture parameter by: inputting the image features to a shape estimation module to generate a shape estimation result of the physical hand object, and inputting the image features to a pose estimation module to generate a pose estimation result of the physical hand object; and generating the first gesture parameter based on the shape estimation result and the pose estimation result.
In some embodiments, the second sub-model processes the image features into the second gesture parameter by: inputting the image features to a heatmap prediction module to generate predicted second gesture parameters of the physical hand object; and processing the predicted second gesture parameters with an offset prediction module to generate the second gesture parameter.
In some embodiments, the first gesture parameter is a three-dimensional coordinate and/or a two-dimensional coordinate of a skeletal point of the physical hand object.
In some embodiments, the second pose parameter is a three-dimensional coordinate and/or a two-dimensional coordinate of a skeletal point of the physical hand object.
In addition, based on the same inventive concept, an embodiment of the present application provides an electronic device corresponding to the method for controlling a virtual object in the foregoing embodiments; its principle of solving the problem is similar to that of the method. The electronic device comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods and/or aspects of the various embodiments of the application described above.
The electronic device may be a user device, a device formed by integrating a user device and a network device over a network, or an application running on such a device. The user device includes, but is not limited to, terminal devices such as computers, mobile phones, tablet computers, smart watches and wristbands; the network device includes, but is not limited to, a network host, a single network server, a set of multiple network servers, or a computer cluster based on cloud computing, and the network device may, for example, implement part of the processing functions (such as when setting an alarm clock). Here, the cloud consists of a large number of hosts or web servers based on cloud computing, a kind of distributed computing in which a group of loosely coupled computers forms one virtual computer.
FIG. 4 shows the structure of a device 400 suitable for implementing the methods and/or technical solutions in embodiments of the present application. The device 400 comprises a Central Processing Unit (CPU) 401, which can perform various suitable actions and processes according to a program stored in a Read-Only Memory (ROM) 402 or a program loaded from a storage section 408 into a Random Access Memory (RAM) 403. The RAM 403 also stores the various programs and data required for system operation. The CPU 401, ROM 402 and RAM 403 are connected to each other by a bus 404, to which an Input/Output (I/O) interface 405 is also connected.
The following components are connected to the I/O interface 405: an input section 406 including a keyboard, a mouse, a touch panel, a microphone, an infrared sensor and the like; an output section 407 including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), an LED display or an OLED display, and a speaker; a storage section 408 comprising one or more computer-readable media such as a hard disk, an optical disk, a magnetic disk or semiconductor memory; and a communication section 409 including a network interface card such as a LAN (Local Area Network) card or a modem. The communication section 409 performs communication processing via a network such as the Internet.
In particular, the methods and/or embodiments of the present application may be implemented as computer software programs. For example, embodiments of the present disclosure include a computer program product comprising a computer program embodied on a computer readable medium, the computer program comprising program code for performing the method shown in the flow chart. The above-described functions defined in the method of the present application are performed when the computer program is executed by a Central Processing Unit (CPU) 401.
Another embodiment of the present application also provides a computer readable storage medium having stored thereon computer program instructions executable by a processor to implement the method and/or the technical solution of any one or more of the embodiments of the present application described above.
In particular, the present embodiments may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium can be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or a combination of any of the foregoing. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.
The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Program code embodied on a computer readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
Computer program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, smalltalk, C ++ and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer or entirely on the remote computer or server. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).
The flowchart or block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of devices, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the present application, it should be understood that the disclosed systems, devices and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative; the division into units is only a logical functional division, and there may be other divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be electrical, mechanical or of other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in hardware plus software functional units.
The integrated units implemented in the form of software functional units described above may be stored in a computer readable storage medium. The software functional units described above are stored in a storage medium and include instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to perform part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present application, and are not limiting; although the application has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude a plurality. A plurality of units or means recited in the apparatus claims can also be implemented by means of one unit or means in software or hardware. The terms first, second, etc. are used to denote a name, but not any particular order.
Claims (10)
1. A method of controlling a virtual object, comprising:
acquiring a target image, wherein the target image at least comprises a physical hand object;
processing the target image by using a gesture estimation model to generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and
controlling a virtual hand object corresponding to the physical hand object by using at least one of the first gesture parameter and the second gesture parameter.
2. The method of claim 1, further comprising:
determining a target position of the physical hand object in the target image;
normalizing the target image based on the target position to obtain a normalized image of the target image; and wherein processing the target image by using the gesture estimation model comprises:
processing the normalized image by using the gesture estimation model.
3. The method of claim 1, wherein the gesture estimation model comprises a backbone model for processing the target image into image features, a first sub-model for processing the image features into the first gesture parameter, and a second sub-model for processing the image features into the second gesture parameter.
4. The method according to claim 3, wherein the first sub-model processes the image features into the first gesture parameter by:
inputting the image features to a shape estimation module of the first sub-model to generate a shape estimation result of the physical hand object, and
inputting the image features to a gesture estimation module of the first sub-model to generate a gesture estimation result of the physical hand object;
and generating the first gesture parameter based on the shape estimation result and the gesture estimation result.
5. The method according to claim 3, wherein the second sub-model processes the image features into the second gesture parameter by:
inputting the image features to a heatmap prediction module of the second sub-model to generate predicted second gesture parameters of the physical hand object; and
processing the predicted second gesture parameters by using an offset prediction module of the second sub-model to generate the second gesture parameter.
6. The method according to any of claims 1-5, wherein the first gesture parameter is a three-dimensional coordinate and/or a two-dimensional coordinate of a skeletal point of the physical hand object.
7. The method of any of claims 1-5, wherein the second gesture parameter is a three-dimensional coordinate and/or a two-dimensional coordinate of a skeletal point of the physical hand object.
8. An apparatus for controlling a virtual object, the apparatus comprising:
the acquisition module is configured to acquire a target image, wherein the target image at least comprises a physical hand object;
a generating module configured to process the target image by using a gesture estimation model and generate a first gesture parameter and a second gesture parameter of the physical hand object, wherein the first gesture parameter comprises parameterized coefficients regressed against a preset hand model, the second gesture parameter comprises non-parameterized coefficients regressed based on three-dimensional estimation, and the first gesture parameter and the second gesture parameter satisfy a preset regularization constraint; and
a control module configured to control a virtual hand object corresponding to the physical hand object using at least one of the first gesture parameter and the second gesture parameter.
9. An electronic device, the electronic device comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1 to 7.
10. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310847426.XA CN116993906A (en) | 2023-07-11 | 2023-07-11 | Method, apparatus, device and computer readable medium for controlling virtual object |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310847426.XA CN116993906A (en) | 2023-07-11 | 2023-07-11 | Method, apparatus, device and computer readable medium for controlling virtual object |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116993906A true CN116993906A (en) | 2023-11-03 |
Family
ID=88533084
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310847426.XA Pending CN116993906A (en) | 2023-07-11 | 2023-07-11 | Method, apparatus, device and computer readable medium for controlling virtual object |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116993906A (en) |
- 2023-07-11: CN CN202310847426.XA patent/CN116993906A/en active Pending
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |