CN116156141A - Volume video playing method and device, electronic equipment and storage medium - Google Patents

Volume video playing method and device, electronic equipment and storage medium

Info

Publication number
CN116156141A
CN116156141A
Authority
CN
China
Prior art keywords
limb
virtual object
determining
virtual
video
Prior art date
Legal status
Pending
Application number
CN202211610309.3A
Other languages
Chinese (zh)
Inventor
张煜
魏汉青
孙伟
邵志兢
Current Assignee
Zhuhai Prometheus Vision Technology Co ltd
Original Assignee
Zhuhai Prometheus Vision Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Prometheus Vision Technology Co ltd
Priority to CN202211610309.3A
Publication of CN116156141A
Legal status: Pending

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04N PICTORIAL COMMUNICATION, e.g. TELEVISION
    • H04N 13/00 Stereoscopic video systems; Multi-view video systems; Details thereof
    • H04N 13/10 Processing, recording or transmission of stereoscopic or multi-view image signals
    • H04N 13/106 Processing image signals
    • H04N 13/111 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation
    • H04N 13/117 Transformation of image signals corresponding to virtual viewpoints, e.g. spatial image interpolation, the virtual viewpoint locations being selected by the viewers or determined by viewer tracking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • General Health & Medical Sciences (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The embodiment of the application discloses a volumetric video playing method and apparatus, an electronic device, and a storage medium. The method comprises the following steps: the electronic device identifies a limb action of a virtual object in the volumetric video; determines at least one feature point on the limb corresponding to the virtual object according to the limb action of the virtual object; determines a corresponding associated region near the limb according to the at least one feature point, and sets at least one virtual element in each associated region; and, when the volumetric video is played, presents the limb action of the virtual object and presents the at least one virtual element in the associated region. Because virtual elements are added in the region corresponding to the virtual object when the volumetric video is played, playback of the volumetric video becomes more engaging.

Description

Volume video playing method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of computer technologies, and in particular, to a method and apparatus for playing a volumetric video, an electronic device, and a storage medium.
Background
In the prior art, the virtual character in a volumetric video is fixed: it is simply the person recorded during capture. For example, if the recorded person is a dancer, only the virtual image of that dancer can be displayed when the volumetric video is played.
The virtual character presented by a prior-art volumetric video is therefore monotonous, and playback lacks appeal.
Disclosure of Invention
The embodiment of the application provides a volumetric video playing method and apparatus, an electronic device, and a storage medium. In the volumetric video playing method, virtual elements are added to the region corresponding to the virtual object, which makes playback of the volumetric video more engaging.
In a first aspect, an embodiment of the present application provides a method for playing a volumetric video, including:
identifying limb actions of a virtual object in the volumetric video;
determining at least one characteristic point on a limb corresponding to the virtual object according to the limb action of the virtual object;
determining corresponding associated areas near the limbs according to at least one characteristic point, and setting at least one virtual element in each associated area;
when the volumetric video is played, limb movements of the virtual object are presented and at least one virtual element is presented in the associated region.
In a second aspect, embodiments of the present application provide a volumetric video playback device, including:
the identifying module is used for identifying limb actions of the virtual object in the volume video;
the determining module is used for determining at least one characteristic point on the limb corresponding to the virtual object according to the limb action of the virtual object;
The setting module is used for determining corresponding associated areas near the limbs according to at least one characteristic point and setting at least one virtual element in each associated area;
and the display module is used for displaying limb actions of the virtual object and at least one virtual element in the associated area when the volume video is played.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory storing executable program code, a processor coupled to the memory; the processor invokes the executable program code stored in the memory to perform the steps in the method for playing a volumetric video provided in the embodiments of the present application.
In a fourth aspect, embodiments of the present application provide a storage medium storing a plurality of instructions adapted to be loaded by a processor to perform steps in a method for playing a volumetric video provided by embodiments of the present application.
In the embodiment of the application, the electronic device identifies a limb action of a virtual object in the volumetric video; determines at least one feature point on the limb corresponding to the virtual object according to the limb action of the virtual object; determines a corresponding associated region near the limb according to the at least one feature point, and sets at least one virtual element in each associated region; and, when the volumetric video is played, presents the limb action of the virtual object and presents the at least one virtual element in the associated region. Because virtual elements are added in the region corresponding to the virtual object when the volumetric video is played, playback of the volumetric video becomes more engaging.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the description of the embodiments will be briefly introduced below, it being obvious that the drawings in the following description are only some embodiments of the present application, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic view of a shooting system according to an embodiment of the present application.
Fig. 2 is a first schematic flow chart of the volumetric video playing method according to an embodiment of the present application.
Fig. 3 is a second schematic flow chart of the volumetric video playing method according to an embodiment of the present application.
Fig. 4 is a schematic view of a scenario of a volumetric video playing method according to an embodiment of the present application.
Fig. 5 is a schematic structural diagram of a volumetric video playing device according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application will be made clearly and fully with reference to the accompanying drawings, in which it is evident that the embodiments described are only some, but not all, of the embodiments of the present application. All other embodiments, which can be made by those skilled in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
In the prior art, the virtual character in a volumetric video is fixed: it is simply the person recorded during capture. For example, if the recorded person is a dancer, only the virtual image of that dancer can be displayed when the volumetric video is played.
The virtual character presented by a prior-art volumetric video is therefore monotonous, and playback lacks appeal.
In order to solve this technical problem, the embodiment of the application provides a volumetric video playing method and apparatus, an electronic device, and a storage medium. In the volumetric video playing method, virtual elements are added to the region corresponding to the virtual object, which makes playback of the volumetric video more engaging.
A volumetric video (also called spatial video, volumetric three-dimensional video, or 6-degree-of-freedom video, etc.) is a technique that generates a sequence of three-dimensional models by capturing information in three-dimensional space (such as depth information and color information). Compared with traditional video, volumetric video adds the concept of space to video: the real three-dimensional world is restored with three-dimensional models, rather than simulated with a two-dimensional planar video plus camera movement. Because a volumetric video is a sequence of three-dimensional models, the user can watch it from any viewing angle according to personal preference, giving higher fidelity and a stronger sense of immersion than a two-dimensional planar video.
In some embodiments, the volumetric video is produced by the computer device after being captured by a professional capture system.
Referring to fig. 1, fig. 1 is a schematic view of a shooting system according to an embodiment of the present application.
As shown in fig. 1, the photographing system includes an electronic device, a signal source, a camera array and a microphone, wherein the camera array includes a plurality of cameras, each of the cameras is located at a different position, the signal source is connected with each of the cameras in the camera array, the electronic device is connected with the signal source, and the electronic device is connected with the camera array. The electronic device may be a computer, a server, or other electronic device with a certain computing capability.
When the cameras in the camera array need to photograph a subject placed within the array, the electronic device can control the signal source to send a pulse control signal to each camera simultaneously, and each camera photographs the subject after receiving the pulse control signal.
In some embodiments, the camera array includes a plurality of mounting positions, a plurality of camera modules may be arranged at each position, and a plurality of cameras may be disposed in each camera module. For example, in the space perpendicular to the ground, different camera modules are arranged at different heights, and each camera module may include a color camera for capturing color images and may also include a depth camera. The image captured by one camera module may therefore include a color image and a depth image.
After the camera array finishes shooting the shooting object, the electronic equipment can receive shooting images sent by each camera in the camera array and time corresponding to the shooting images, and then the electronic equipment performs subsequent image processing according to the received shooting images and time corresponding to the shooting images.
In the process of photographing the subject, the electronic device may also start recording the sound emitted by the subject, for example through the microphone shown in fig. 1. The microphone may be disposed above the area surrounded by the camera array, or may be disposed on the subject, so as to capture the sound.
In some implementations, after the electronic device receives the captured images, they can be used as the image information for subsequently generating the volumetric video. After the electronic device receives the audio, it can be used as the corresponding sound information in the subsequent volumetric video.
After obtaining the image information and the sound information of the photographic subject, optionally, in the present application, a three-dimensional model for constituting the volumetric video may be reconstructed as follows:
firstly, color images and depth images of different visual angles of a shooting object and camera parameters corresponding to the color images are acquired; and training a neural network model implicitly expressing a three-dimensional model of the shooting object according to the acquired color image and the corresponding depth image and camera parameters, and extracting an isosurface based on the trained neural network model to realize three-dimensional reconstruction of the shooting object so as to obtain the three-dimensional model of the shooting object.
It should be noted that, in the embodiments of the present application, the neural network model of which architecture is adopted is not particularly limited, and may be selected by those skilled in the art according to actual needs. For example, a multi-layer perceptron (Multilayer Perceptron, MLP) without a normalization layer may be selected as a base model for model training.
The three-dimensional model reconstruction method provided in the present application will be described in detail below.
Firstly, a plurality of color cameras and depth cameras can be used to synchronously photograph, from multiple viewing angles, the target object to be three-dimensionally reconstructed (the target object is the photographed subject), so as to obtain color images and corresponding depth images of the target object at multiple different viewing angles. That is, at the same shooting moment (shooting moments whose actual difference is smaller than or equal to a time threshold are considered the same), the color camera at each viewing angle captures a color image of the target object at that viewing angle, and correspondingly, the depth camera at each viewing angle captures a depth image of the target object at that viewing angle. The target object may be any object, including but not limited to living objects such as a person, an animal, or a plant, or inanimate objects such as a machine, furniture, or a doll.
Therefore, the color images of the target object at different visual angles are provided with the corresponding depth images, namely, when shooting, the color cameras and the depth cameras can adopt the configuration of a camera set, and the color cameras at the same visual angle are matched with the depth cameras to synchronously shoot the same target object. For example, a studio may be built, in which a central area is a photographing area, around which a plurality of sets of color cameras and depth cameras are paired at a certain angle interval in a horizontal direction and a vertical direction. When the target object is in the shooting area surrounded by the color cameras and the depth cameras, the color images and the corresponding depth images of the target object at different visual angles can be obtained through shooting by the color cameras and the depth cameras.
In addition, camera parameters of the color camera corresponding to each color image are further acquired. The camera parameters include internal parameters and external parameters of the color camera, which can be determined through calibration, wherein the internal parameters of the color camera are parameters related to the characteristics of the color camera, including but not limited to data such as focal length and pixels of the color camera, and the external parameters of the color camera are parameters of the color camera in a world coordinate system, including but not limited to data such as position (coordinates) of the color camera and rotation direction of the camera.
As described above, after obtaining the color images of the target object at different viewing angles and the corresponding depth images thereof at the same shooting time, the three-dimensional reconstruction of the target object can be performed according to the color images and the corresponding depth images thereof. Different from the mode of converting depth information into point cloud to perform three-dimensional reconstruction in the related technology, the method and the device train a neural network model to achieve implicit expression of the three-dimensional model of the target object, so that three-dimensional reconstruction of the target object is achieved based on the neural network model.
Optionally, the application selects a multi-layer perceptron (Multilayer Perceptron, MLP) that does not include a normalization layer as the base model, and trains as follows:
Converting pixel points in each color image into rays based on corresponding camera parameters;
sampling a plurality of sampling points on the rays, and determining first coordinate information of each sampling point and an SDF value of each sampling point from a pixel point;
inputting the first coordinate information of the sampling points into a basic model to obtain a predicted SDF value and a predicted RGB color value of each sampling point output by the basic model;
based on a first difference between the predicted SDF value and the SDF value and a second difference between the predicted RGB color value and the RGB color value of the pixel point, adjusting parameters of the basic model until a preset stop condition is met;
and taking the basic model meeting the preset stopping condition as a neural network model of the three-dimensional model of the implicitly expressed target object.
Firstly, a pixel point in a color image is converted into a ray based on the camera parameters corresponding to the color image, where the ray may be a ray passing through the pixel point and perpendicular to the color image plane. Then, a plurality of sampling points are sampled on the ray. The sampling may be performed in two steps: some sampling points are first sampled uniformly, and further sampling points are then taken at key positions based on the depth value of the pixel point, which ensures that as many sampling points as possible lie near the surface of the model. Then, the first coordinate information of each sampling point in the world coordinate system and the signed distance field (Signed Distance Field, SDF) value of each sampling point are calculated according to the camera parameters and the depth value of the pixel point. The SDF value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera; the difference is signed, so that when it is positive the sampling point is outside the three-dimensional model, when it is negative the sampling point is inside the three-dimensional model, and when it is zero the sampling point is on the surface of the three-dimensional model. Then, after sampling is completed and the SDF value of each sampling point has been calculated, the first coordinate information of the sampling points in the world coordinate system is input into the basic model (the basic model is configured to map the input coordinate information to an SDF value and an RGB color value and output them); the SDF value output by the basic model is recorded as the predicted SDF value, and the RGB color value output by the basic model is recorded as the predicted RGB color value. Finally, the parameters of the basic model are adjusted based on a first difference between the predicted SDF value and the SDF value corresponding to the sampling point and a second difference between the predicted RGB color value and the RGB color value of the pixel point corresponding to the sampling point.
In addition, for other pixel points in the color image, sampling is performed in the above manner, and then coordinate information of the sampling point in the world coordinate system is input to the basic model to obtain a corresponding predicted SDF value and a predicted RGB color value, which are used for adjusting parameters of the basic model until a preset stopping condition is met, for example, the preset stopping condition may be configured to reach a preset number of iterations of the basic model, or the preset stopping condition may be configured to converge the basic model. When the iteration of the basic model meets the preset stopping condition, the neural network model which can accurately and implicitly express the three-dimensional model of the shooting object is obtained. Finally, an isosurface extraction algorithm can be adopted to extract the three-dimensional model surface of the neural network model, so that a three-dimensional model of the shooting object is obtained.
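To make the training loop described above more concrete, the following Python sketch (using PyTorch) illustrates one possible implementation. It is a minimal illustration, not the implementation of the application: the multilayer perceptron size, the loss form, and the helpers data_loader and sample_points_on_ray (standing in for the per-pixel ray construction and sampling described above) are all assumptions made for the example:

    import torch
    import torch.nn as nn

    class ImplicitMLP(nn.Module):
        # MLP without normalization layers: maps a 3-D world point to an SDF value and an RGB color.
        def __init__(self, hidden=256, layers=8):
            super().__init__()
            dims = [3] + [hidden] * layers
            blocks = []
            for a, b in zip(dims[:-1], dims[1:]):
                blocks += [nn.Linear(a, b), nn.ReLU()]
            self.body = nn.Sequential(*blocks)
            self.head = nn.Linear(hidden, 4)          # 1 SDF value + 3 RGB values

        def forward(self, points):
            out = self.head(self.body(points))
            return out[..., :1], torch.sigmoid(out[..., 1:])   # predicted SDF, predicted RGB

    model = ImplicitMLP()
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

    # data_loader and sample_points_on_ray are hypothetical helpers.
    for pixel_rgb, pixel_depth, camera in data_loader:
        points, gt_sdf = sample_points_on_ray(pixel_depth, camera)
        pred_sdf, pred_rgb = model(points)
        first_diff = (pred_sdf.squeeze(-1) - gt_sdf).abs().mean()   # predicted SDF vs. SDF
        second_diff = (pred_rgb - pixel_rgb).abs().mean()           # predicted RGB vs. pixel RGB
        loss = first_diff + second_diff
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()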
Optionally, in some embodiments, determining an imaging plane of the color image based on camera parameters; and determining that the rays passing through the pixel points in the color image and perpendicular to the imaging surface are rays corresponding to the pixel points.
The coordinate information of the color image in the world coordinate system, namely the imaging surface, can be determined according to the camera parameters of the color camera corresponding to the color image. Then, it can be determined that the ray passing through the pixel point in the color image and perpendicular to the imaging plane is the ray corresponding to the pixel point.
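As an illustration of the pixel-to-ray conversion described above, the following Python sketch back-projects a pixel through a standard pinhole camera model; the intrinsic matrix K and the extrinsic rotation R and translation t correspond to the internal and external camera parameters mentioned earlier. This is a generic formulation assumed for the example, not taken from the application:

    import numpy as np

    def pixel_to_ray(u, v, K, R, t):
        # K: 3x3 intrinsic matrix; R, t: world-to-camera rotation and translation.
        # Returns the ray origin (camera center in world coordinates) and a unit
        # direction passing through pixel (u, v).
        direction_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
        origin = -R.T @ t                      # camera center expressed in the world frame
        direction = R.T @ direction_cam        # rotate the direction into the world frame
        return origin, direction / np.linalg.norm(direction)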
Optionally, in some embodiments, determining second coordinate information and rotation angle of the color camera in the world coordinate system according to the camera parameters; and determining an imaging surface of the color image according to the second coordinate information and the rotation angle.
Optionally, in some embodiments, the first number of first sampling points are equally spaced on the ray; determining a plurality of key sampling points according to the depth values of the pixel points, and sampling a second number of second sampling points according to the key sampling points; the first number of first sampling points and the second number of second sampling points are determined as a plurality of sampling points obtained by sampling on the rays.
Firstly uniformly sampling n (i.e. a first number) first sampling points on rays, wherein n is a positive integer greater than 2; then, according to the depth value of the pixel point, determining a preset number of key sampling points closest to the pixel point from n first sampling points, or determining key sampling points smaller than a distance threshold from the pixel point from n first sampling points; then, resampling m second sampling points according to the determined key sampling points, wherein m is a positive integer greater than 1; and finally, determining the n+m sampling points obtained by sampling as a plurality of sampling points obtained by sampling on the rays. The m sampling points are sampled again at the key sampling points, so that the training effect of the model is more accurate at the surface of the three-dimensional model, and the reconstruction accuracy of the three-dimensional model is improved.
Optionally, in some embodiments, determining a depth value corresponding to the pixel point according to a depth image corresponding to the color image; calculating an SDF value of each sampling point from the pixel point based on the depth value; and calculating coordinate information of each sampling point according to the camera parameters and the depth values.
After a plurality of sampling points are sampled on the rays corresponding to each pixel point, for each sampling point, determining the distance between the shooting position of the color camera and the corresponding point on the target object according to the camera parameters and the depth value of the pixel point, and then calculating the SDF value of each sampling point one by one and the coordinate information of each sampling point based on the distance.
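The two-stage sampling and the SDF computation described above can be sketched as follows in Python. The sampling range, the numbers of uniform and key samples, and the width of the band around the observed depth are illustrative assumptions, not values from the application:

    import numpy as np

    def sample_points_on_ray(origin, direction, depth, n=64, m=32,
                             near=0.1, far=5.0, band=0.05):
        # First stage: n points uniformly spaced along the ray.
        t_uniform = np.linspace(near, far, n)
        # Second stage: m extra points in a narrow band around the observed depth value,
        # so that samples concentrate near the model surface.
        t_key = np.linspace(depth - band, depth + band, m)
        t_all = np.concatenate([t_uniform, t_key])
        points = origin + t_all[:, None] * direction    # first coordinate information (world frame)
        sdf = depth - t_all                              # positive outside, negative inside, zero on the surface
        return points, sdf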
After the training of the basic model is completed, for the coordinate information of any given point, the trained basic model can predict the corresponding SDF value, and the predicted SDF value indicates the positional relationship (inside, outside, or on the surface) between the point and the three-dimensional model of the target object, thereby realizing the implicit expression of the three-dimensional model of the target object and obtaining the neural network model that implicitly expresses the three-dimensional model of the target object.
Finally, isosurface extraction is performed on the neural network model, for example by drawing the surface of the three-dimensional model with an isosurface extraction algorithm such as marching cubes (MC), so as to obtain the surface of the three-dimensional model, and the three-dimensional model of the target object is then obtained from that surface.
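A minimal sketch of this extraction step, assuming the trained model from the earlier sketch and the marching cubes implementation in scikit-image, might look like this; the grid resolution and bound are illustrative:

    import numpy as np
    import torch
    from skimage.measure import marching_cubes

    def extract_mesh(model, resolution=128, bound=1.0):
        # Query the trained implicit model for SDF values on a dense 3-D grid.
        xs = np.linspace(-bound, bound, resolution, dtype=np.float32)
        grid = np.stack(np.meshgrid(xs, xs, xs, indexing="ij"), axis=-1).reshape(-1, 3)
        with torch.no_grad():
            sdf, _ = model(torch.from_numpy(grid))
        sdf = sdf.numpy().reshape(resolution, resolution, resolution)
        # The zero level set of the SDF is the reconstructed surface of the target object.
        verts, faces, normals, _ = marching_cubes(sdf, level=0.0)
        return verts, faces, normals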
According to the three-dimensional reconstruction scheme, the three-dimensional model of the target object is implicitly modeled through the neural network, and depth information is added to improve the training speed and accuracy of the model. By adopting the three-dimensional reconstruction scheme provided by the application, the three-dimensional reconstruction is continuously carried out on the shooting object in time sequence, so that three-dimensional models of the shooting object at different moments can be obtained, and a three-dimensional model sequence formed by the three-dimensional models at different moments according to the time sequence is the volume video shot by the shooting object. Therefore, the volume video shooting can be carried out on any shooting object, and the volume video with specific content presentation can be obtained. For example, the dance shooting object can be shot with a volume video to obtain a volume video of dance of the shooting object at any angle, the teaching shooting object can be shot with a volume video to obtain a teaching volume video of the shooting object at any angle, and the like.
It should be noted that, the volume video according to the following embodiments of the present application may be obtained by shooting using the above volume video shooting method.
In order to understand the volumetric video playing method provided in the embodiments of the present application in more detail, please refer to fig. 2, which is a first schematic flow chart of the volumetric video playing method according to an embodiment of the present application. The volumetric video playing method can comprise the following steps:
110. Limb movements of a virtual object in a volumetric video are identified.
In some embodiments, the virtual object in the volumetric video may perform various limb movements, such as various limb movements of the virtual object's arms, head, legs, etc. may occur while the virtual object is dancing.
In some implementations, the electronic device can identify a limb in the volumetric video in which the virtual object is acting; and determining a plurality of limb nodes on the limb with the action, and determining the limb action of the virtual object according to the spatial position relation among the limb nodes.
In particular, the electronic device may identify, through a neural network model, the limb of the virtual object in the volumetric video on which an action occurs. For example, if the virtual object is a dancing person, the neural network model may set limb nodes at a plurality of positions on the person's body according to a human body model, such as the elbows, fingers, wrists, head, and facial features, and may group the limb nodes of each limb into a limb node group; when the spatial positional relationship between different limb nodes within a limb node group changes, it is confirmed that the limb corresponding to that limb node group is acting.
The electronic device may determine the specific limb action of each limb according to the spatial positional relationship occurring between the different limb nodes within each limb node group. For example, when the limb nodes of the arm, the shoulder, and the elbow are on the same horizontal line, the limb action corresponding to that arm is determined to be raising the arm.
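A toy version of this rule can be written as follows; it assumes that the 3-D positions of the shoulder, elbow, and wrist nodes are available from the volumetric video model, and the tolerance is an illustrative value:

    import numpy as np

    def is_arm_raised(shoulder, elbow, wrist, tol=0.05):
        # The arm is treated as raised when the three limb nodes lie on (approximately)
        # the same horizontal line, i.e. their heights agree within tol.
        heights = np.array([shoulder[1], elbow[1], wrist[1]])   # assume the y axis points up
        return float(heights.max() - heights.min()) < tol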
In some implementations, if multiple virtual objects exist in the volumetric video, different virtual objects may be detected by a multi-target detection model.
By way of example and not limitation, the multi-target detection model may be a YOLOv3 (You Only Look Once, version 3) detection model, and the multi-target tracking model may be a DeepSORT tracking model. The YOLOv3 detection model comprises a Darknet-53 feature extraction module, an up-sampling and feature fusion module, and a regression analysis module. The DeepSORT tracking model comprises a target feature modeling module, Kalman filtering, and the Hungarian algorithm.
Detecting and tracking a plurality of moving targets in a plurality of video frames using the YOLOv3 detection model and the DeepSORT tracking model includes the following process. For each of the plurality of video frames, the video frame is first input into the YOLOv3 detection model for processing. Specifically, the Darknet-53 feature extraction module extracts features from the video frame, the up-sampling and feature fusion module and the regression analysis module process the extracted features, a plurality of moving targets are detected in each video frame, and the prediction boxes corresponding to the moving targets in each video frame are determined. Then, the prediction box of each moving target is input into the DeepSORT tracking model for processing. Specifically, target features are modeled according to the prediction box of each moving target, the moving targets of two adjacent video frames are matched and tracked using Kalman filtering and the Hungarian algorithm, and the moving positions of the same moving target in the two adjacent frames are determined, so that the plurality of moving targets are tracked across the plurality of video frames. Each moving target can be regarded as one virtual object.
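The per-frame detect-then-track loop described above can be sketched conceptually as follows. The yolo_v3 and deep_sort objects stand for whatever YOLOv3 and DeepSORT implementations are used; their method names here are placeholders, not real library APIs:

    def detect_and_track(video_frames, yolo_v3, deep_sort):
        # For each frame: detect moving targets (prediction boxes), then let the tracker
        # match them to existing tracks with Kalman filtering and the Hungarian algorithm.
        tracks_per_frame = []
        for frame in video_frames:
            boxes = yolo_v3.detect(frame)             # hypothetical: prediction boxes for moving targets
            tracks = deep_sort.update(boxes, frame)   # hypothetical: per-target track ids across frames
            tracks_per_frame.append(tracks)           # each persistent track can be treated as one virtual object
        return tracks_per_frame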
In some implementations, the electronic device can detect the limb action of the virtual object by means of a gesture recognition model. By way of example and not limitation, the gesture recognition model may be a 2s-AGCN model (Two-Stream Adaptive Graph Convolutional Network). The 2s-AGCN model includes two branches, a B-Stream (Bone Stream) and a J-Stream (Joint Stream). The J-Stream extracts features from the joint information in the pose information, the B-Stream extracts features from the bone information in the pose information, and the features of the two streams are fused and classified through a SoftMax layer, so that the limb action of the virtual object can be determined.
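The late fusion of the two streams can be illustrated with the following sketch, which assumes that the J-Stream and B-Stream networks have already produced per-class scores; the fusion-by-summation shown here is one common choice, used only as an illustration:

    import torch
    import torch.nn.functional as F

    def fuse_two_streams(joint_logits, bone_logits):
        # Sum the class scores of the joint stream and the bone stream,
        # then apply SoftMax to obtain limb-action probabilities.
        fused = joint_logits + bone_logits
        probs = F.softmax(fused, dim=-1)
        return probs.argmax(dim=-1), probs    # predicted action index and its distribution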
It should be noted that, in the embodiment of the present application, the virtual object may be detected by the multi-target detection model, and then the limb action corresponding to each virtual object may be identified by the gesture recognition model.
In some embodiments, the electronic device may further obtain a limb action corresponding to a spatial position relationship between different limb nodes in the database, and after obtaining the spatial position relationship between the limb nodes of the character, match the spatial position relationship with the spatial position relationship between the same limb nodes in the database, so as to determine the limb action corresponding to the limb of the character in the database. Such as lifting an arm, jumping, bending a leg, etc.
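The database matching described above can be sketched as a nearest-template lookup; the database contents and the similarity measure are assumptions made for the example:

    def match_limb_action(observed_relation, action_database, similarity):
        # action_database maps action names (e.g. "raise arm", "jump", "bend leg")
        # to the stored spatial positional relationship between the same limb nodes.
        # similarity(a, b) is a hypothetical score; higher means a closer match.
        best_action, _ = max(action_database.items(),
                             key=lambda item: similarity(observed_relation, item[1]))
        return best_action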
120. And determining at least one characteristic point on the limb corresponding to the virtual object according to the limb action of the virtual object.
In some embodiments, the electronic device may identify a plurality of limb actions of the virtual object. The electronic device may select at least one target limb action from the plurality of limb actions, determine the limb corresponding to the target limb action, and determine at least one feature point on that limb.
For example, the palm and the legs of the virtual object each have a limb motion, the electronic device may select the limb motion of the palm as the target limb motion, and then determine at least one feature point on the palm. It is understood that the feature point may be understood as a point for determining the approximate position of the virtual element when the virtual element is subsequently set.
In some embodiments, if there are multiple limb movements of the virtual object, the electronic device may determine the target limb movement according to the magnitude of the movement corresponding to each limb movement, for example, the limb movement of the arm of the virtual object is larger in magnitude and the limb movement of the waist is smaller in magnitude, and then select the limb movement of the arm as the target limb movement.
In some embodiments, if the virtual object has multiple limb actions, the electronic device may determine the target limb action according to the size of the coverage area corresponding to each limb action. For example, if the coverage area of the limb movement of the leg of the virtual object is large and the coverage area of the limb movement of the hand is small, the limb movement of the hand is selected as the target limb movement.
In some implementations, the electronic device may also determine limb movements of the multi-frame virtual object; determining a first target area where a limb is not blocked by other parts of the virtual object in multi-frame limb movements; at least one feature point is determined in the first target region.
For example, during the playing of the volumetric video, the limb of the virtual object is moved, thereby generating a limb motion, and the electronic device may determine the limb motion of the virtual object for each frame. For example, the limb motion is a limb motion of the hand, and in the process of executing the limb motion of the hand, the finger part is not blocked by other parts of the virtual object, for example, the finger part is not blocked by the parts of the arm, the head and the like of the virtual object, and then the region corresponding to the finger part is selected as the first target region. At least one feature point is then selected on the first target area.
In some embodiments, if there are a plurality of virtual elements to be displayed, a plurality of feature points may be determined on the first target area of the limb, where each feature point corresponds to one virtual element. If one virtual element needs to be displayed, a feature point can be determined on the first target area of the limb, and the feature point corresponds to the virtual element. If there are multiple virtual elements to be displayed, but there is only one feature point on the first target area of the limb, the one feature point may correspond to the multiple virtual elements.
130. And determining corresponding associated areas near the limbs according to at least one characteristic point, and setting at least one virtual element in each associated area.
In some embodiments, the electronic device may determine a region to be selected that is closest to the feature point and does not occlude the limb; the associated region is then determined among the regions to be selected.
For example, the limb is a hand of the virtual object, a feature point is arranged on a finger tip of the virtual object, and a feature point is arranged on a back of the hand of the virtual object, so that a region to be selected, which is closest to the feature point of the finger tip and does not block the limb, can be determined, and a region to be selected, which is closest to the feature point of the back of the hand and does not block the limb, can be determined. And finally, determining the associated area in the areas to be selected.
In some embodiments, the electronic device may determine, during execution of the limb action, the region to be selected that blocks a preset region of the virtual object the least, and determine that region to be selected as the associated region.
For example, in the process of executing the limb action, some preset regions of the virtual object are often blocked; for instance, the preset region is the face of the virtual object, and when the virtual object dances, the hands execute the limb action and block the face. Between the region to be selected corresponding to the fingertip feature point and the region to be selected corresponding to the back-of-hand feature point, if the region corresponding to the fingertip feature point blocks the face of the virtual object less, the region to be selected corresponding to the fingertip feature point is chosen as the associated region.
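A toy selection of the associated region along these lines might look as follows; occludes() is a hypothetical geometric test, and the candidate regions are assumed to have been derived from the feature points as described above:

    def choose_associated_region(candidate_regions, preset_region, frames, occludes):
        # Pick the candidate region that blocks the preset region (e.g. the face)
        # in the fewest frames over the course of the limb action.
        def occlusion_count(region):
            return sum(1 for frame in frames if occludes(region, preset_region, frame))
        return min(candidate_regions, key=occlusion_count)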
140. When the volumetric video is played, limb movements of the virtual object are presented and at least one virtual element is presented in the associated region.
In some embodiments, when the volume video is played, determining the feature point position of the limb of each frame of virtual object; at least one virtual element is shown in the associated region according to the feature point location of each frame.
It can be understood that when the virtual object performs the limb action, the feature points corresponding to the limbs change in the three-dimensional space, so that the associated areas corresponding to the feature points also change, and the virtual elements are displayed in different spatial positions along with the change of the associated areas.
For example, when the virtual object is dancing and the feature point is at the fingertip of the virtual object, the fingertip may be in front of the chest in one second and gradually move toward the shoulder over the following seconds, so the spatial position of the fingertip feature point changes accordingly. The electronic device determines the associated region corresponding to the feature point in each frame, and then sets and displays at least one virtual element in that associated region.
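Re-anchoring the virtual elements frame by frame, as described above, can be sketched like this; all helper functions are placeholders for the feature-point and associated-region logic of the earlier steps:

    def place_elements_per_frame(frames, feature_point_of, associated_region_of, elements):
        # For each played frame, find the current 3-D position of the feature point,
        # derive its associated region, and attach every virtual element to that region
        # so the elements follow the limb action of the virtual object.
        placements = []
        for frame in frames:
            feature_point = feature_point_of(frame)
            region = associated_region_of(feature_point, frame)
            placements.append({element: region for element in elements})
        return placements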
For example, in order to increase the interest when playing the volume video, in the embodiment of the present application, a virtual element may be added on or near a limb corresponding to the virtual object, where the virtual element may be a visual virtual element such as an animation, a picture, a text, a light effect, or the like.
In some embodiments, a corresponding sound source location may be set on the virtual element, which may play the sound source of the virtual element when the virtual element is played with the virtual object. Such as playing music, speech, special effects, etc.
For example, when the volumetric video corresponding to the virtual object is played, the light effect of the virtual element can be played while the virtual object is dancing, and the corresponding instrument sound can be played at the sound source position, creating a more atmospheric and entertaining concert effect.
In the embodiment of the application, the electronic device identifies a limb action of a virtual object in the volumetric video; determines at least one feature point on the limb corresponding to the virtual object according to the limb action of the virtual object; determines a corresponding associated region near the limb according to the at least one feature point, and sets at least one virtual element in each associated region; and, when the volumetric video is played, presents the limb action of the virtual object and presents the at least one virtual element in the associated region. Because virtual elements are added in the region corresponding to the virtual object when the volumetric video is played, playback of the volumetric video becomes more engaging.
In order to understand the volumetric video playing method provided in the embodiments of the present application in more detail, please continue to refer to fig. 3, which is a second schematic flow chart of the volumetric video playing method according to an embodiment of the present application. The volumetric video playing method can comprise the following steps:
201. limb movements of a virtual object in a volumetric video are identified.
In some embodiments, the virtual object in the volumetric video may perform various limb movements, such as various limb movements of the virtual object's arms, head, legs, etc. may occur while the virtual object is dancing.
In some implementations, the electronic device can identify a limb in the volumetric video in which the virtual object is acting; and determining a plurality of limb nodes on the limb with the action, and determining the limb action of the virtual object according to the spatial position relation among the limb nodes.
In particular, the electronic device may identify, through a neural network model, the limb of the virtual object in the volumetric video on which an action occurs. For example, if the virtual object is a dancing person, the neural network model may set limb nodes at a plurality of positions on the person's body according to a human body model, such as the elbows, fingers, wrists, head, and facial features, and may group the limb nodes of each limb into a limb node group; when the spatial positional relationship between different limb nodes within a limb node group changes, it is confirmed that the limb corresponding to that limb node group is acting.
The electronic device may determine the specific limb action of each limb according to the spatial positional relationship occurring between the different limb nodes within each limb node group. For example, when the limb nodes of the arm, the shoulder, and the elbow are on the same horizontal line, the limb action corresponding to that arm is determined to be raising the arm.
In some implementations, if multiple virtual objects exist in the volumetric video, different virtual objects may be detected by a multi-target detection model.
By way of example and not limitation, the multi-target detection model may be a YOLOv3 (You Only Look Once, version 3) detection model, and the multi-target tracking model may be a DeepSORT tracking model. The YOLOv3 detection model comprises a Darknet-53 feature extraction module, an up-sampling and feature fusion module, and a regression analysis module. The DeepSORT tracking model comprises a target feature modeling module, Kalman filtering, and the Hungarian algorithm.
Detecting and tracking a plurality of moving targets in a plurality of video frames using the YOLOv3 detection model and the DeepSORT tracking model includes the following process. For each of the plurality of video frames, the video frame is first input into the YOLOv3 detection model for processing. Specifically, the Darknet-53 feature extraction module extracts features from the video frame, the up-sampling and feature fusion module and the regression analysis module process the extracted features, a plurality of moving targets are detected in each video frame, and the prediction boxes corresponding to the moving targets in each video frame are determined. Then, the prediction box of each moving target is input into the DeepSORT tracking model for processing. Specifically, target features are modeled according to the prediction box of each moving target, the moving targets of two adjacent video frames are matched and tracked using Kalman filtering and the Hungarian algorithm, and the moving positions of the same moving target in the two adjacent frames are determined, so that the plurality of moving targets are tracked across the plurality of video frames. Each moving target can be regarded as one virtual object.
In some implementations, the electronic device can detect the limb action of the virtual object by means of a gesture recognition model. By way of example and not limitation, the gesture recognition model may be a 2s-AGCN model (Two-Stream Adaptive Graph Convolutional Network). The 2s-AGCN model includes two branches, a B-Stream (Bone Stream) and a J-Stream (Joint Stream). The J-Stream extracts features from the joint information in the pose information, the B-Stream extracts features from the bone information in the pose information, and the features of the two streams are fused and classified through a SoftMax layer, so that the limb action of the virtual object can be determined.
It should be noted that, in the embodiment of the present application, the virtual object may be detected by the multi-target detection model, and then the limb action corresponding to each virtual object may be identified by the gesture recognition model.
In some embodiments, the electronic device may further obtain a limb action corresponding to a spatial position relationship between different limb nodes in the database, and after obtaining the spatial position relationship between the limb nodes of the character, match the spatial position relationship with the spatial position relationship between the same limb nodes in the database, so as to determine the limb action corresponding to the limb of the character in the database. Such as lifting an arm, jumping, bending a leg, etc.
202. A limb action of the multi-frame virtual object is determined.
In some embodiments, the limb movement of the virtual object is continuously changed during the movement process, and the electronic device may determine a preset time period for implementing the virtual element display, and then acquire the limb movement of the multi-frame virtual object in the preset time period.
203. And determining a first target area in which the limb is not blocked by other parts of the virtual object in the multi-frame limb movement, and determining at least one characteristic point in the first target area.
For example, during the playing of the volumetric video, the limb of the virtual object is moved, thereby generating a limb motion, and the electronic device may determine the limb motion of the virtual object for each frame. For example, the limb motion is a limb motion of the hand, and in the process of executing the limb motion of the hand, the finger part is not blocked by other parts of the virtual object, for example, the finger part is not blocked by the parts of the arm, the head and the like of the virtual object, and then the region corresponding to the finger part is selected as the first target region. At least one feature point is then selected on the first target area.
It is understood that the feature point may be understood as a point for determining the approximate position of the virtual element when the virtual element is subsequently set.
In some embodiments, if there are multiple limb movements of the virtual object, the electronic device may determine the target limb movement according to the magnitude of the movement corresponding to each limb movement, for example, the limb movement of the arm of the virtual object is larger in magnitude and the limb movement of the waist is smaller in magnitude, and then select the limb movement of the arm as the target limb movement.
In some embodiments, if the virtual object has multiple limb actions, the electronic device may determine the target limb action according to the size of the coverage area corresponding to each limb action. For example, if the coverage area of the limb movement of the leg of the virtual object is large and the coverage area of the limb movement of the hand is small, the limb movement of the hand is selected as the target limb movement.
204. And determining the region to be selected which is nearest to the characteristic points and does not shade the limb.
In some embodiments, in order to display the virtual element without shielding the corresponding limb, for example, in order to display the virtual element without shielding the hand of the user, an area closest to the feature point of the hand may be the area to be selected.
The region to be selected is closer to the hand, does not shade the hand, and is convenient for subsequent display of the virtual elements.
205. And determining the area to be selected with minimum shielding to the preset area of the virtual object in the process of executing the limb action.
For example, in the process of executing the limb action, some preset regions of the virtual object are often blocked; for instance, the preset region is the face of the virtual object, and when the virtual object dances, the hands execute the limb action and block the face. Between the region to be selected corresponding to the fingertip feature point and the region to be selected corresponding to the back-of-hand feature point, the region corresponding to the fingertip feature point blocks the face of the virtual object less, so the region to be selected corresponding to the fingertip feature point is determined to be the region to be selected with minimal occlusion of the preset region of the virtual object.
206. And determining the region to be selected with minimum shielding on the preset region of the virtual object as an associated region, and setting at least one virtual element in each associated region.
In some implementations, the electronic device may determine a region to be selected that has minimal occlusion to a preset region of the virtual object as the associated region. For example, in the above example, if the region to be selected corresponding to the finger tip feature point has less occlusion on the face of the virtual object, the region to be selected corresponding to the finger tip feature point is selected as the associated region.
In some embodiments, at least one virtual element is set in each associated region, e.g., the virtual elements to be presented are picture 1, picture 2, and picture 3, then picture 1, picture 2, and picture 3 are all set in the associated region.
In some embodiments, if the limb is occluded by other parts of the virtual object in the multi-frame limb actions, a second target region that is occluded the fewest times in the multi-frame limb actions is determined, and at least one feature point is determined in the second target region.
For example, while the virtual object is dancing, the hand performs a limb action and is blocked by other parts of the virtual object, and the arm also performs a limb action and may be blocked by other parts of the virtual object.
If the limb actions corresponding to the arm are blocked less often by other parts of the virtual object, at least one feature point is determined on the arm, and the associated region corresponding to that feature point is then determined.
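Counting occlusions across the multi-frame limb actions, as described for the second target region, can be sketched as follows; is_occluded() is a hypothetical per-frame test:

    def least_occluded_region(candidate_limb_regions, frames, is_occluded):
        # Count, for each candidate limb region, in how many frames it is blocked by
        # other parts of the virtual object, and return the least-blocked region
        # as the second target region.
        occlusion_counts = {region: sum(1 for frame in frames if is_occluded(region, frame))
                            for region in candidate_limb_regions}
        return min(occlusion_counts, key=occlusion_counts.get)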
207. And when the volume video is played, determining the characteristic point positions of limbs of each frame of virtual object.
It can be understood that when the virtual object performs the limb action, the feature points corresponding to the limbs change in the three-dimensional space, so that the associated areas corresponding to the feature points also change, and the virtual elements are displayed in different spatial positions along with the change of the associated areas.
The electronic device may determine a feature point location of a limb of the virtual object for each frame.
208. At least one virtual element is shown in the associated region according to the feature point location of each frame.
For example, when the virtual object is dancing and the feature point is at the fingertip of the virtual object, the fingertip may be in front of the chest in one second and gradually move toward the shoulder over the following seconds, so the spatial position of the fingertip feature point changes accordingly. The electronic device determines the associated region corresponding to the feature point in each frame, and then sets and displays at least one virtual element in that associated region.
For example, in order to increase the interest when playing the volume video, in the embodiment of the present application, a virtual element may be added on or near a limb corresponding to the virtual object, where the virtual element may be a visual virtual element such as an animation, a picture, a text, a light effect, or the like.
In some embodiments, a corresponding sound source location may be set on the virtual element, which may play the sound source of the virtual element when the virtual element is played with the virtual object. Such as playing music, speech, special effects, etc.
For example, when the volumetric video corresponding to the virtual object is played, the light effect of the virtual element can be played while the virtual object is dancing, and the corresponding instrument sound can be played at the sound source position, creating a more atmospheric and entertaining concert effect.
In the embodiment of the application, the electronic device identifies a limb action of a virtual object in the volumetric video; determines the limb actions of the virtual object over multiple frames; determines a first target region in which the limb is not blocked by other parts of the virtual object in the multi-frame limb actions, and determines at least one feature point in the first target region; determines the region to be selected that is nearest to the feature point and does not block the limb; determines, in the process of executing the limb action, the region to be selected with minimal occlusion of the preset region of the virtual object; determines that region to be selected as the associated region and sets at least one virtual element in each associated region; determines the feature point position of the limb of the virtual object in each frame when the volumetric video is played; and shows at least one virtual element in the associated region according to the feature point position of each frame. Because virtual elements are added in the region corresponding to the virtual object when the volumetric video is played, playback of the volumetric video becomes more engaging.
Referring to fig. 4, fig. 4 is a schematic view of a scenario of a volumetric video playback method according to an embodiment of the present application.
The virtual object in the figure is a dancer. As the dancer dances, the limb actions of the arms change continuously, and the electronic device can identify the limb actions of the virtual object in the volumetric video; determine at least one feature point on the corresponding limb of the virtual object according to those limb actions; determine corresponding associated regions near the limb according to the at least one feature point, and set at least one virtual element in each associated region; and, when the volumetric video is played, display the limb actions of the virtual object and display the at least one virtual element in the associated region.
In one frame, the associated region is in front of the virtual object, and the virtual element, such as an animation, a picture, or text, is displayed in that associated region. In the next frame, if the associated region has moved in front of the chest of the virtual object, the virtual element is moved to the chest of the virtual object for display according to the associated region corresponding to the feature point, thereby making playback of the volumetric video more engaging.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a volumetric video playback device according to an embodiment of the present application. The volumetric video playback device 300 may include:
The identifying module 310 is configured to identify a limb action of a virtual object in the volumetric video.
The identifying module 310 is further configured to identify a limb of the virtual object on which an action occurs in the volumetric video;
determine a plurality of limb nodes on that limb, and determine the limb action of the virtual object according to the spatial position relationship among the limb nodes.
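As an illustration of how a limb action might be derived from the spatial position relationship among limb nodes, the Python sketch below computes the angle at the elbow from three joints of one arm and labels the action accordingly. The joint layout, coordinates, and the 150-degree threshold are assumptions for the example, not values from the patent.

```python
import math

def angle_at(b, a, c):
    """Angle (degrees) at joint b formed by segments b->a and b->c."""
    v1 = (a[0] - b[0], a[1] - b[1], a[2] - b[2])
    v2 = (c[0] - b[0], c[1] - b[1], c[2] - b[2])
    dot = sum(x * y for x, y in zip(v1, v2))
    n1 = math.sqrt(sum(x * x for x in v1))
    n2 = math.sqrt(sum(x * x for x in v2))
    return math.degrees(math.acos(max(-1.0, min(1.0, dot / (n1 * n2)))))

def classify_arm_action(shoulder, elbow, wrist):
    # An almost straight arm is read as an "extend" action, a bent elbow
    # (small angle) as a "bend/raise" style action.
    elbow_angle = angle_at(elbow, shoulder, wrist)
    return "arm extended" if elbow_angle > 150.0 else "arm bent"

# Example joint positions in metres (hypothetical capture data).
print(classify_arm_action((0.0, 1.5, 0.0), (0.3, 1.4, 0.0), (0.6, 1.4, 0.0)))
```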
The determining module 320 is configured to determine at least one feature point on a limb corresponding to the virtual object according to the limb motion of the virtual object.
The determining module 320 is further configured to determine limb actions of the virtual object over multiple frames;
determine a first target region in which the limb is not blocked by other parts of the virtual object during the multi-frame limb actions;
and determine the at least one feature point in the first target region.
The determining module 320 is further configured to, if the limb is blocked by other parts of the virtual object during the multi-frame limb actions, determine a second target region that is occluded the fewest times in the multi-frame limb actions, and determine the at least one feature point in the second target region.
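A minimal sketch of the first/second target region selection performed by the determining module: prefer a region of the limb that is never blocked across the sampled frames, otherwise fall back to the region blocked in the fewest frames. The per-frame occlusion flags are assumed to come from a visibility test that is outside the scope of this sketch.

```python
def pick_target_region(occlusion_by_region: dict[str, list[bool]]) -> str:
    # occlusion_by_region maps a candidate region name to one bool per frame,
    # True meaning the region is occluded by another part of the virtual object.
    never_occluded = [r for r, flags in occlusion_by_region.items() if not any(flags)]
    if never_occluded:
        return never_occluded[0]          # "first target region"
    # "second target region": the region occluded in the fewest frames
    return min(occlusion_by_region, key=lambda r: sum(occlusion_by_region[r]))

frames_occlusion = {
    "upper_arm": [False, True, True],
    "forearm":   [False, True, False],
    "hand":      [False, False, False],
}
print(pick_target_region(frames_occlusion))  # -> "hand"
```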
The setting module 330 is configured to determine a corresponding associated region near the limb according to at least one feature point, and set at least one virtual element in each associated region.
The setting module 330 is further configured to determine a region to be selected that is closest to the feature point and does not occlude the limb;
and determine the associated region among the regions to be selected.
The setting module 330 is further configured to determine, during execution of the limb action, the region to be selected that occludes a preset region of the virtual object the least;
and determine that region to be selected as the associated region.
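The candidate ranking performed by the setting module could look roughly like the following sketch, with made-up geometry: discard regions to be selected that would cover the limb, then pick the one that occludes the preset region (for example, the face) the least, breaking ties by distance to the feature point. All field names and numbers are illustrative assumptions.

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    name: str
    distance_to_feature: float   # metres from the feature point
    occludes_limb: bool          # would the element cover the limb itself?
    preset_overlap: float        # overlap ratio with the preset region (0..1)

def choose_associated_region(candidates: list[Candidate]) -> Candidate:
    # Keep only candidates that do not occlude the limb, then minimise the
    # occlusion of the preset region, using distance as a tie-breaker.
    usable = [c for c in candidates if not c.occludes_limb]
    return min(usable, key=lambda c: (c.preset_overlap, c.distance_to_feature))

regions = [
    Candidate("above_hand", 0.15, False, 0.00),
    Candidate("beside_hand", 0.10, False, 0.20),
    Candidate("on_hand", 0.02, True, 0.00),
]
print(choose_associated_region(regions).name)  # -> "above_hand"
```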
The display module 340 is configured to display the limb action of the virtual object and display the at least one virtual element in the associated region when the volumetric video is played.
The display module 340 is further configured to determine the feature point position of the limb of the virtual object in each frame when the volumetric video is played;
and display the at least one virtual element in the associated region according to the feature point position of each frame.
In the embodiment of the application, the volumetric video playback device identifies the limb action of the virtual object in the volumetric video; determines at least one feature point on the corresponding limb of the virtual object according to the limb action; determines corresponding associated regions near the limb according to the at least one feature point, and sets at least one virtual element in each associated region; and, when the volumetric video is played, displays the limb action of the virtual object and displays the at least one virtual element in the associated region. Because virtual elements are added in the region corresponding to the virtual object when the volumetric video is played, the volumetric video becomes more engaging.
Accordingly, the present embodiment also provides an electronic device, as shown in fig. 6, where the electronic device 400 may include a memory 401 including one or more computer readable storage media, an input unit 402, a display unit 403, a sensor 404, a processor 405 including one or more processing cores, and a power supply 406. It will be appreciated by those skilled in the art that the electronic device structure shown in fig. 6 is not limiting of the electronic device and may include more or fewer components than shown, or may combine certain components, or a different arrangement of components. Wherein:
The memory 401 may be used to store software programs and modules, and the processor 405 executes various functional applications and data processing by running the software programs and modules stored in the memory 401. The memory 401 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data created according to the use of the electronic device (such as audio data, phonebooks, etc.), and the like. In addition, the memory 401 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 401 may further include a memory controller to provide access to the memory 401 by the processor 405 and the input unit 402.
The input unit 402 may be used to receive input of numeric or character information and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. In particular, in one embodiment, the input unit 402 may include a touch-sensitive surface, as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations thereon or nearby by a user (e.g., operations performed by the user with any suitable object or accessory such as a finger or a stylus), and drive the corresponding connection device according to a preset program. Alternatively, the touch-sensitive surface may comprise two parts, a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 405, and can receive commands from the processor 405 and execute them. In addition, the touch-sensitive surface may be implemented in a variety of types, such as resistive, capacitive, infrared, and surface acoustic wave types. In addition to the touch-sensitive surface, the input unit 402 may also include other input devices. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 403 may be used to display information entered by a user or provided to a user, as well as various graphical user interfaces of the electronic device, which may be composed of graphics, text, icons, video, and any combination thereof. The display unit 403 may include a display panel, which may optionally be configured in the form of a liquid crystal display (LCD, Liquid Crystal Display), an organic light-emitting diode (OLED, Organic Light-Emitting Diode), or the like. Further, the touch-sensitive surface may overlay the display panel, and upon detection of a touch operation thereon or nearby, the touch operation is passed to the processor 405 to determine the type of touch event, and the processor 405 then provides a corresponding visual output on the display panel based on the type of touch event. Although in fig. 6 the touch-sensitive surface and the display panel are implemented as two separate components for input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement the input and output functions.
The electronic device may also include at least one sensor 404, such as a light sensor, a motion sensor, and other sensors. In particular, the light sensor may include an ambient light sensor that may adjust the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that may turn off the display panel and/or backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when the device is stationary, and can be used for applications that recognize the posture of the electronic device (such as switching between landscape and portrait screens, related games, magnetometer posture calibration), vibration-recognition-related functions (such as a pedometer and tapping), and the like; as for other sensors such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors that may also be configured in the electronic device, details are not described herein.
The processor 405 is a control center of the electronic device, connects various parts of the entire electronic device using various interfaces and lines, and performs various functions of the electronic device and processes data by running or executing software programs and/or modules stored in the memory 401, and calling data stored in the memory 401, thereby performing overall monitoring of the electronic device. Optionally, the processor 405 may include one or more processing cores; preferably, the processor 405 may integrate an application processor and a modem processor, wherein the application processor primarily handles operating systems, user interfaces, applications, etc., and the modem processor primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 405.
The electronic device also includes a power supply 406 (e.g., a battery) for powering the various components, which may be logically connected to the processor 405 via a power management system so as to perform functions such as managing charge, discharge, and power consumption via the power management system. The power supply 406 may also include one or more of any components, such as a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like.
Although not shown, the electronic device may further include a camera, a Bluetooth module, etc., which will not be described herein. In particular, in this embodiment, the processor 405 in the electronic device loads the computer program stored in the memory 401, and, by running the computer program, implements various functions of the volumetric video playing method:
identifying limb actions of a virtual object in the volumetric video;
determining at least one characteristic point on a limb corresponding to the virtual object according to the limb action of the virtual object;
determining corresponding associated areas near the limbs according to at least one characteristic point, and setting at least one virtual element in each associated area;
when the volumetric video is played, limb movements of the virtual object are presented and at least one virtual element is presented in the associated region.
Those of ordinary skill in the art will appreciate that all or a portion of the steps of the various methods of the above embodiments may be performed by instructions, or by instructions controlling associated hardware, which may be stored in a computer-readable storage medium and loaded and executed by a processor.
To this end, embodiments of the present application provide a computer readable storage medium having stored therein a plurality of instructions capable of being loaded by a processor to perform steps in any of the volumetric video playback methods provided by embodiments of the present application. For example, the instructions may perform the steps of:
Identifying limb actions of a virtual object in the volumetric video;
determining at least one characteristic point on a limb corresponding to the virtual object according to the limb action of the virtual object;
determining corresponding associated areas near the limbs according to at least one characteristic point, and setting at least one virtual element in each associated area;
when the volumetric video is played, limb movements of the virtual object are presented and at least one virtual element is presented in the associated region.
The specific implementation of each operation above may be referred to the previous embodiments, and will not be described herein.
Wherein the storage medium may include: a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, and the like.
Because the instructions stored in the storage medium can execute the steps in any volumetric video playing method provided in the embodiments of the present application, the beneficial effects that can be achieved by any volumetric video playing method provided in the embodiments of the present application can be achieved, which are detailed in the previous embodiments and are not described herein again.
The foregoing describes in detail a method, apparatus, electronic device, and storage medium for playing a volumetric video according to the embodiments of the present application, and specific examples are used herein to describe the principles and implementations of the present application; the descriptions of the foregoing embodiments are only intended to help understand the method and core ideas of the present application. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope in light of the ideas of the present application; in view of the above, the contents of this description should not be construed as limiting the present application.

Claims (10)

1. A method for playing a volumetric video, comprising:
identifying limb actions of a virtual object in the volumetric video;
determining at least one characteristic point on a limb corresponding to the virtual object according to the limb action of the virtual object;
determining corresponding associated areas near the limb according to the at least one characteristic point, and setting at least one virtual element in each associated area;
and displaying the limb action of the virtual object and displaying the at least one virtual element in the associated region when the volumetric video is played.
2. The method for playing back a volumetric video according to claim 1, wherein the identifying the limb motion of the virtual object in the volumetric video comprises:
identifying a limb of the virtual object on which an action occurs in the volumetric video;
and determining a plurality of limb nodes on the limb on which the action occurs, and determining the limb action of the virtual object according to the spatial position relationship among the limb nodes.
3. The method for playing a volumetric video according to claim 1, wherein determining at least one feature point on a limb corresponding to the virtual object according to a limb motion of the virtual object comprises:
determining limb actions of the virtual object in a plurality of frames;
determining a first target area in which the limb is not blocked by other parts of the virtual object in the multi-frame limb actions;
and determining the at least one characteristic point in the first target area.
4. The method of claim 3, further comprising:
if the limb is blocked by other parts of the virtual object in the multi-frame limb actions, determining a second target area that is occluded the fewest times in the multi-frame limb actions, and determining the at least one characteristic point in the second target area.
5. The method of claim 1, wherein said determining a corresponding associated region in the vicinity of the limb from the at least one feature point comprises:
determining regions to be selected which are nearest to the at least one characteristic point and do not occlude the limb;
and determining the associated region in the regions to be selected.
6. The method of claim 5, wherein determining the associated region in the regions to be selected comprises:
determining a region to be selected with minimum occlusion of a preset region of the virtual object in the process of executing the limb action;
and determining the region to be selected with minimum occlusion of the preset region of the virtual object as the associated region.
7. The method of any of claims 1-6, wherein the displaying the limb action of the virtual object and displaying the at least one virtual element in the associated region when the volumetric video is played comprises:
determining, when the volumetric video is played, the feature point position of the limb of the virtual object in each frame;
and displaying the at least one virtual element in the associated region according to the feature point position of each frame.
8. A volumetric video playback device, comprising:
the identifying module is used for identifying limb actions of the virtual object in the volumetric video;
the determining module is used for determining at least one characteristic point on the limb corresponding to the virtual object according to the limb action of the virtual object;
the setting module is used for determining corresponding associated areas near the limbs according to the at least one characteristic point and setting at least one virtual element in each associated area;
and the display module is used for displaying the limb action of the virtual object and displaying the at least one virtual element in the associated region when the volumetric video is played.
9. An electronic device, comprising:
a memory storing executable program code, a processor coupled to the memory;
the processor invokes the executable program code stored in the memory to perform the steps in the volumetric video playback method of any one of claims 1-7.
10. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the method of volumetric video playback as claimed in any one of claims 1 to 7.
CN202211610309.3A 2022-12-14 2022-12-14 Volume video playing method and device, electronic equipment and storage medium Pending CN116156141A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211610309.3A CN116156141A (en) 2022-12-14 2022-12-14 Volume video playing method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211610309.3A CN116156141A (en) 2022-12-14 2022-12-14 Volume video playing method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116156141A true CN116156141A (en) 2023-05-23

Family

ID=86338095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211610309.3A Pending CN116156141A (en) 2022-12-14 2022-12-14 Volume video playing method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116156141A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination