CN115802021A - Volume video generation method and device, electronic equipment and storage medium

Volume video generation method and device, electronic equipment and storage medium

Info

Publication number
CN115802021A
Authority
CN
China
Prior art keywords
target
determining
sound
image
information
Legal status
Pending
Application number
CN202211328347.XA
Other languages
Chinese (zh)
Inventor
张煜
蒋志鸿
孙伟
邵志兢
Current Assignee
Zhuhai Prometheus Vision Technology Co ltd
Original Assignee
Zhuhai Prometheus Vision Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Zhuhai Prometheus Vision Technology Co ltd
Priority to CN202211328347.XA
Publication of CN115802021A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiment of the application discloses a volume video generation method and apparatus, an electronic device and a storage medium. The method comprises the following steps: the electronic device acquires sound information and image information of a subject; determines a target image in the image information and determines position information of a target portion of the subject in the target image; determines, in the sound information, the associated sound corresponding to the target image at the same time, and determines the position information of the target portion in the target image as the sound source position of the associated sound; and generates a volume video corresponding to the subject according to the image information, and stores the associated sound and the sound source position of the associated sound into the volume video. In this way, the virtual object in the volume video has a specific sound source position when it produces sound.

Description

Volume video generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer technologies, and in particular, to a method and an apparatus for generating a volume video, an electronic device, and a storage medium.
Background
In the related art, when some videos are played, audio is configured for the videos; for example, several sound sources are placed at fixed spatial positions, and sound is produced through cooperation between these sound sources so as to simulate sound emitted from an approximate position.
However, this approach only lets the user hear sound coming from an approximate position; it cannot accurately reproduce sound emitted from the specific spatial position of a person in the video.
Disclosure of Invention
The embodiment of the application provides a volume video generation method and device, electronic equipment and a storage medium. The volume video generation method can enable the virtual object in the volume video to have a specific sound source position when the virtual object generates sound.
In a first aspect, an embodiment of the present application provides a method for generating a volumetric video, including:
acquiring sound information and image information of a shot object;
determining a target image in the image information and determining position information of a target part of a shot object in the target image;
determining the associated sounds corresponding to the target images at the same time in the sound information, and determining the position information of the target part in the target images as the sound source position of the associated sounds;
and generating a volume video corresponding to the shot object according to the image information, and storing the associated sound and the sound source position of the associated sound into the volume video.
In a second aspect, the present application provides a volumetric video generation apparatus, including:
the acquisition module is used for acquiring sound information and image information of a shot object;
the first determining module is used for determining a target image in the image information and determining the position information of a target part of a shot object in the target image;
the second determining module is used for determining the associated sounds corresponding to the target images at the same time in the sound information and determining the position information of the target part in the target images as the sound source position of the associated sounds;
and the generating module is used for generating a volume video corresponding to the shot object according to the image information and storing the associated sound and the sound source position of the associated sound into the volume video.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory storing executable program code, a processor coupled with the memory; the processor calls the executable program code stored in the memory to execute the steps in the volume video generation method provided by the embodiment of the application.
In a fourth aspect, an embodiment of the present application provides a storage medium, where the storage medium stores a plurality of instructions, and the instructions are suitable for being loaded by a processor to perform steps in a volumetric video generation method provided in the embodiment of the present application.
In the embodiment of the application, the electronic device acquires the sound information and the image information of a subject; determines a target image in the image information and determines position information of a target portion of the subject in the target image; determines, in the sound information, the associated sound corresponding to the target image at the same time, and determines the position information of the target portion in the target image as the sound source position of the associated sound; and generates a volume video corresponding to the subject according to the image information, and stores the associated sound and the sound source position of the associated sound into the volume video. In this way, the virtual object in the volume video has a specific sound source position when it produces sound, so that the audience hears the sound coming from an accurate sound source position.
Drawings
In order to more clearly illustrate the technical solutions in the embodiments of the present application, the drawings needed to be used in the description of the embodiments are briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present application, and it is obvious for those skilled in the art to obtain other drawings based on these drawings without creative efforts.
Fig. 1 is a scene schematic diagram of a shooting system provided in an embodiment of the present application.
Fig. 2 is a first flowchart of a volumetric video generation method according to an embodiment of the present disclosure.
Fig. 3 is a second flowchart of a volumetric video generation method according to an embodiment of the present application.
Fig. 4 is a scene schematic diagram of volume video playing provided in the embodiment of the present application.
Fig. 5 is a schematic structural diagram of a volumetric video generation apparatus according to an embodiment of the present application.
Fig. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In the related art, when some videos are played, audio is configured for the videos; for example, several sound sources are placed at fixed spatial positions, and sound is produced through cooperation between these sound sources so as to simulate sound emitted from an approximate position.
However, this approach only lets the user hear sound coming from an approximate position; it cannot accurately reproduce sound emitted from the specific spatial position of a person in the video.
In order to solve the technical problem, embodiments of the present application provide a volume video generation method, an apparatus, an electronic device, and a storage medium. The volume video generation method can enable the virtual object in the volume video to have a specific sound source position when the virtual object generates sound.
Referring to fig. 1, fig. 1 is a scene schematic diagram of a shooting system according to an embodiment of the present disclosure.
As shown in fig. 1, the shooting system includes an electronic device, a signal source, a camera array and a microphone, wherein the camera array includes a plurality of cameras, each camera is located at a different position, the signal source is connected to each camera in the camera array, the electronic device is connected to the signal source, and the electronic device is connected to the camera array. The electronic device may be a computer, a server, or other electronic devices with certain computing capabilities.
When a plurality of cameras in the camera array need to shoot a shot object in the camera array, the electronic device can control the signal source to simultaneously send the pulse control signal to each camera, and after each camera receives the pulse control signal, each camera can shoot the shot object.
In some embodiments, the camera array includes a plurality of positions, a plurality of camera modules can be arranged at each position, and each camera module can include a plurality of cameras. For example, at a given position, different camera modules are arranged at different heights perpendicular to the ground, and each camera module may include a color camera for capturing color images and a depth camera for capturing depth images. The captured image produced by one camera module may therefore include a color image and a depth image.
After the camera array finishes shooting the shooting object, the electronic device may receive the shot image and the time corresponding to the shot image sent by each camera in the camera array, and then the electronic device performs subsequent image processing according to the received shot image and the time corresponding to the shot image.
During the shooting of the subject, the electronic device may start recording the sound emitted by the subject, for example through the microphone shown in fig. 1. The microphone can be arranged above the area enclosed by the camera array, or it can be placed on the subject itself, so that the sound is captured.
In some implementations, after the electronic device receives the captured images, it can determine them as the image information used for subsequently generating the volume video. After the electronic device receives the recorded audio, it can determine it as the corresponding sound information of the volume video.
Here, volume video (also called volumetric video, spatial video, volumetric three-dimensional video, or six-degrees-of-freedom video, etc.) is a technology that captures information in three-dimensional space (such as depth information and color information) and generates a sequence of three-dimensional models. Compared with traditional video, volume video adds the concept of space to video and uses three-dimensional models to restore the three-dimensional world, instead of simulating the sense of space of the three-dimensional world with a two-dimensional planar video and camera movement. Because a volume video is a sequence of three-dimensional models, a user can adjust to any viewing angle according to preference, so the volume video has a higher degree of restoration and immersion than a two-dimensional planar video.
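For illustration only, a volume video of this kind could be represented as a time-ordered sequence of three-dimensional models, for example with the following (assumed) Python structure; the class and field names are not taken from the disclosure:

```python
from dataclasses import dataclass
from typing import List
import numpy as np

@dataclass
class VolumetricFrame:
    """One frame of a volume video: a reconstructed 3D model plus its capture time."""
    timestamp: float           # capture time in seconds
    vertices: np.ndarray       # (V, 3) vertex positions of the 3D model
    faces: np.ndarray          # (F, 3) triangle indices
    vertex_colors: np.ndarray  # (V, 3) RGB colors restored from the color cameras

@dataclass
class VolumetricVideo:
    """A volume video is simply a time-ordered sequence of 3D models."""
    frames: List[VolumetricFrame]

    def frame_at(self, t: float) -> VolumetricFrame:
        # Return the frame whose timestamp is closest to t.
        return min(self.frames, key=lambda f: abs(f.timestamp - t))
```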
For a more detailed understanding of the method for generating a volume video according to the embodiment of the present application, please continue to refer to fig. 2, where fig. 2 is a first flowchart of the method for generating a volume video according to the embodiment of the present application. The volume video generation method may include the steps of:
110. Sound information and image information of a subject are acquired.
In some embodiments, the object is in a camera array surrounded by a plurality of cameras, the plurality of cameras in the camera array can shoot the object from a plurality of angles, and then the electronic device acquires shot images shot by each camera and determines the shot images as the image information of the object.
In some embodiments, during the shooting of the shot object, the electronic device may capture the sound emitted by the shot object from the beginning of shooting, so as to acquire the sound information of the shot object.
The subject may be a living body capable of emitting sound, such as a human being, a cat, a dog, a panda, or the like.
In some embodiments, the subject may also be an object that cannot emit sound, such as a model or a figurine. In this case, the sound information of the subject may be recorded from a narrator performing a voice-over during shooting.
120. And determining a target image in the image information and determining the position information of a target part of the shot object in the target image.
In some embodiments, the electronic device may select one or more frames of images from the image information and determine them as target images. For example, a target image may be an image corresponding to a moment when the subject emits sound, an image in which the position of the subject changes continuously, or an image of a scene in which a plurality of persons or objects are present.
The target image may be a front image of the subject, such as an image taken by a camera facing the subject. The target image may also be an image having the smallest deviation angle from the vertical direction of the face of the subject, such as an image captured by a camera closest to the face of the subject.
The electronic device may then determine location information of a target portion of the subject within the target image. For example, the target portion may be the mouth of the subject, the nose of the subject, or any other position on the face of the subject.
In some embodiments, the electronic device may further recognize the face of the subject in the target image to determine the target portion; determine the two-dimensional coordinates corresponding to the target portion in the target image; and determine the three-dimensional coordinates of the target portion in three-dimensional space according to the two-dimensional coordinates, and determine these three-dimensional coordinates as the position information of the target portion.
Specifically, the electronic device may acquire a plurality of feature points on the face of the subject, determine a target feature point among the plurality of feature points, and determine the target portion according to the target feature point.
The electronic device can obtain the position relationship between every two feature points in the plurality of feature points, and finally determines the target feature point according to the position relationship.
For example, the subject is a person, and the electronic device may perform face recognition on the face of the person in the target image, thereby obtaining facial feature points, such as feature points corresponding to the positions of the nose, eyes, mouth, ears, eyebrows, and the like.
Then, the positional relationship between every two feature points among the plurality of feature points is obtained. For example, a two-dimensional coordinate system is established on the face, a corresponding two-dimensional coordinate is set for each feature point, and the positional relationship between every two feature points is determined according to the two-dimensional coordinates of the feature points. When the positional relationships of several pairs of feature points match the positional relationships between mouth feature points stored in the database, that group of feature points is determined as the target feature points of the subject. Determining the positional relationship between every two feature points helps to avoid mis-identifying feature points, which would otherwise cause the target portion to be identified incorrectly and the sound source position of the subject to be determined incorrectly later.
The electronic device may determine the two-dimensional coordinates corresponding to the set of feature points as the two-dimensional coordinates corresponding to the target portion in the target image. And then the electronic equipment determines the three-dimensional coordinates of the target part in the three-dimensional space by carrying out back projection calculation on the two-dimensional coordinates of the target characteristic points.
Finally, the electronic device may determine three-dimensional coordinates of the target portion in the three-dimensional space as position information of the target portion of the subject.
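A minimal sketch of locating the two-dimensional mouth coordinate from facial feature points might look as follows; `detect_face_landmarks` is a hypothetical stand-in for any facial landmark detector, not an API from the disclosure:

```python
import numpy as np

def mouth_center_2d(target_image: np.ndarray, detect_face_landmarks) -> np.ndarray:
    """Return the (u, v) pixel coordinates of the mouth in the target image.

    detect_face_landmarks is assumed to return a dict mapping part names
    (e.g. "mouth", "nose") to (N, 2) arrays of 2D feature points.
    """
    landmarks = detect_face_landmarks(target_image)
    mouth_points = landmarks["mouth"]     # (N, 2) feature points of the target portion
    return mouth_points.mean(axis=0)      # centroid used as the 2D target-portion coordinate
```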
130. And determining the associated sound corresponding to the target image at the same time in the sound information, and determining the position information of the target part in the target image as the sound source position of the associated sound.
In some embodiments, in order to implement sound and picture synchronization in a subsequent process, the electronic device may determine, in the sound information, an associated sound corresponding to the target image at the same time.
Alternatively, the electronic device may acquire the time corresponding to the target image, and then determine the associated sound at the same time in the sound information according to that time. For example, if the time corresponding to the target image is 3 minutes and 12 seconds, the sound at 3 minutes and 12 seconds is located in the sound information and determined as the associated sound corresponding to the target image at the same time.
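A minimal sketch of this time alignment, assuming the sound information is stored as a sample array with a known sample rate, might be:

```python
import numpy as np

def associated_sound(audio: np.ndarray, sample_rate: int,
                     image_time_s: float, frame_duration_s: float) -> np.ndarray:
    """Return the slice of recorded audio that shares the target image's timestamp.

    audio            -- 1-D array of recorded samples (the sound information)
    sample_rate      -- samples per second of the recording
    image_time_s     -- capture time of the target image, e.g. 192.0 for 3 min 12 s
    frame_duration_s -- how long one volume-video frame lasts, e.g. 1/30 (assumed)
    """
    start = int(image_time_s * sample_rate)
    stop = int((image_time_s + frame_duration_s) * sample_rate)
    return audio[start:stop]
```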
It can be understood that the target portion can be the sounding portion of the subject, and the electronic device can determine the position information of the target portion in the target image as the sound source position of the associated sound. In this way, the sound of the subject can be accurately located, which facilitates subsequent playback of the volume video: the position of the sound can change with the movement of the virtual object (the subject) in the volume video, giving the user a more immersive sound experience when watching the volume video.
140. And generating a volume video corresponding to the subject according to the image information, and storing the associated sound and the sound source position of the associated sound into the volume video.
In some implementations, after the shooting of the subject is completed, the electronic device may generate a volume video from the captured image information. Alternatively, while the subject is still being shot, the electronic device may generate the volume video from the image information captured over a period of time.
In some embodiments, the electronic device may perform feature extraction on the captured image of each camera, so as to obtain a three-dimensional feature corresponding to the captured object. The electronic device can generate a corresponding volume video according to the three-dimensional features, wherein the volume video comprises a three-dimensional model corresponding to the shot object.
Alternatively, in the present application, the three-dimensional model used to construct the volumetric video may be reconstructed as follows:
firstly, color images and depth images of a shot object at different visual angles and camera parameters corresponding to the color images are obtained; and then training a neural network model for implicitly expressing a three-dimensional model of the shot object according to the obtained color image and the depth image and camera parameters corresponding to the color image, and performing isosurface extraction based on the trained neural network model to realize three-dimensional reconstruction of the shot object so as to obtain the three-dimensional model of the shot object.
It should be noted that, in the embodiment of the present application, no particular limitation is imposed on what architecture is adopted in the neural network model, and the neural network model can be selected by a person skilled in the art according to actual needs. For example, a multi-layer Perceptron (MLP) without a normalization layer may be selected as a base model for model training.
The three-dimensional model reconstruction method provided by the present application will be described in detail below.
Firstly, a plurality of color cameras and depth cameras can be synchronously adopted to shoot a target object (the target object is a shooting object) which needs to be subjected to three-dimensional reconstruction, so that color images and corresponding depth images of the target object at a plurality of different visual angles are obtained, namely, at the same shooting moment (the difference value of the actual shooting moment is less than or equal to a time threshold value, namely, the shooting moments are considered to be the same), the color cameras at all the visual angles shoot the color images of the target object at the corresponding visual angles, and correspondingly, the depth cameras at all the visual angles shoot the depth images of the target object at the corresponding visual angles. It should be noted that the target object may be any object, including but not limited to a living object such as a person, an animal, and a plant, or a non-living object such as a machine, furniture, and a doll.
Therefore, the color images of the target object at different view angles all have corresponding depth images, namely, when shooting is carried out, the color camera and the depth camera can adopt the configuration of the camera set, and the color camera at the same view angle is matched with the depth camera to synchronously shoot the same target object. For example, a studio may be constructed, the central area of which is a shooting area, around which multiple sets of color cameras and depth cameras are paired at certain angles in the horizontal and vertical directions. When the target object is in the shooting area surrounded by the color cameras and the depth cameras, the color images and the corresponding depth images of the target object at different view angles can be obtained through shooting by the color cameras and the depth cameras.
In addition, camera parameters of the color camera corresponding to each color image are further acquired. The camera parameters include internal and external parameters of the color camera, which can be determined by calibration, the internal parameters of the camera are parameters related to the characteristics of the color camera, including but not limited to data such as focal length and pixels of the color camera, and the external parameters of the camera are parameters of the color camera in a world coordinate system, including but not limited to data such as position (coordinates) of the color camera and rotation direction of the camera.
As described above, after the color images and the corresponding depth images of the target object at the same shooting time and at a plurality of different viewing angles are acquired, the target object can be three-dimensionally reconstructed from the color images and the corresponding depth images. Different from a mode of converting depth information into point cloud for three-dimensional reconstruction in the related technology, the method trains a neural network model to realize implicit expression of the three-dimensional model of the target object, and therefore three-dimensional reconstruction of the target object is realized based on the neural network model.
Optionally, the application selects a Multilayer Perceptron (MLP) that does not include a normalization layer as a base model, and trains the MLP as follows:
converting pixel points in each color image into rays based on corresponding camera parameters;
sampling a plurality of sampling points on the ray, and determining first coordinate information of each sampling point and the SDF value of each sampling point relative to the pixel point;
inputting the first coordinate information of the sampling points into a basic model to obtain a predicted SDF value and a predicted RGB color value of each sampling point output by the basic model;
adjusting parameters of the basic model based on a first difference between the predicted SDF value and the SDF value and a second difference between the predicted RGB color value and the RGB color value of the pixel point until a preset stop condition is met;
and taking the basic model meeting the preset stop condition as a neural network model of a three-dimensional model for implicitly expressing the target object.
Firstly, a pixel point in the color image is converted into a ray based on the camera parameters corresponding to the color image, where the ray may be a ray passing through the pixel point and perpendicular to the color image plane. Then, a plurality of sampling points are sampled on the ray; the sampling can be performed in two steps: some sampling points are sampled uniformly, and further sampling points are then taken at a key position based on the depth value of the pixel point, so that as many sampling points as possible fall near the surface of the model. Next, the first coordinate information of each sampling point in the world coordinate system and the Signed Distance Function (SDF) value of each sampling point are calculated according to the camera parameters and the depth value of the pixel point. The SDF value may be the difference between the depth value of the pixel point and the distance from the sampling point to the imaging plane of the camera; this difference is a signed value: when it is positive, the sampling point lies outside the three-dimensional model; when it is negative, the sampling point lies inside the three-dimensional model; and when it is zero, the sampling point lies on the surface of the three-dimensional model. Then, after the sampling is completed and the SDF value of each sampling point has been calculated, the first coordinate information of the sampling points in the world coordinate system is input into the basic model (the basic model is configured to map the input coordinate information to an SDF value and an RGB color value and output them); the SDF value output by the basic model is recorded as the predicted SDF value, and the RGB color value output by the basic model is recorded as the predicted RGB color value. Finally, the parameters of the basic model are adjusted based on a first difference between the predicted SDF value and the SDF value corresponding to the sampling point, and a second difference between the predicted RGB color value and the RGB color value of the pixel point corresponding to the sampling point.
In addition, for other pixel points in the color image, sampling is performed according to the above manner, and then the coordinate information of the sampling point in the world coordinate system is input to the basic model to obtain the corresponding predicted SDF value and the predicted RGB color value, which are used for adjusting the parameters of the basic model until a preset stop condition is satisfied, for example, the preset stop condition may be configured such that the iteration number of the basic model reaches a preset number, or the preset stop condition is configured such that the basic model converges. And when the iteration of the basic model meets the preset stop condition, obtaining the neural network model capable of accurately and implicitly expressing the three-dimensional model of the shot object. And finally, extracting the surface of the three-dimensional model of the neural network model by adopting an isosurface extraction algorithm, thereby obtaining the three-dimensional model of the shot object.
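The training procedure described above can be sketched roughly as follows in PyTorch; the network width, activation, loss weighting and optimizer details are assumptions, since the text does not specify them, so this is only a minimal illustration of the supervision scheme:

```python
import torch
import torch.nn as nn

class SDFColorMLP(nn.Module):
    """MLP without normalization layers: maps a 3D point to (SDF value, RGB color)."""
    def __init__(self, hidden: int = 256, layers: int = 8):
        super().__init__()
        dims = [3] + [hidden] * layers
        self.body = nn.Sequential(*[m for i in range(layers)
                                    for m in (nn.Linear(dims[i], dims[i + 1]), nn.Softplus())])
        self.sdf_head = nn.Linear(hidden, 1)
        self.rgb_head = nn.Sequential(nn.Linear(hidden, 3), nn.Sigmoid())

    def forward(self, xyz):
        h = self.body(xyz)
        return self.sdf_head(h), self.rgb_head(h)

def train_step(model, optimizer, sample_xyz, sample_sdf, pixel_rgb, rgb_weight=0.1):
    """One parameter update from one batch of ray samples.

    sample_xyz -- (N, 3) first coordinate information of sampling points (world frame)
    sample_sdf -- (N, 1) SDF values computed from the depth image
    pixel_rgb  -- (N, 3) RGB colors of the pixels whose rays produced the samples
    """
    pred_sdf, pred_rgb = model(sample_xyz)
    first_difference = (pred_sdf - sample_sdf).abs().mean()   # SDF supervision
    second_difference = (pred_rgb - pixel_rgb).abs().mean()   # color supervision
    loss = first_difference + rgb_weight * second_difference  # rgb_weight is an assumed value
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```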
Optionally, in some embodiments, an imaging plane of the color image is determined according to camera parameters; and determining rays which pass through the pixel points in the color image and are vertical to the imaging surface as rays corresponding to the pixel points.
The coordinate information of the color image in the world coordinate system, that is, the imaging plane, can be determined according to the camera parameters of the color camera corresponding to the color image. Then, the ray passing through the pixel point in the color image and perpendicular to the imaging plane can be determined as the ray corresponding to the pixel point.
Optionally, in some embodiments, the second coordinate information and the rotation angle of the color camera in the world coordinate system are determined according to the camera parameters; and determining an imaging surface of the color image according to the second coordinate information and the rotation angle.
Optionally, in some embodiments, a first number of first sample points are sampled equidistantly on the ray; determining a plurality of key sampling points according to the depth values of the pixel points, and sampling a second number of second sampling points according to the key sampling points; and determining a first number of first sampling points and a second number of second sampling points as a plurality of sampling points sampled on the ray.
Firstly, n (namely, a first number of) first sampling points are uniformly sampled on the ray, where n is a positive integer greater than 2; then, according to the depth value of the pixel point, a preset number of key sampling points closest to the pixel point are determined from the n first sampling points, or key sampling points whose distance from the pixel point is less than a distance threshold are determined from the n first sampling points; then, m second sampling points are sampled around the determined key sampling points, where m is a positive integer greater than 1; and finally, the n + m sampling points obtained in this way are determined as the plurality of sampling points sampled on the ray. Because the m later sampling points are concentrated around the key sampling points, the model is trained more accurately near the surface of the three-dimensional model, which improves the reconstruction precision of the three-dimensional model.
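A minimal sketch of this two-stage sampling might look as follows; the far bound, the sampling counts and the near-surface radius are assumed values, not taken from the disclosure:

```python
import numpy as np

def sample_points_on_ray(origin, direction, pixel_depth,
                         n=64, m=32, far=5.0, near_radius=0.05):
    """Sample n + m points along the ray (origin + t * direction).

    The first n points are equidistant between the camera and `far`; the remaining
    m points are concentrated around the pixel's depth value so that samples fall
    close to the model surface.
    """
    t_uniform = np.linspace(0.0, far, n)                   # first number of sampling points
    t_near = np.linspace(pixel_depth - near_radius,        # second sampling points around the
                         pixel_depth + near_radius, m)     # surface indicated by the depth value
    t_all = np.concatenate([t_uniform, t_near])
    points = origin[None, :] + t_all[:, None] * direction[None, :]
    return points, t_all
```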
Optionally, in some embodiments, the depth value corresponding to the pixel point is determined according to the depth image corresponding to the color image; calculating the SDF value of each sampling point to the pixel point based on the depth value; and calculating the coordinate information of each sampling point according to the camera parameters and the depth values.
After sampling a plurality of sampling points on the ray corresponding to each pixel point, determining the distance between the shooting position of the color camera and the corresponding point on the target object according to the camera parameters and the depth value of the pixel point for each sampling point, then calculating the SDF value of each sampling point one by one based on the distance and calculating the coordinate information of each sampling point.
After the training of the base model is completed, for the given coordinate information of any one point, the corresponding SDF value can be predicted by the trained base model, and the predicted SDF value represents the position relationship (inside, outside or surface) between the point and the three-dimensional model of the target object, so as to implement the implicit expression of the three-dimensional model of the target object, and obtain the neural network model for implicitly expressing the three-dimensional model of the target object.
And finally, isosurface extraction is performed on the neural network model, for example, the surface of the three-dimensional model is extracted by adopting an isosurface extraction algorithm such as Marching Cubes (MC), so as to obtain the surface of the three-dimensional model, and the three-dimensional model of the target object is then obtained from this surface.
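As a minimal sketch of this extraction step (the grid bounds and resolution, and the use of scikit-image's Marching Cubes, are assumptions rather than the patent's implementation):

```python
import numpy as np
import torch
from skimage.measure import marching_cubes

def extract_mesh(model, resolution=64, bound=1.0):
    """Query the trained SDF network on a regular grid and extract the zero level set."""
    axis = np.linspace(-bound, bound, resolution, dtype=np.float32)
    grid = np.stack(np.meshgrid(axis, axis, axis, indexing="ij"), axis=-1)  # (R, R, R, 3)
    with torch.no_grad():
        sdf, _ = model(torch.from_numpy(grid.reshape(-1, 3)))
    sdf_volume = sdf.numpy().reshape(resolution, resolution, resolution)
    # Marching Cubes returns vertices in voxel units; rescale them back to world units.
    verts, faces, normals, _ = marching_cubes(sdf_volume, level=0.0)
    verts = verts / (resolution - 1) * 2.0 * bound - bound
    return verts, faces, normals
```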
According to the above three-dimensional reconstruction scheme, the three-dimensional model of the target object is implicitly modeled by the neural network, and depth information is added to improve the speed and precision of model training. By adopting the three-dimensional reconstruction scheme provided by the application and continuously performing three-dimensional reconstruction of the subject over time, three-dimensional models of the subject at different moments can be obtained, and the sequence formed by these three-dimensional models in time order is the volume video shot of the subject. In this way, volume video shooting can be performed for any subject to obtain a volume video with specific content. For example, volume video shooting can be performed on a dancing subject to obtain a dance volume video that can be watched from any angle, or on a teaching subject to obtain a teaching volume video that can be watched from any angle, and so on.
In some embodiments, after the electronic device generates the volume video from the image information, the electronic device may determine a time of the associated sound, then determine a target video frame of the same time in the volume video from the time of the associated sound, and finally save the associated sound and the sound source position into the target video frame.
For example, in the generated volume video, each video frame in the volume video corresponds to a time, if the time corresponding to the associated sound is the same as the time of one of the video frames, the video frame is determined as a target video frame, and then a mapping relation is established between the associated sound, the sound source position of the associated sound and the target video frame.
Similarly, for the whole volume video, a mapping relationship between sound and sound source position can be established for each frame of the volume video in the above manner. When the user plays the volume video, the corresponding picture and sound of the volume video can be played through this mapping relationship. In this way, sound and picture are synchronized, the sound source in the volume video is located at the sounding part of the virtual character, more accurate sound source localization is achieved, and the sense of audio immersion while watching is increased.
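As an illustration only, one possible (assumed) way to hold this per-frame mapping between the associated sound, its sound source position and the target video frame is sketched below:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple
import numpy as np

@dataclass
class FrameAudio:
    sound: np.ndarray                             # associated sound samples for this frame
    source_position: Tuple[float, float, float]   # 3D sound source position (the target portion)

@dataclass
class VolumetricVideoWithAudio:
    frame_times: List[float]                      # capture time of every volume-video frame
    audio_by_frame: Dict[int, FrameAudio] = field(default_factory=dict)

    def attach(self, sound_time: float, sound: np.ndarray,
               source_position: Tuple[float, float, float]) -> None:
        """Store the associated sound in the target video frame with the same time."""
        target = min(range(len(self.frame_times)),
                     key=lambda i: abs(self.frame_times[i] - sound_time))
        self.audio_by_frame[target] = FrameAudio(sound, source_position)
```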
In the embodiment of the application, the electronic device acquires the sound information and the image information of a subject; determines a target image in the image information and determines position information of a target portion of the subject in the target image; determines, in the sound information, the associated sound corresponding to the target image at the same time, and determines the position information of the target portion in the target image as the sound source position of the associated sound; and generates a volume video corresponding to the subject according to the image information, and stores the associated sound and the sound source position of the associated sound into the volume video. In this way, the virtual object in the volume video has a specific sound source position when it produces sound, so that the audience hears the sound coming from an accurate sound source position.
For a more detailed understanding of the volume video generation method provided in the embodiment of the present application, please continue to refer to fig. 3, where fig. 3 is a second flow chart of the volume video generation method provided in the embodiment of the present application. The volume video generation method may include the steps of:
201. Sound information and image information of a subject are acquired.
In some embodiments, the object is in a camera array surrounded by a plurality of cameras, the plurality of cameras in the camera array can shoot the object from a plurality of angles, then the electronic device acquires shot images shot by each camera and determines the shot images as the image information of the object.
In some embodiments, during the shooting of the shot object, the electronic device may capture the sound emitted by the shot object from the beginning of shooting, so as to acquire the sound information of the shot object.
The subject may be a living body capable of emitting sound, such as a human being, a cat, a dog, a panda, or the like.
In some embodiments, the subject may also be an object that cannot emit sound, such as a model or a figurine. In this case, the sound information of the subject may be recorded from a narrator performing a voice-over during shooting.
202. The face of the subject in the target image is recognized, and a plurality of feature points on the face of the subject are acquired.
In some embodiments, the electronic device may select one or more frames of images from the image information and determine them as target images. For example, a target image may be an image corresponding to a moment when the subject emits sound, an image in which the position of the subject changes continuously, or an image of a scene in which a plurality of persons or objects are present.
The electronic device can recognize the face of the subject in the target image and acquire a plurality of feature points on the face of the subject. For example, the subject is a person, and the electronic device may perform face recognition on the face of the person in the target image, thereby obtaining facial feature points, for example feature points corresponding to the positions of the nose, eyes, mouth, ears, eyebrows, and the like.
The electronic device may use a trained neural network model for face recognition to recognize the target image, so as to obtain the feature points of the face in the target image. For example, the target image may be cropped in advance so that the cropped image contains only the face region, and the cropped image may then be input into the neural network model for recognition, so as to acquire the feature points of the face of the subject.
Before the cropped image is input into the neural network model, it can be preprocessed, for example by adjusting image parameters such as color, brightness and contrast, so that the adjusted image is more easily recognized by the neural network model, which improves the accuracy with which the neural network model identifies the feature points.
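A minimal preprocessing sketch under these assumptions (the face box comes from any detector; the contrast/brightness factors are example values, not specified by the disclosure) might be:

```python
import cv2
import numpy as np

def preprocess_face(image: np.ndarray, face_box, alpha=1.2, beta=10):
    """Crop the face region and adjust contrast (alpha) and brightness (beta).

    face_box is (x, y, w, h) from any face detector; alpha/beta are example values.
    """
    x, y, w, h = face_box
    face = image[y:y + h, x:x + w]
    # convertScaleAbs applies out = |alpha * in + beta|, a simple contrast/brightness tweak.
    return cv2.convertScaleAbs(face, alpha=alpha, beta=beta)
```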
203. And acquiring the position relation between every two feature points in the plurality of feature points.
For example, a two-dimensional coordinate system is established on the face, a corresponding two-dimensional coordinate is set for each feature point, and the positional relationship between every two feature points is determined according to the two-dimensional coordinates of the feature points. Determining the positional relationship between every two feature points helps to avoid mis-identifying feature points, which would otherwise cause the target portion to be identified incorrectly and the sound source position of the subject to be determined incorrectly later.
In some embodiments, if the feature point recognition algorithm is accurate, the final target location may be determined directly according to the corresponding position of the feature point in the two-dimensional coordinate system. For example, the electronic device may directly identify a feature point corresponding to a position of the mouth, and then determine a position corresponding to the feature point as the target portion.
204. And determining target characteristic points according to the position relation, and determining target parts according to the target characteristic points.
When the positional relationships of several pairs of feature points match the positional relationships between mouth feature points stored in the database, that group of feature points is determined as the target feature points of the subject.
For example, the database stores the relationship of facial feature points corresponding to different biological types, for example, human facial feature points have a certain distribution rule, wherein the feature points of the mouth position also have a certain positional relationship, and the database can store the positional relationship between different feature points.
When the position relationship between the feature points of the object to be shot is matched with the database, the electronic device may determine feature points corresponding to the mouth according to the position relationship between the feature points in the database, then determine the feature points of the mouth as target feature points, and finally determine the area where the target feature points are distributed as the target part, that is, the mouth of the object to be shot.
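Purely as an illustration, a sketch of checking pairwise feature-point distances against stored mouth-template distances could look like this; the template format and tolerance are assumptions rather than the patent's database scheme:

```python
import numpy as np
from itertools import combinations

def find_mouth_points(points: np.ndarray, candidate_ids, template: dict, tol=0.15):
    """Return the ids of feature points whose pairwise distances match the mouth template.

    points        -- (N, 2) 2D coordinates of all detected facial feature points
    candidate_ids -- indices of the feature points tested as a mouth group
    template      -- {(i, j): expected normalized distance} stored in the database
    tol           -- relative tolerance when comparing distances (assumed value)
    """
    # Normalize by the largest pairwise distance so the check is scale-invariant.
    scale = max(np.linalg.norm(points[i] - points[j])
                for i, j in combinations(candidate_ids, 2))
    for (i, j), expected in template.items():
        actual = np.linalg.norm(points[i] - points[j]) / scale
        if abs(actual - expected) > tol:
            return None            # positional relationship does not match the database
    return list(candidate_ids)     # these are the target feature points (the mouth)
```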
205. And determining the corresponding two-dimensional coordinates of the target part in the target image.
The electronic device can directly determine the two-dimensional coordinates corresponding to the target portion in the target image according to the two-dimensional coordinate system established above, for example by determining the two-dimensional coordinates of some or all of the pixel points in the target portion.
206. And performing back projection calculation on the two-dimensional coordinates to determine three-dimensional coordinates of the target part in a three-dimensional space, and determining the three-dimensional coordinates as position information of the target part.
In some embodiments, the electronic device may determine the target camera corresponding to the target image in the camera array, as well as the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, and then determine the three-dimensional coordinates of the target portion in three-dimensional space according to the preset internal reference matrix, the preset external reference translation vector, the preset reference parameter and the two-dimensional coordinates. The target camera is the camera that captured the target image.
For example, a preset internal parameter matrix, a preset external parameter translation vector, a preset reference parameter and a two-dimensional coordinate of the target camera are input into a calculation formula to calculate a three-dimensional coordinate;
the calculation formula is as follows:
Z * [u, v, 1]^T = M * (R * [Xw, Yw, Zw]^T + T)
where M is the preset internal reference matrix of the target camera, R is the preset external reference rotation matrix, T is the preset external reference translation vector, Z is the preset reference parameter, u and v are the two-dimensional coordinates of a pixel point of the target portion, and Xw, Yw and Zw are the three-dimensional coordinates of that pixel point.
It should be noted that before shooting the object to be shot, each camera in the camera array needs to be calibrated, and after each camera is calibrated, a preset internal reference matrix, a preset external reference translation vector, and a preset reference parameter of each camera can be determined.
The coordinate system comprises a world coordinate system, a camera coordinate system and an image coordinate system, and coordinate system conversion can be performed among the world coordinate system, the camera coordinate system and the image coordinate system, so that the conversion of coordinates in different coordinate systems is realized.
The world coordinate system (world coordinate) (xw, yw, zw), also called a measurement coordinate system, is a three-dimensional rectangular coordinate system, and can describe the spatial positions of the camera and the object to be measured by taking the coordinate system as a reference; a camera coordinate system (xc, yc, zc) is also a three-dimensional rectangular coordinate system, the origin is located at the optical center of the lens, the xc and yc axes are respectively parallel to two sides of the image plane, and the zc axis is the optical axis of the lens and is perpendicular to the image plane; an image coordinate system (x, y) is a two-dimensional rectangular coordinate system on an image plane. The origin of the image coordinate system is the intersection point (also called principal point) of the lens optical axis and the image plane, its x-axis is parallel to the xc axis of the camera coordinate system, and its y-axis is parallel to the yc axis of the camera coordinate system.
The calculation formula provided by the embodiment of the application is determined based on the conversion relationship between the world coordinate system and the image coordinate system.
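A sketch of the corresponding back-projection calculation, simply inverting the formula above, is given below; the matrix shapes follow the variable definitions, and the function name is illustrative:

```python
import numpy as np

def back_project(u, v, Z, M, R, T):
    """Recover world coordinates (Xw, Yw, Zw) of a target-portion pixel.

    Inverts Z * [u, v, 1]^T = M * (R * Xw + T):
      Xw = R^-1 * (M^-1 * Z * [u, v, 1]^T - T)

    M -- 3x3 preset internal reference (intrinsic) matrix of the target camera
    R -- 3x3 preset external reference rotation matrix
    T -- 3-vector preset external reference translation vector
    Z -- preset reference parameter (depth of the point along the camera axis)
    """
    pixel = np.array([u, v, 1.0])
    cam = np.linalg.inv(M) @ (Z * pixel)            # point in the camera coordinate system
    world = np.linalg.inv(R) @ (cam - np.asarray(T))
    return world                                    # (Xw, Yw, Zw)
```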
207. And acquiring the time corresponding to the target image, and determining the associated sound at the same time in the sound information according to the time corresponding to the target image.
In some embodiments, in order to implement sound and picture synchronization in a subsequent process, the electronic device may determine, in the sound information, an associated sound corresponding to the target image at the same time.
Alternatively, the electronic device may acquire the time corresponding to the target image, and then determine the associated sound at the same time in the sound information according to that time. For example, if the time corresponding to the target image is 3 minutes and 12 seconds, the sound at 3 minutes and 12 seconds is located in the sound information and determined as the associated sound corresponding to the target image at the same time.
208. And determining the position information of the target part in the target image as the sound source position of the associated sound.
It can be understood that the target portion can be the sounding portion of the subject, and the electronic device can determine the position information of the target portion in the target image as the sound source position of the associated sound. In this way, the sound of the subject can be accurately located, and when the volume video is played, the position of the sound can change with the movement of the virtual object (the subject) in the volume video, giving the user a more immersive sound experience when watching the volume video.
For example, when the position of the subject changes across target images in different frames, the position of the target portion of the subject changes, the position information of the target portion changes accordingly, and the sound source position of the associated sound finally changes as well.
209. And generating a volume video corresponding to the subject according to the image information, and storing the associated sound and the sound source position of the associated sound into the volume video.
In some implementations, after the shooting of the subject is completed, the electronic device may generate a volume video from the captured image information. Alternatively, while the subject is still being shot, the electronic device may generate the volume video from the image information captured over a period of time.
In some embodiments, the electronic device may perform feature extraction on the captured image of each camera, so as to obtain a three-dimensional feature corresponding to the captured object. The electronic device can generate a corresponding volume video according to the three-dimensional features, wherein the volume video comprises a three-dimensional model corresponding to the shot object.
In some embodiments, after the electronic device generates the volume video from the image information, the electronic device may determine a time of the associated sound, then determine a target video frame of the same time in the volume video from the time of the associated sound, and finally save the associated sound and the sound source position into the target video frame.
For example, in the generated volume video, each video frame in the volume video corresponds to a time, if the time corresponding to the associated sound is the same as the time of one of the video frames, the video frame is determined as a target video frame, and then a mapping relation is established between the associated sound, the sound source position of the associated sound and the target video frame.
Similarly, for the whole volume video, a mapping relationship between sound and sound source position can be established for each frame of the volume video in the above manner. When the user plays the volume video, the corresponding picture and sound of the volume video can be played through this mapping relationship. In this way, sound and picture are synchronized, the sound source in the volume video is located at the sounding part of the virtual character, more accurate sound source localization is achieved, and the sense of audio immersion while watching is increased.
In the embodiment of the application, the electronic equipment can acquire the sound information and the image information of the photographed object. The face of the target image of the object is identified, and a plurality of feature points on the face of the object are acquired. Then, the position relation between every two feature points in the plurality of feature points is obtained, the target feature points are determined according to the position relation, the target part is determined according to the target feature points, and the corresponding two-dimensional coordinates of the target part in the target image are determined. And performing back projection calculation on the two-dimensional coordinates to determine three-dimensional coordinates of the target part in a three-dimensional space, and determining the three-dimensional coordinates as position information of the target part.
And finally, the time corresponding to the target image is acquired, the associated sound at the same time is determined in the sound information according to the time corresponding to the target image, and the position information of the target portion in the target image is determined as the sound source position of the associated sound. A volume video corresponding to the subject is generated according to the image information, and the associated sound and the sound source position of the associated sound are stored into the volume video.
In this way, the sound source position corresponding to the sound of the subject is set in the volume video, and the sound can later be played from this sound source position when the volume video is played. When the virtual object (the subject) in the volume video moves, the sound source position changes accordingly, which increases the audio immersion of the volume video and allows the user to perceive the change of the virtual object's position in the volume video by hearing alone.
Referring to fig. 4, fig. 4 is a schematic view of a scene of volume video playing according to an embodiment of the present disclosure.
In the picture shown in fig. 4, the virtual object in the volumetric video moves from a left position to a right position, for example, the virtual object in the volumetric video says one sentence on the left side, and then the virtual object goes to the right side to say another sentence.
When the virtual object speaks on the left side, the mouth position of the virtual character is the sound source position, so the sentence spoken on the left side is emitted from sound source position 1. When the virtual object speaks on the right side, the sentence spoken on the right side is emitted from sound source position 2.
When a user wears a virtual device such as VR glasses, the volume video visually provides a near-real experience. Because the sound source position of the volume video changes along with the position of the virtual object in the volume video, the user also enjoys a strong sense of audio immersion.
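Purely for illustration (this is not the playback method of the disclosure), a player could derive simple stereo gains from the stored sound source position relative to the viewer, for example:

```python
import numpy as np

def stereo_gains(source_position, listener_position, listener_right):
    """Derive simple left/right gains from a stored 3D sound source position.

    This only illustrates using the per-frame source position at playback time;
    a real player would use a proper spatial-audio (e.g. HRTF) renderer.
    """
    to_source = np.asarray(source_position) - np.asarray(listener_position)
    to_source = to_source / (np.linalg.norm(to_source) + 1e-8)
    # pan in [-1, 1]: -1 fully left, +1 fully right
    pan = float(np.dot(to_source, np.asarray(listener_right)))
    left = np.sqrt(0.5 * (1.0 - pan))   # equal-power panning
    right = np.sqrt(0.5 * (1.0 + pan))
    return left, right
```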
Referring to fig. 5, fig. 5 is a schematic structural diagram of a volume video generating device according to an embodiment of the present disclosure. The volume video generating apparatus 300 may include:
the acquiring module 310 is used for acquiring sound information and image information of the photographed object.
The acquiring module 310 is further configured to control, in the camera array, a plurality of cameras in the camera array to acquire image information and a microphone to acquire sound information of the subject when the subject starts to be photographed.
The first determining module 320 is configured to determine a target image from the image information, and determine position information of a target portion of a photographed object in the target image.
The first determining module 320 is further used for recognizing the face of the subject in the target image to determine a target portion, wherein the target portion comprises a mouth;
determining a corresponding two-dimensional coordinate of the target part in the target image;
and determining the three-dimensional coordinates of the target part in the three-dimensional space according to the two-dimensional coordinates, and determining the three-dimensional coordinates as the position information of the target part.
The first determining module 320 is further configured to perform a back projection calculation on the two-dimensional coordinates to determine three-dimensional coordinates of the target portion in a three-dimensional space.
The first determining module 320 is further configured to determine a target camera corresponding to the target image in the camera array, and a preset internal reference matrix, a preset external reference translation vector, and a preset reference parameter of the target camera;
and determining the three-dimensional coordinates of the target part in the three-dimensional space according to the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, and the two-dimensional coordinates.
The first determining module 320 is further configured to input the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, together with the two-dimensional coordinates, into a calculation formula to calculate the three-dimensional coordinates;
the calculation formula is as follows:
$$Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = M\left(R\begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T\right)$$

wherein M is the preset internal reference matrix of the target camera, R is the preset external reference rotation matrix, T is the preset external reference translation vector, Z is the preset reference parameter, u and v are the two-dimensional coordinates of the pixel point of the target part, and Xw, Yw and Zw are the three-dimensional coordinates of the pixel point of the target part.
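Under the above formula, the back projection can be sketched in Python with NumPy as follows; this assumes M, R, T and Z are already known for the target camera and is only an illustration of the calculation, not the application's implementation.

```python
import numpy as np

def back_project(u, v, M, R, T, Z):
    """Back-project pixel (u, v) to world coordinates (Xw, Yw, Zw).

    Rearranges Z * [u, v, 1]^T = M (R * Pw + T):
    M -- 3x3 preset internal reference (intrinsic) matrix of the target camera
    R -- 3x3 preset external reference rotation matrix
    T -- 3x1 preset external reference translation vector
    Z -- preset reference (depth) parameter
    """
    uv1 = np.array([u, v, 1.0])
    t = np.asarray(T, dtype=float).reshape(3)
    cam_point = Z * np.linalg.inv(M) @ uv1            # point in camera coordinates
    world_point = np.linalg.inv(R) @ (cam_point - t)  # undo rotation and translation
    return world_point                                # (Xw, Yw, Zw)
```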
The first determining module 320 is further configured to obtain a plurality of feature points on the face of the subject;
and determining target feature points from the plurality of feature points, and determining a target part according to the target feature points.
The first determining module 320 is further configured to obtain a position relationship between each two feature points in the plurality of feature points;
and determining the target characteristic points according to the position relation.
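One simple way to pick the mouth from the detected feature points by their positional relationship is sketched below; the lower-third rule is an assumption chosen for illustration, not the rule claimed by this application.

```python
import numpy as np

def locate_mouth(face_points):
    """Estimate the mouth centre from 2D facial feature points.

    `face_points` is assumed to be an (N, 2) array of (u, v) pixel coordinates.
    Points falling in the lower third of the face bounding box are treated as
    mouth-region points and averaged into a single (u, v) target position.
    """
    pts = np.asarray(face_points, dtype=float)
    v_min, v_max = pts[:, 1].min(), pts[:, 1].max()
    threshold = v_min + 2.0 * (v_max - v_min) / 3.0
    lower = pts[pts[:, 1] > threshold]
    if lower.size == 0:            # degenerate case: fall back to the lowest point(s)
        lower = pts[pts[:, 1] == v_max]
    return lower.mean(axis=0)      # (u, v) of the target part
```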
The second determining module 330 is configured to determine, in the sound information, associated sounds corresponding to the target image at the same time, and determine the position information of the target portion in the target image as the sound source position of the associated sounds.
The second determining module 330 is further configured to obtain a time corresponding to the target image;
and determining the associated sounds at the same time in the sound information according to the time corresponding to the target image.
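A minimal sketch of this time matching, assuming the sound information is kept as (timestamp, samples) chunks on the same clock as the cameras:

```python
def associated_sound(sound_chunks, image_time, frame_interval):
    """Return the audio chunks recorded over the same interval as the target image.

    `sound_chunks` is assumed to be a list of (timestamp, samples) pairs and
    `frame_interval` the duration covered by one image frame.
    """
    start, end = image_time, image_time + frame_interval
    return [samples for t, samples in sound_chunks if start <= t < end]
```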
And the generating module 340 is configured to generate a volume video corresponding to the captured object according to the image information, and store the associated sound and the sound source position of the associated sound in the volume video.
The generating module 340 is further configured to determine the time of the associated sound;
determining a target video frame of the same time in the volume video according to the time of the associated sound;
and storing the associated sound and the sound source position into the target video frame.
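The saving step could look like the following sketch, assuming each volume video frame is an object with a time attribute and fields reserved for the associated sound; these names are illustrative.

```python
def attach_sound(volume_frames, sound_time, samples, source_position):
    """Attach an associated sound and its sound source position to the frame at the same time.

    `volume_frames` is assumed to be a non-empty list of frame objects with a `time`
    attribute; the frame closest in time to the sound is taken as the target video frame.
    """
    target = min(volume_frames, key=lambda f: abs(f.time - sound_time))  # target video frame
    target.audio = samples                      # associated sound
    target.audio_position = source_position     # sound source position (Xw, Yw, Zw)
    return target
```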
In the embodiment of the application, the electronic equipment acquires the sound information and the image information of a shot object; determines a target image in the image information and determines the position information of a target part of the shot object in the target image; determines the associated sound corresponding to the target image at the same time in the sound information, and determines the position information of the target part in the target image as the sound source position of the associated sound; and generates a volume video corresponding to the shot object according to the image information, and saves the associated sound and the sound source position of the associated sound into the volume video. In this way, the virtual object in the volume video has a specific sound source position when it makes a sound, so that the audience can hear the sound from the accurate sound source position.
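Putting the pieces together, one frame of such a flow might be chained as below; this reuses the illustrative helpers sketched earlier and takes the face landmark detector and the volumetric reconstruction step as caller-supplied callables, since both are outside the scope of these sketches.

```python
def build_frame(image, image_time, sound_chunks, camera_params, frame_interval,
                detect_landmarks, reconstruct_frame):
    """Chain the steps for one target image (illustrative only).

    camera_params     -- (M, R, T, Z) of the target camera
    detect_landmarks  -- callable returning facial feature points for an image
    reconstruct_frame -- callable returning a volume video frame object for an image
    """
    u, v = locate_mouth(detect_landmarks(image))          # target part in the 2D image
    M, R, T, Z = camera_params
    source_position = back_project(u, v, M, R, T, Z)      # 3D sound source position
    samples = associated_sound(sound_chunks, image_time, frame_interval)
    frame = reconstruct_frame(image)                      # volume video frame
    frame.audio, frame.audio_position = samples, source_position
    return frame
```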
Accordingly, an electronic device 400 may include, as shown in fig. 6, a memory 401 including one or more computer-readable storage media, an input unit 402, a display unit 403, a sensor 404, a processor 405 including one or more processing cores, and a power supply 406. Those skilled in the art will appreciate that the electronic device configuration shown in fig. 6 does not constitute a limitation of the electronic device and may include more or fewer components than those shown, or some components may be combined, or a different arrangement of components. Wherein:
the memory 401 may be used to store software programs and modules, and the processor 405 executes various functional applications and data processing by running the software programs and modules stored in the memory 401. The memory 401 may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required by at least one function (such as a sound playing function, an image playing function, etc.), and the like; the data storage area may store data (such as audio data, a phonebook, etc.) created according to the use of the electronic device, and the like. Further, the memory 401 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device. Accordingly, the memory 401 may further include a memory controller to provide the processor 405 and the input unit 402 with access to the memory 401.
The input unit 402 may be used to receive input numeric or character information and generate keyboard, mouse, joystick, optical or trackball signal inputs related to user settings and function control. In particular, in one particular embodiment, input unit 402 may include a touch-sensitive surface as well as other input devices. The touch-sensitive surface, also referred to as a touch display screen or a touch pad, may collect touch operations by a user (such as operations by the user on or near the touch-sensitive surface using a finger, a stylus, or any other suitable object or attachment) thereon or nearby, and drive the corresponding connection device according to a predetermined program. Alternatively, the touch sensitive surface may comprise two parts, a touch detection means and a touch controller. The touch detection device detects the touch direction of a user, detects a signal brought by touch operation and transmits the signal to the touch controller; the touch controller receives touch information from the touch sensing device, converts the touch information into touch point coordinates, sends the touch point coordinates to the processor 405, and can receive and execute commands sent by the processor 405. In addition, the touch sensitive surface can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. The input unit 402 may include other input devices in addition to a touch-sensitive surface. In particular, other input devices may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control keys, switch keys, etc.), a trackball, a mouse, a joystick, and the like.
The display unit 403 may be used to display information input by or provided to a user and various graphical user interfaces of the electronic device, which may be made up of graphics, text, icons, video, and any combination thereof. The Display unit 403 may include a Display panel, and optionally, the Display panel may be configured in the form of a Liquid Crystal Display (LCD), an Organic Light-Emitting Diode (OLED), or the like. Further, the touch-sensitive surface may overlay the display panel, and when a touch operation is detected on or near the touch-sensitive surface, the touch operation is transmitted to the processor 405 to determine the type of touch event, and the processor 405 then provides a corresponding visual output on the display panel according to the type of touch event. Although in FIG. 6 the touch-sensitive surface and the display panel are two separate components to implement input and output functions, in some embodiments the touch-sensitive surface may be integrated with the display panel to implement input and output functions.
The electronic device may also include at least one sensor 404, such as a light sensor, motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor that adjusts the brightness of the display panel according to the brightness of ambient light, and a proximity sensor that turns off the display panel and/or the backlight when the electronic device is moved to the ear. As one of the motion sensors, the gravity acceleration sensor can detect the magnitude of acceleration in each direction (generally, three axes), detect the magnitude and direction of gravity when the device is stationary, and can be used for applications (such as horizontal and vertical screen switching, related games, magnetometer attitude calibration) for recognizing the attitude of an electronic device, and related functions (such as pedometer and tapping) for vibration recognition; as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, which may be further configured to the electronic device, detailed descriptions thereof are omitted.
The processor 405 is a control center of the electronic device, connects various parts of the whole electronic device by using various interfaces and lines, performs various functions of the electronic device and processes data by operating or executing software programs and/or modules stored in the memory 401 and calling data stored in the memory 401, thereby performing overall monitoring of the electronic device. Optionally, processor 405 may include one or more processing cores; preferably, the processor 405 may integrate an application processor, which mainly handles operating systems, user interfaces, application programs, etc., and a modem processor, which mainly handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 405.
The electronic device also includes a power supply 406 (e.g., a battery) for powering the various components, which may preferably be logically coupled to the processor 405 via a power management system, such that functions such as managing charging, discharging, and power consumption are performed via the power management system. The power supply 406 may also include any component of one or more dc or ac power sources, recharging systems, power failure detection circuitry, power converters or inverters, power status indicators, and the like.
Although not shown, the electronic device may further include a camera, a Bluetooth module, and the like, which are not described in detail herein. Specifically, in this embodiment, the processor 405 in the electronic device loads the computer program stored in the memory 401 and, by executing the computer program, implements the following steps of the volume video generation method:
acquiring sound information and image information of a shot object;
determining a target image in the image information and determining position information of a target part of a shot object in the target image;
determining the associated sounds corresponding to the target image at the same time in the sound information, and determining the position information of the target part in the target image as the sound source position of the associated sounds;
and generating a volume video corresponding to the shot object according to the image information, and storing the related sound and the sound source position of the related sound into the volume video.
It will be understood by those skilled in the art that all or part of the steps of the methods of the above embodiments may be performed by instructions or by associated hardware controlled by the instructions, which may be stored in a computer readable storage medium and loaded and executed by a processor.
To this end, the present application provides a computer-readable storage medium, in which a plurality of instructions are stored, where the instructions can be loaded by a processor to execute the steps in any one of the volumetric video generation methods provided in the embodiments of the present application. For example, the instructions may perform the steps of:
acquiring sound information and image information of a shot object;
determining a target image in the image information and determining position information of a target part of a shot object in the target image;
determining the associated sounds corresponding to the target image at the same time in the sound information, and determining the position information of the target part in the target image as the sound source position of the associated sounds;
and generating a volume video corresponding to the shot object according to the image information, and storing the related sound and the sound source position of the related sound into the volume video.
The above operations can be implemented in the foregoing embodiments, and are not described in detail herein.
Wherein the storage medium may include: Read-Only Memory (ROM), Random Access Memory (RAM), magnetic or optical disks, and the like.
Since the instructions stored in the storage medium may execute the steps in any one of the volume video generation methods provided in the embodiments of the present application, the beneficial effects that can be achieved by any one of the volume video generation methods provided in the embodiments of the present application can be achieved; for details, see the foregoing embodiments, which are not described herein again.
The volume video generation method, the volume video generation device, the electronic device, and the storage medium provided in the embodiments of the present application are described in detail above. Specific examples are used herein to explain the principle and implementation of the present application, and the description of the above embodiments is only intended to help understand the method and its core idea. Meanwhile, those skilled in the art may make changes to the specific embodiments and the application scope according to the idea of the present application. In summary, the content of this specification should not be construed as limiting the present application.

Claims (13)

1. A method of volumetric video generation, comprising:
acquiring sound information and image information of a shot object;
determining a target image in the image information and determining position information of a target part of the shot object in the target image;
determining the associated sound corresponding to the target image at the same time in the sound information, and determining the position information of the target part in the target image as the sound source position of the associated sound;
and generating a volume video corresponding to the shot object according to the image information, and storing the associated sound and the sound source position of the associated sound into the volume video.
2. The method according to claim 1, wherein the acquiring sound information and image information of the subject includes:
in the camera array, when shooting of the shot object is started, a plurality of cameras in the camera array are controlled to acquire the image information, and a microphone is controlled to acquire sound information of the shot object.
3. The method for generating a volume video according to claim 1, wherein the target image is a two-dimensional image, and the determining the position information of the target part of the subject in the target image comprises:
recognizing the face of the subject in the target image to determine the target part, wherein the target part comprises a mouth;
determining the corresponding two-dimensional coordinates of the target part in the target image;
and determining the three-dimensional coordinate of the target part in a three-dimensional space according to the two-dimensional coordinate, and determining the three-dimensional coordinate as the position information of the target part.
4. The method for generating the volume video according to claim 3, wherein said determining the three-dimensional coordinates of the target portion in the three-dimensional space according to the two-dimensional coordinates comprises:
and carrying out back projection calculation on the two-dimensional coordinates to determine the three-dimensional coordinates of the target part in the three-dimensional space.
5. The method for generating a volume video according to claim 4, wherein the back-projecting the two-dimensional coordinates to determine the three-dimensional coordinates of the target part in the three-dimensional space comprises:
determining a target camera corresponding to the target image in the camera array, and a preset internal reference matrix, a preset external reference translation vector and a preset reference parameter of the target camera;
and determining the three-dimensional coordinates of the target part in the three-dimensional space according to the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, and the two-dimensional coordinates.
6. The method for generating a volume video according to claim 5, wherein the determining the three-dimensional coordinates of the target part in the three-dimensional space according to the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, and the two-dimensional coordinates comprises:
inputting the preset internal reference matrix, the preset external reference translation vector and the preset reference parameter of the target camera, together with the two-dimensional coordinates, into a calculation formula to calculate the three-dimensional coordinates;
the calculation formula is as follows:
$$Z\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = M\left(R\begin{bmatrix} X_w \\ Y_w \\ Z_w \end{bmatrix} + T\right)$$

wherein M is the preset internal reference matrix of the target camera, R is the preset external reference rotation matrix, T is the preset external reference translation vector, Z is the preset reference parameter, u and v are the two-dimensional coordinates of the pixel point of the target part, and Xw, Yw and Zw are the three-dimensional coordinates of the pixel point of the target part.
7. The method for generating a volume video according to claim 3, wherein the recognizing the face of the subject in the target image to determine the target portion comprises:
acquiring a plurality of feature points on the face of the subject;
and determining target feature points from the plurality of feature points, and determining the target part according to the target feature points.
8. The method for generating a volumetric video according to claim 7, wherein the determining a target feature point among the plurality of feature points comprises:
acquiring the position relation between every two feature points in the plurality of feature points;
and determining the target feature points according to the position relation.
9. The method for generating a volume video according to claim 1, wherein the determining the associated sound corresponding to the target image at the same time in the sound information comprises:
acquiring time corresponding to the target image;
and determining the associated sounds at the same time in the sound information according to the time corresponding to the target image.
10. The method according to claim 1, wherein the saving the associated sound and the sound source position of the associated sound into the volume video comprises:
determining a time of the associated sound;
determining a target video frame of the same time in the volume video according to the time of the associated sound;
saving the associated sound and the sound source position into the target video frame.
11. A volumetric video generation apparatus, comprising:
the acquisition module is used for acquiring sound information and image information of a shot object;
the first determining module is used for determining a target image in the image information and determining the position information of a target part of the shot object in the target image;
the second determining module is used for determining the associated sound corresponding to the target image at the same time in the sound information and determining the position information of the target part in the target image as the sound source position of the associated sound;
and the generating module is used for generating a volume video corresponding to the shot object according to the image information and storing the associated sound and the sound source position of the associated sound into the volume video.
12. An electronic device, comprising:
a memory storing executable program code, a processor coupled with the memory;
the processor calls the executable program code stored in the memory to perform the steps in the volumetric video generation method according to any of claims 1 to 10.
13. A storage medium storing a plurality of instructions adapted to be loaded by a processor to perform the steps of the volumetric video generation method according to any one of claims 1 to 10.
Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination