CN115457176A - Image generation method and device, electronic equipment and storage medium


Info

Publication number
CN115457176A
Authority
CN
China
Prior art keywords
video frame
target
joint point
dimensional coordinates
frame group
Prior art date
Legal status
Pending
Application number
CN202211168096.3A
Other languages
Chinese (zh)
Inventor
保俊杉
龙翔
Current Assignee
Beijing QIYI Century Science and Technology Co Ltd
Original Assignee
Beijing QIYI Century Science and Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing QIYI Century Science and Technology Co Ltd
Priority to CN202211168096.3A
Publication of CN115457176A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 13/00 Animation
    • G06T 13/20 3D [Three Dimensional] animation
    • G06T 13/40 3D [Three Dimensional] animation of characters, e.g. humans, animals or virtual beings
    • G06T 7/00 Image analysis
    • G06T 7/80 Analysis of captured images to determine intrinsic or extrinsic camera parameters, i.e. camera calibration
    • G06T 7/85 Stereo camera calibration
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00 Arrangements for image or video recognition or understanding
    • G06V 10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V 10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V 10/761 Proximity, similarity or dissimilarity measures
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/40 Scenes; Scene-specific elements in video content
    • G06V 20/49 Segmenting video sequences, i.e. computational techniques such as parsing or cutting the sequence, low-level clustering or determining units such as shots or scenes
    • G06V 40/00 Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V 40/20 Movements or behaviour, e.g. gesture recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Artificial Intelligence (AREA)
  • Psychiatry (AREA)
  • Social Psychology (AREA)
  • Human Computer Interaction (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the invention provides an image generation method, an image generation device, electronic equipment and a storage medium, and relates to the technical field of image processing, wherein video frames with the same timestamp are respectively acquired from original videos of a target scene to obtain a video frame group; performing gesture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame; for each joint point of the target object, calculating a target three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene based on the two-dimensional coordinate of the joint point in each video frame in the video frame group and each conversion relation between the image coordinate system of each video frame in the video frame group and the three-dimensional coordinate system of the target scene; and adjusting the three-dimensional coordinates of the joint points of the virtual object in each preset image according to the target three-dimensional coordinates of the joint points of the target object corresponding to each video frame group in the target scene to obtain the target video so as to improve the video generation efficiency.

Description

Image generation method and device, electronic equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image generation method and apparatus, an electronic device, and a storage medium.
Background
When making a video that includes a virtual object (e.g., a virtual character, a virtual animal, etc.), a person wears a motion-capture device, such as an optical capture device or an inertial capture device; posture data is collected while the person performs a specified motion, and video images of the person performing that motion are captured at the same time. The posture of the virtual object in a preset image is then adjusted according to the captured video images and the collected posture data of the person, yielding a target video containing the virtual object performing the specified motion.
However, motion-capture equipment is expensive, and shooting often has to be repeated many times to collect a person's posture data. The above process therefore requires considerable time and labor, so video generation in the related art is inefficient.
Disclosure of Invention
The embodiment of the invention aims to provide an image generation method, an image generation device, electronic equipment and a storage medium, so as to improve the efficiency of video generation. The specific technical scheme is as follows:
in a first aspect of the present invention, there is provided an image generating method, including:
acquiring original videos of a target scene under multiple visual angles, and acquiring multiple video frames with the same timestamp from each original video to obtain a video frame group; wherein the timestamp of a video frame represents the position of the video frame in the original video to which it belongs;
aiming at each video frame in the video frame group, carrying out gesture recognition on the video frame to obtain two-dimensional coordinates of each joint point of a target object in the video frame;
for each joint point of the target object, calculating three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relation between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene;
and adjusting the three-dimensional coordinates of all joint points of the virtual object in each preset image according to the target three-dimensional coordinates of all joint points of the target object in the target scene corresponding to all video frame groups to obtain the target video containing the virtual object with the same action as the target object.
Optionally, before the calculating, for each joint point of the target object, of the three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates, based on the two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relationship between the image coordinate system of each video frame in the video frame group and the three-dimensional coordinate system of the target scene, the method further includes:
the same target object in each video frame in the video frame group is determined based on two-dimensional coordinates of joint points of the plurality of target objects in each video frame in the video frame group.
Optionally, the determining, based on two-dimensional coordinates of joint points of multiple target objects in each video frame in the video frame group, the same target object in each video frame in the video frame group includes:
calculating epipolar line distances from all joint points of the target object to the corresponding epipolar line plane for each target object, and calculating the mean value of all epipolar line distances corresponding to all joint points of the target object to obtain the mean value of the distances corresponding to the target object; wherein, an epipolar line plane corresponding to one joint point represents a plane where the joint point is located in the target scene;
calculating the similarity of two target objects in the two video frames based on the distance mean value corresponding to each two target objects in the two video frames to obtain a first similarity matrix; wherein one element in the first similarity matrix represents: the probability that two corresponding target objects in the two video frames are the same;
inputting the two video frames into a pre-trained object matching model to obtain the similarity of every two target objects in the two video frames and obtain a second similarity matrix; wherein an element in the second similarity matrix represents: the probability that two corresponding target objects in the two video frames are the same;
fusing the first similarity matrix and the second similarity matrix to obtain a target similarity matrix;
and determining the same target object in each video frame in the video frame group based on the target similarity matrix.
Optionally, the calculating, for each joint point of the target object, a three-dimensional coordinate of the joint point in the target scene corresponding to the video frame group based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relationship between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene as a target three-dimensional coordinate includes:
for each joint point of the target object, calculating a three-dimensional coordinate of the joint point in the target scene as a three-dimensional coordinate to be processed based on two-dimensional coordinates of the joint point in each two video frames in the video frame group and each conversion relation between an image coordinate system of the two video frames and a three-dimensional coordinate system of the target scene; calculating the average value of each three-dimensional coordinate to be processed as the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene;
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
Optionally, the performing gesture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame includes:
for each video frame in the video frame group, inputting the video frame into a pre-trained gesture recognition model to obtain the two-dimensional coordinates and corresponding confidence degrees of each joint point of the target object in the video frame; wherein the confidence corresponding to the two-dimensional coordinates of a joint point represents: a probability that the joint point is located, in the video frame, at the position represented by the two-dimensional coordinates;
the calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene includes:
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the confidence degree corresponding to the two-dimensional coordinates of the joint point in each video frame in the video frame group and the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
Optionally, the calculating, based on the confidence corresponding to the two-dimensional coordinate of the joint point in each video frame in the video frame group and the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene, the target three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene includes:
selecting one joint point from all joint points of the target object as a father joint point;
for each child joint point of the parent joint point, calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group, as a first average confidence; wherein the child joint points of the parent joint point include: the joint points, among the joint points of the target object, that are connected to the parent joint point;
if the first average confidence is smaller than a preset threshold, calculating an offset value of the target three-dimensional coordinate of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the first video frame group, as a first offset value; wherein the first video frame group includes the video frames that, in each original video, precede the video frames in the current video frame group and occupy the same position across the original videos; and the target three-dimensional coordinates of a joint point corresponding to one video frame group are determined based on the two-dimensional coordinates of the joint point in each video frame in that video frame group;
calculating an offset value of the target three-dimensional coordinate of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the second video frame group, as a second offset value; wherein the second video frame group includes the video frames that, in each original video, follow the video frames in the current video frame group and occupy the same position across the original videos;
calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group;
and if the first average confidence is not smaller than the preset threshold, taking the initial three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
Optionally, after the calculating, based on the first offset value, the second offset value, and the target three-dimensional coordinate of the parent joint point corresponding to the video frame group, the target three-dimensional coordinate of the child joint point corresponding to the video frame group in the target scene, the method further includes:
and taking the child joint point of the target object as a new parent joint point, and returning to the step of calculating, for each child joint point of the parent joint point, the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as the first average confidence, until the target three-dimensional coordinates of every joint point of the target object corresponding to the video frame group in the target scene are obtained.
Optionally, the calculating, based on the first offset value, the second offset value, and the target three-dimensional coordinate of the parent joint point corresponding to the video frame group, the target three-dimensional coordinate of the child joint point corresponding to the video frame group in the target scene includes:
calculating a mean value of the first offset value and the second offset value as an average offset value;
and calculating the sum of the average offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
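By way of illustration only, this correction for a low-confidence child joint point can be sketched in NumPy under the steps just stated (mean of the two offset values added to the parent's current coordinate); all function and argument names here are hypothetical, not taken from the patent:

```python
import numpy as np

def correct_child_joint(parent_cur, child_prev, parent_prev, child_next, parent_next):
    # All arguments are target three-dimensional coordinates: *_prev come
    # from the first (preceding) video frame group, *_next from the second
    # (following) one, and parent_cur from the current video frame group.
    first_offset = np.asarray(child_prev) - np.asarray(parent_prev)
    second_offset = np.asarray(child_next) - np.asarray(parent_next)
    # Mean of the two offset values, applied to the parent's current coordinate.
    average_offset = (first_offset + second_offset) / 2.0
    return np.asarray(parent_cur) + average_offset
```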
Optionally, before selecting one joint from the joints of the target object as a parent joint, the method further includes:
determining a designated central joint point from the joint points of the target object;
calculating the mean value of confidence degrees corresponding to the two-dimensional coordinates of the central joint point in each video frame in the video frame group as a second mean confidence degree;
if the second average confidence is smaller than a preset threshold, calculating the average of the target three-dimensional coordinates of the central joint point corresponding to a third video frame group and the target three-dimensional coordinates of the central joint point corresponding to a fourth video frame group, to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene; wherein the third video frame group includes the video frames that, in each original video, precede the video frames in the current video frame group and occupy the same position across the original videos; and the fourth video frame group includes the video frames that, in each original video, follow the video frames in the current video frame group and occupy the same position across the original videos;
if the second average confidence coefficient is not smaller than the preset threshold value, acquiring an initial three-dimensional coordinate of the central joint point corresponding to the video frame group to obtain a target three-dimensional coordinate of the central joint point corresponding to the video frame group in the target scene;
the selecting one joint point from the joint points of the target object as a parent joint point comprises:
and determining the central joint point as a parent joint point from all joint points of the target object.
In a second aspect of the present invention, there is also provided an image generating apparatus, comprising:
an acquisition module, configured to acquire original videos of a target scene under multiple visual angles, and acquire multiple video frames with the same timestamp from each original video to obtain a video frame group; wherein the timestamp of a video frame represents the position of the video frame in the original video to which it belongs;
the recognition module is used for recognizing the gesture of each video frame in the video frame group to obtain the two-dimensional coordinates of each joint point of the target object in the video frame;
a first determining module, configured to calculate, for each joint point of the target object, a three-dimensional coordinate of the joint point in the target scene corresponding to the video frame group as a target three-dimensional coordinate based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relationship between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene;
and the generating module is used for adjusting the three-dimensional coordinates of all joint points of the virtual object in each preset image according to the target three-dimensional coordinates, in the target scene, of all joint points of the target object corresponding to all video frame groups, to obtain a target video containing a virtual object that performs the same action as the target object.
Optionally, the apparatus further comprises:
and the matching module is used for executing each joint point aiming at the target object at the first determining module, calculating the three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the two-dimensional coordinates of the joint point in each video frame in the video frame group and the conversion relation between the image coordinate system of each video frame in the video frame group and the three-dimensional coordinate system of the target scene, and determining the same target object in each video frame in the video frame group based on the two-dimensional coordinates of the joint points of a plurality of target objects in each video frame in the video frame group before the target three-dimensional coordinates are used as the two-dimensional coordinates.
Optionally, the matching module is specifically configured to calculate, for each target object, an epipolar line distance from each joint point of the target object to the corresponding epipolar line plane, and calculate a mean value of each epipolar line distance corresponding to each joint point of the target object, to obtain a distance mean value corresponding to the target object; wherein, the epipolar line plane corresponding to one joint point represents the plane of the joint point in the target scene;
for every two video frames in the video frame group, calculating the similarity of every two target objects based on the distance mean value corresponding to every two target objects in the two video frames to obtain a first similarity matrix; wherein one element in the first similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
inputting the two video frames into a pre-trained object matching model to obtain the similarity of every two target objects in the two video frames and obtain a second similarity matrix; wherein an element in the second similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
fusing the first similarity matrix and the second similarity matrix to obtain a target similarity matrix;
and determining the same target object in each video frame in the video frame group based on the target similarity matrix.
Optionally, the first determining module is specifically configured to, for each joint point of the target object, calculate, based on two-dimensional coordinates of the joint point in each two video frames in the video frame group and each conversion relationship between image coordinate systems of the two video frames and a three-dimensional coordinate system of the target scene, a three-dimensional coordinate of the joint point in the target scene as a to-be-processed three-dimensional coordinate; calculating the average value of each three-dimensional coordinate to be processed as the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene;
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
Optionally, the recognition module is specifically configured to, for each video frame in the video frame group, input the video frame into a pre-trained gesture recognition model to obtain two-dimensional coordinates and a corresponding confidence of each joint point of the target object in the video frame; wherein, the confidence corresponding to the two-dimensional coordinates of a joint point represents: a probability that the joint point is located at a position represented by the two-dimensional coordinates in the video frame;
the first determining module is specifically configured to calculate, based on a confidence degree corresponding to a two-dimensional coordinate of the joint point in each video frame in the video frame group and an initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene, a target three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene.
Optionally, the first determining module is specifically configured to select one joint point from the joint points of the target object as a parent joint point;
for each child joint point of the parent joint point, calculate the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group, as a first average confidence; wherein the child joint points of the parent joint point include: the joint points, among the joint points of the target object, that are connected to the parent joint point;
if the first average confidence is smaller than a preset threshold, calculate an offset value of the target three-dimensional coordinate of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the first video frame group, as a first offset value; wherein the first video frame group includes the video frames that, in each original video, precede the video frames in the current video frame group and occupy the same position across the original videos; and the target three-dimensional coordinates of a joint point corresponding to one video frame group are determined based on the two-dimensional coordinates of the joint point in each video frame in that video frame group;
calculate an offset value of the target three-dimensional coordinate of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the second video frame group, as a second offset value; wherein the second video frame group includes the video frames that, in each original video, follow the video frames in the current video frame group and occupy the same position across the original videos;
calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group;
and if the first average confidence is not smaller than the preset threshold, take the initial three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
Optionally, the apparatus further comprises:
and a processing module, configured to, after the first determining module calculates the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, take the child joint point of the target object as a new parent joint point and trigger the first determining module to perform, for each child joint point of the parent joint point, the step of calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as the first average confidence, until the target three-dimensional coordinates of every joint point of the target object corresponding to the video frame group in the target scene are obtained.
Optionally, the first determining module is specifically configured to calculate a mean value of the first offset value and the second offset value as an average offset value;
and calculate the sum of the average offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
Optionally, the apparatus further comprises:
a second determining module, configured to determine a specified central joint point from the joint points of the target object before the first determining module selects one joint point from the joint points of the target object to serve as a parent joint point;
a third determining module, configured to calculate an average value of confidence degrees corresponding to two-dimensional coordinates of the central joint point in each video frame in the video frame group, as a second average confidence degree;
a fourth determining module, configured to, if the second average confidence is smaller than a preset threshold, calculate the average of the target three-dimensional coordinates of the central joint point corresponding to a third video frame group and the target three-dimensional coordinates of the central joint point corresponding to a fourth video frame group, to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene; wherein the third video frame group includes the video frames that, in each original video, precede the video frames in the current video frame group and occupy the same position across the original videos; and the fourth video frame group includes the video frames that, in each original video, follow the video frames in the current video frame group and occupy the same position across the original videos;
a fifth determining module, configured to, if the second average confidence is not smaller than the preset threshold, obtain an initial three-dimensional coordinate of the central joint point corresponding to the video frame group, and obtain a target three-dimensional coordinate of the central joint point corresponding to the video frame group in the target scene;
the first determining module determines the central joint point as a parent joint point from the joint points of the target object.
In another aspect of the present invention, there is also provided an electronic device, including a processor, a communication interface, a memory and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor for implementing any of the image generation method steps described above when executing a program stored in the memory.
In yet another aspect of the present invention, there is also provided a computer-readable storage medium having stored therein a computer program which, when executed by a processor, implements any of the image generation method steps described above.
In yet another aspect of the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform any of the image generation methods described above.
The image generation method provided by the embodiment of the invention comprises the steps of acquiring original videos of a target scene under multiple visual angles, and acquiring multiple video frames with the same timestamp from each original video to obtain a video frame group; the time stamp of one video frame represents the position of the video frame in the original video; performing gesture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame; for each joint point of a target object, calculating three-dimensional coordinates of the joint point corresponding to the video frame group in a target scene as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame in the video frame group and conversion relations between image coordinate systems of the video frames in the video frame group and three-dimensional coordinate systems of the target scene; and adjusting the three-dimensional coordinates of all joint points of the virtual object in each preset image according to the target three-dimensional coordinates of all joint points of the target object corresponding to all the video frame groups in the target scene to obtain the target video containing the virtual object with the same action as the target object in each original video.
Based on the above processing, the target three-dimensional coordinates of each joint point of the target object in the target scene can be determined through the two-dimensional coordinates of the target object in each video frame under multiple viewing angles and each conversion relationship between the image coordinate system of each video frame and the three-dimensional coordinate system of the target scene. The target three-dimensional coordinates of the joints of a target object in the target scene may represent the three-dimensional pose of the target object, i.e. the three-dimensional pose of the target object may be determined without the target object wearing a device for motion capture. Furthermore, the target video is generated based on the three-dimensional posture of the target object, so that the time cost and the labor cost for generating the video can be reduced, and the video generation efficiency is improved.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.
Fig. 1 is a flowchart of an image generation method provided in an embodiment of the present invention;
fig. 2 is a schematic diagram of a camera calibration provided in an embodiment of the present invention;
FIG. 3 (a) is a diagram illustrating an articulation point of a target object according to an embodiment of the present invention;
FIG. 3 (b) is a diagram illustrating an articulation point of another target object provided in an embodiment of the present invention;
fig. 4 (a) is a schematic diagram of a video frame including a target scene provided in an embodiment of the present invention;
fig. 4 (b) is a schematic diagram of another video frame including a target scene provided in the embodiment of the present invention;
FIG. 5 is a flow chart of another image generation method provided in embodiments of the present invention;
FIG. 6 is a flow chart of another image generation method provided in embodiments of the present invention;
FIG. 7 is a schematic diagram of a target similarity matrix according to an embodiment of the present invention;
FIG. 8 is a flow chart of another image generation method provided in embodiments of the present invention;
FIG. 9 is a flow chart of another image generation method provided in embodiments of the present invention;
FIG. 10 is a flow chart of another image generation method provided in embodiments of the present invention;
FIG. 11 (a) is a diagram illustrating an articulation point of a target object according to an embodiment of the present invention;
FIG. 11 (b) is a diagram illustrating an articulation point of another target object provided in an embodiment of the present invention;
FIG. 11 (c) is a diagram illustrating an articulation point of another target object provided in an embodiment of the present invention;
FIG. 11 (d) is a diagram illustrating an articulation point of another target object provided in an embodiment of the present invention;
fig. 12 is a flowchart of a method for determining target three-dimensional coordinates of a target object according to an embodiment of the present invention;
fig. 13 is a flowchart of a target three-dimensional coordinate determination method of another target object provided in the embodiment of the present invention;
fig. 14 is a structural diagram of an image generating apparatus provided in an embodiment of the present invention;
fig. 15 is a structural diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be described below with reference to the drawings in the embodiments of the present invention.
In the related art, when a video including a virtual object (e.g., a virtual character, a virtual animal, etc.) is made, a person wears a motion-capture device, posture data is collected while the person performs a specified motion, and video images of the person performing that motion are captured. The posture of the virtual object in a preset image is then adjusted according to the captured video images and the collected posture data of the person, yielding a target video containing the virtual object performing the specified motion. Because motion-capture equipment is expensive and shooting needs to be repeated many times to collect a person's posture data, the above process requires considerable time and labor, which makes video generation in the related art inefficient.
In order to solve the above problem, an embodiment of the present invention provides an image generation method, which is applied to an electronic device, where the electronic device may be a server. The electronic device can acquire each original video of a target scene under a plurality of visual angles, and determine target three-dimensional coordinates of each joint point of the target object in the target scene based on two-dimensional coordinates of each joint point of the target object in each video frame contained in each original video, wherein the target three-dimensional coordinates of each joint point of the target object in the target scene can represent the three-dimensional posture of the target object, namely, the three-dimensional posture of the target object can be determined without the target object wearing a device for motion capture. Furthermore, the target video is generated based on the three-dimensional posture of the target object, so that the time cost and the labor cost for generating the video can be reduced, and the video generation efficiency is improved.
Referring to fig. 1, fig. 1 is a flowchart of an image generation method according to an embodiment of the present invention, where the method may include the following steps:
s101: the method comprises the steps of obtaining original videos of a target scene under multiple visual angles, and obtaining multiple video frames with the same timestamp from the original videos respectively to obtain a video frame group.
Wherein the timestamp of a video frame indicates the position of the video frame in the original video to which the video frame belongs.
S102: and performing gesture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame.
S103: and calculating the three-dimensional coordinates of the joint points corresponding to the video frame group in the target scene as target three-dimensional coordinates based on the two-dimensional coordinates of the joint points in the video frames in the video frame group and the conversion relations between the image coordinate systems of the video frames in the video frame group and the three-dimensional coordinate systems of the target scene.
S104: and adjusting the three-dimensional coordinates of all joint points of the virtual object in each preset image according to the target three-dimensional coordinates, in the target scene, of all joint points of the target object corresponding to all video frame groups, to obtain a target video containing a virtual object that performs the same action as the target object.
Based on the image generation method provided by the embodiment of the invention, the target three-dimensional coordinates of each joint point of the target object in the target scene can be determined through the two-dimensional coordinates of the target object in each video frame under multiple visual angles and each conversion relation between the image coordinate system of each video frame and the three-dimensional coordinate system of the target scene. The target three-dimensional coordinates of the joints of a target object in the target scene may represent the three-dimensional pose of the target object, i.e. the three-dimensional pose of the target object may be determined without the target object wearing a device for motion capture. Furthermore, the target video is generated based on the three-dimensional posture of the target object, so that the time cost and the labor cost for generating the video can be reduced, and the video generation efficiency is improved.
For step S101, the target scene may be any scene containing a target object, and the target object may be a person or the like. For example, the target scene may be a stage or the like in which a character performs.
In the embodiment of the present invention, a plurality of cameras may be installed at different positions of the target scene, and the plurality of cameras may photograph the target object in the target scene, and the plurality of cameras may be RGB (Red Green Blue) cameras. For example, when the target scene is a stage, a plurality of cameras may be erected around the stage, and it is sufficient that most cameras can shoot people in the stage.
For example, referring to fig. 2, the target scene is a stage, a checkerboard may be laid at a central position of the stage, and each camera around the stage may be calibrated for internal reference and external reference by using the checkerboard. Accordingly, each camera can clearly shoot the people on the stage. For example, each image shown in fig. 2 is an image of a target scene at a different viewing angle, and each image shown in fig. 2 includes an image of a clear and complete checkerboard calibration.
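As a hedged illustration of this calibration step (not part of the patent text), the internal and external parameters of each camera can be estimated from checkerboard images with a standard routine such as OpenCV's; the board size and square size below are assumptions:

```python
import cv2
import numpy as np

def calibrate_camera(checkerboard_images, board_size=(9, 6), square_size_mm=25.0):
    # board_size and square_size_mm are illustrative assumptions,
    # not values specified in the patent.
    objp = np.zeros((board_size[0] * board_size[1], 3), np.float32)
    objp[:, :2] = np.mgrid[0:board_size[0], 0:board_size[1]].T.reshape(-1, 2)
    objp *= square_size_mm  # board corner coordinates in its own plane (Z = 0)

    obj_points, img_points = [], []
    image_size = None
    for img in checkerboard_images:
        gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
        image_size = gray.shape[::-1]  # (width, height)
        found, corners = cv2.findChessboardCorners(gray, board_size)
        if found:
            obj_points.append(objp)
            img_points.append(corners)

    # K holds the camera's internal parameters; rvecs/tvecs give the
    # external parameters (rotation/translation) for each calibration view.
    _, K, dist, rvecs, tvecs = cv2.calibrateCamera(
        obj_points, img_points, image_size, None, None)
    return K, dist, rvecs, tvecs
```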
Furthermore, the multiple cameras synchronously capture images of the target scene under multiple viewing angles to obtain multiple videos (i.e., original videos), and the acquisition rates of the cameras are consistent; that is, at any given time, each camera acquires one frame of the target scene, so the number of video frames contained in each original video is the same. For example, the capture rate of each camera may be 30 FPS (frames per second), i.e., each camera captures 30 frames of images per second; if each camera captures images for 2s, multiple original videos each containing 60 frames of images are obtained.
Correspondingly, the electronic equipment acquires each original video of the target scene under a plurality of visual angles, and acquires a plurality of video frames with the same timestamp from each original video to obtain a video frame group.
The timestamp of one video frame indicates the position of the video frame in the original video, for example, the frame rate of the original video is 25FPS, the duration of one video frame in the original video is 40ms, the timestamp of the 1 st video frame in the original video is 40ms, the timestamp of the 2 nd video frame is 80ms, the timestamp of the 3 rd video frame is 120ms, and so on.
For a plurality of video frames with the same timestamp, the position of each video frame in the original video is the same as the positions of other video frames in the original video.
Illustratively, each original video includes: original videos of a target scene under 3 visual angles, where the 3 videos are respectively: original video 1, original video 2 and original video 3, the 3 original videos each containing 50 video frames. The electronic equipment acquires a 1 st frame in an original video 1, a 1 st frame in an original video 2 and a 1 st frame in an original video 3 to obtain a 1 st video frame group; the electronic device obtains the 2 nd frame in the original video 1, the 2 nd frame in the original video 2, and the 2 nd frame in the original video 3 to obtain the 2 nd video frame group, and so on, to obtain 50 video frame groups.
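A minimal sketch of this grouping, assuming each original video has already been decoded into a list of frames and that synchronized capture gives all lists equal length (names are illustrative):

```python
def build_frame_groups(original_videos):
    # original_videos: one list of decoded frames per camera view.
    num_frames = len(original_videos[0])
    assert all(len(video) == num_frames for video in original_videos)
    # The k-th video frame group holds the k-th frame (same timestamp)
    # from every original video.
    return [[video[k] for video in original_videos] for k in range(num_frames)]
```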
For step S102, the joints of the target object include joints of human bones. Illustratively, referring to fig. 3 (a), each joint of the target object includes 25 joints from joint No. 0 to joint No. 24. In addition to the joint points of the human skeleton shown in fig. 3 (a), the joint points of the target object include joint points of a designated part of the human body. For example, 21 joints of one hand of the person shown in fig. 3 (b).
For each video frame in a video frame group, the electronic device may perform gesture recognition on the video frame to obtain two-dimensional coordinates of each joint point of the target object in the video frame. For example, the electronic device inputs the video frame into a pre-trained gesture recognition model and obtains the two-dimensional coordinates of each joint point of the target object in the video frame output by the gesture recognition model. The gesture recognition model may be the 2D (two-dimensional) pose recognition model provided by OpenPose.
Illustratively, referring to fig. 4 (a) and 4 (b), fig. 4 (a) and 4 (b) show 4 video frames in a video frame group, where the 4 video frames are images of a target scene at different viewing angles. Each video frame comprises 3 target objects, each rectangular area comprises one target object, the 4 video frames are input into a gesture recognition model, and two-dimensional coordinates of all joint points of a plurality of target objects in the 4 video frames are obtained.
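A hedged sketch of this recognition step; estimate_pose_2d is a hypothetical stand-in for a pre-trained 2D gesture recognition model (e.g., an OpenPose-style model), assumed to return, per frame, a list of detected target objects, each as an array of (u, v, confidence) rows with one row per joint point:

```python
def recognize_joints(frame_group, estimate_pose_2d):
    # frame_group: the video frames sharing one timestamp across views.
    # Returns, for each video frame, the detected target objects with
    # their joint-point 2D coordinates and confidences, which feed the
    # later triangulation and confidence-based correction steps.
    return [estimate_pose_2d(frame) for frame in frame_group]
```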
For step S103, for each video frame group, if each video frame in the video frame group contains a target object, the electronic device calculates a target three-dimensional coordinate of each joint point of the target object in the target scene based on the two-dimensional coordinates of each joint point of the target object in each video frame in the video frame group and each conversion relationship between the image coordinate system of each video frame in the video frame group and the three-dimensional coordinate system of the target scene. The target three-dimensional coordinates of each joint point of the target object corresponding to one video frame group are as follows: and when the time of each video frame in the video frame group is acquired, the three-dimensional coordinates of each joint point of the target object in the target scene are acquired.
And if each video frame in the video frame group comprises a plurality of target objects, the electronic equipment matches the plurality of target objects in each video frame to obtain the same target object in each video frame in the video frame group. Further, for each target object, the electronic device may determine target three-dimensional coordinates of joint points of the target object in the target scene corresponding to the video frame group.
Accordingly, in some embodiments, on the basis of fig. 1, referring to fig. 5, before step S103, the method may further include the steps of:
s105: the same target object in each video frame in the video frame group is determined based on the two-dimensional coordinates of the joint points of the plurality of target objects in each video frame in the video frame group.
For each video frame group, the electronic device may determine the same target object in the video frames in the video frame group in the following manner.
In the first manner:
and calculating the epipolar line distance from each joint point of the target object to the corresponding epipolar line plane for each target object, and calculating the mean value of the epipolar line distances corresponding to each joint point of the target object to obtain the mean value of the distances corresponding to the target object. And calculating the similarity of each two target objects in the two video frames based on the distance mean value corresponding to each two target objects in the two video frames to obtain a first similarity matrix.
And the epipolar line plane corresponding to one joint point represents the plane of the joint point in the target scene. One element in the first similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same.
For each video frame group, each video frame in the video frame group is an image of a target scene captured by multiple cameras at different viewing angles, that is, each video frame in the video frame group contains an image of a target object at different viewing angles.
According to the multi-view geometrical principle in computer vision, the same joint point of the same target object under different visual angles is positioned in the same epipolar plane in a target scene. Therefore, under different viewing angles, the same joint point of each target object corresponds to an epipolar plane, and the epipolar distance from the joint point to the epipolar plane indicates whether the joint point is located in the epipolar plane.
Accordingly, for each video frame in the set of video frames, the electronic device selects a target object from the video frame. And for each joint point of the target object, the electronic equipment calculates a straight line between the joint point and an optical center of a camera for collecting the video frame based on the two-dimensional coordinates of the joint point in the video frame and the internal reference and the external reference of the camera for collecting the video frame, and obtains a ray corresponding to the joint point.
Furthermore, the electronic device may calculate an intersection point of the rays corresponding to the joint point of each two target objects in the target scene, to obtain a plurality of intersection points. And then, according to the three-dimensional coordinates of each intersection point in the target scene, calculating to obtain a plane equation of the epipolar line plane corresponding to the joint point in the target scene.
For each video frame in the set of video frames, the electronic device selects a target object from the video frame. For each joint point of the target object, the electronic device may calculate an epipolar line distance from each joint point of the target object to an epipolar line plane corresponding to the joint point based on a two-dimensional coordinate of the joint point in the video frame, a transformation relationship between an image coordinate system of the video frame and a three-dimensional coordinate system of a target scene, and a plane equation of the epipolar line plane corresponding to the joint point in the target scene.
Then, the electronic device calculates the mean value of the epipolar line distances corresponding to the joint points of the target object to obtain the distance mean value corresponding to the target object. Then, for every two video frames in the video frame group, the sum of the distance mean values corresponding to every two target objects in the two video frames is calculated to obtain the similarity of the two target objects.
One element in the first similarity matrix represents a probability that the corresponding two target objects in the two video frames are the same. Accordingly, the electronic device determines the same target object in each video frame in the group of video frames based on the first similarity matrix.
For example, the electronic device may cluster a plurality of target objects in each video frame of the video frame group based on the first similarity matrix, resulting in the same target object in each video frame of the video frame group. Or the electronic device may calculate to obtain the same target object in each video frame in the video frame group based on the hungarian algorithm and the first similarity matrix.
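The following hedged sketch shows one way to realize this: the distance means are converted into a first similarity matrix and the matching is solved with the Hungarian algorithm via SciPy's linear_sum_assignment; the negative-exponential conversion from distance to similarity is an illustrative assumption, not a formula from the patent:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_objects_between_views(distance_means):
    # distance_means[i][j]: mean epipolar distance between target object i
    # in one video frame and target object j in the other video frame.
    # Smaller distance means the two detections are more likely the same
    # object; exp(-d) is one illustrative way to turn that into a similarity.
    first_similarity = np.exp(-np.asarray(distance_means, dtype=float))
    # The Hungarian algorithm minimizes total cost, so negate the
    # similarities to obtain a maximum-similarity assignment.
    rows, cols = linear_sum_assignment(-first_similarity)
    return list(zip(rows.tolist(), cols.tolist())), first_similarity
```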
In the second manner:
for every two video frames in the video frame group, the electronic device may input the two video frames to a pre-trained object matching model to obtain a similarity of every two target objects in the two video frames, so as to obtain a second similarity matrix.
The object matching model may be a network model based on ReID (Person Re-Identification) technology.
One element in the second similarity matrix represents the probability that the corresponding two target objects in the two video frames are the same. Accordingly, the electronic device determines the same target object in each video frame in the set of video frames based on the second similarity matrix.
The electronic device determines the same target object in each video frame of the video frame group based on the second similarity matrix, which is similar to the manner in which the electronic device determines the same target object in each video frame of the video frame group based on the first similarity matrix, and reference may be made to the related description of the foregoing embodiments.
In the third manner:
in order to improve the accuracy of the determined identical target objects in the video frames in the video frame group, on the basis of fig. 5, referring to fig. 6, step S105 may include the following steps:
s1051: and calculating the epipolar line distance from each joint point of the target object to the corresponding epipolar line plane for each target object, and calculating the mean value of the epipolar line distances corresponding to each joint point of the target object to obtain the mean value of the distances corresponding to the target object.
And the epipolar line plane corresponding to one joint point represents the plane of the joint point in the target scene.
S1052: and calculating the similarity of each two target objects in the two video frames based on the distance mean value corresponding to each two target objects in the two video frames to obtain a first similarity matrix.
Wherein one element in the first similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same.
S1053: and inputting the two video frames into a pre-trained object matching model to obtain the similarity of every two target objects in the two video frames and obtain a second similarity matrix.
Wherein one element in the second similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same.
S1054: and fusing the first similarity matrix and the second similarity matrix to obtain a target similarity matrix.
S1055: and determining the same target object in each video frame in the video frame group based on the target similarity matrix.
The manner in which the electronic device obtains the first similarity matrix and the second similarity matrix may refer to the related description of the foregoing embodiments.
After the first similarity matrix and the second similarity matrix are obtained, the electronic device may fuse the first similarity matrix and the second similarity matrix to obtain a target similarity matrix.
In one implementation, the electronic device may calculate a weighted sum of each element in the first similarity matrix and a corresponding element in the second similarity matrix to obtain a target similarity matrix.
In another implementation manner, the electronic device may fuse each element in the first similarity matrix with a corresponding element in the second similarity matrix according to the following formula (1) to obtain a target similarity matrix.
A_{i,j} = w_1 · a_{i,j} + w_2 · b_{i,j}    (1)

wherein A_{i,j} represents the element of the ith row and the jth column in the target similarity matrix; a_{i,j} represents the element of the ith row and the jth column in the first similarity matrix; b_{i,j} represents the element of the ith row and the jth column in the second similarity matrix; w_1 represents the weight of the element of the ith row and the jth column in the first similarity matrix; and w_2 represents the weight of the element of the ith row and the jth column in the second similarity matrix.
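A minimal sketch of this element-wise fusion (the weights w1 and w2 are free parameters chosen by the practitioner; the function name is hypothetical):

```python
import numpy as np

def fuse_similarity(a: np.ndarray, b: np.ndarray,
                    w1: float = 0.5, w2: float = 0.5) -> np.ndarray:
    """Fuse the epipolar-distance similarity matrix `a` and the ReID-based
    similarity matrix `b` element-wise into the target similarity matrix."""
    assert a.shape == b.shape, "both matrices must compare the same objects"
    return w1 * a + w2 * b
```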
Exemplarily, referring to fig. 7, fig. 7 is a schematic diagram of a target similarity matrix according to an embodiment of the present invention. In the target similarity matrix shown in fig. 7, the horizontal direction corresponds to 11 target objects and the vertical direction corresponds to 11 target objects; the rectangular region corresponding to each pair of target objects represents the similarity between the two target objects, and the darker the color of the rectangular region, the higher the similarity between the two target objects. Further, the electronic device determines the same target object in each video frame in the video frame group based on the target similarity matrix.
The electronic device determines the same target object in each video frame of the video frame group based on the target similarity matrix, which is similar to the manner in which the electronic device determines the same target object in each video frame of the video frame group based on the first similarity matrix, and reference may be made to the related description of the foregoing embodiments.
In some embodiments, on the basis of fig. 1, referring to fig. 8, step S103 may include the steps of:
s1031: for each joint point of the target object, calculating a three-dimensional coordinate of the joint point in the target scene as a three-dimensional coordinate to be processed based on two-dimensional coordinates of the joint point in each two video frames in the video frame group and each conversion relation between image coordinate systems of the two video frames and a three-dimensional coordinate system of the target scene; and calculating the average value of all three-dimensional coordinates to be processed as the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
S1032: and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
For each joint point of the target object, the electronic device calculates the three-dimensional coordinates to be processed of the joint point in the target scene based on the two-dimensional coordinates of the joint point in each two video frames in the video frame group and the following formula (2).
s · [u_m, v_m, 1]^T = K [R | t] · [X_m, Y_m, Z_m, 1]^T    (2)

wherein (u_m, v_m) represents the two-dimensional coordinates of the mth joint point of the target object in a video frame, s represents a scale factor, K represents the internal parameters of the camera that captured the video frame, [R | t] represents the external parameters of the camera that captured the video frame, and (X_m, Y_m, Z_m) represents the to-be-processed three-dimensional coordinates of the mth joint point of the target object in the target scene.
The internal parameters and external parameters of the camera that captured the video frame represent the conversion relationship between the image coordinate system of the video frame and the three-dimensional coordinate system of the target scene.
The electronic device respectively takes the two-dimensional coordinates of the joint point of the target object in each of the two video frames in the video frame group as (u_m, v_m) in the above formula (2), and takes the internal parameters and external parameters of the cameras that captured the two video frames as K and [R | t] in formula (2), so that two relations in (X_m, Y_m, Z_m) can be obtained. Solving the resulting equation set yields the to-be-processed three-dimensional coordinates of the mth joint point in the target scene.
Based on the two-dimensional coordinates of the joint point in each pair of video frames in the video frame group, a plurality of to-be-processed three-dimensional coordinates of the joint point can be obtained (one per pair of video frames); the electronic device then calculates the mean value of these to-be-processed three-dimensional coordinates to obtain the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
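As a non-limiting sketch, the two relations obtained from formula (2) for a pair of views can be solved as a linear least-squares (direct linear transformation) problem; the code below assumes each camera is summarized by its 3×4 projection matrix P = K[R | t]:

```python
import numpy as np

def triangulate_joint(p1: np.ndarray, p2: np.ndarray, uv1, uv2) -> np.ndarray:
    """Solve formula (2) for one joint point observed in two video frames.

    p1, p2: 3x4 projection matrices K[R | t] of the two cameras.
    uv1, uv2: two-dimensional coordinates (u_m, v_m) of the joint point.
    Returns the to-be-processed three-dimensional coordinates (X_m, Y_m, Z_m).
    """
    u1, v1 = uv1
    u2, v2 = uv2
    # Each view contributes two linear constraints on the homogeneous 3D point.
    a = np.stack([
        u1 * p1[2] - p1[0],
        v1 * p1[2] - p1[1],
        u2 * p2[2] - p2[0],
        v2 * p2[2] - p2[1],
    ])
    # Least-squares solution: right singular vector of the smallest singular value.
    _, _, vt = np.linalg.svd(a)
    x = vt[-1]
    return x[:3] / x[3]
```

Averaging the solutions over all pairs of video frames in the video frame group then yields the initial three-dimensional coordinates described in step S1031.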
In one implementation, for each joint point of the target object, the electronic device may directly use the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
In another implementation, on the basis of fig. 8, referring to fig. 9, step S102 may include the following steps:
s1021: and inputting the video frame into a pre-trained gesture recognition model aiming at each video frame in the video frame group to obtain two-dimensional coordinates and corresponding confidence degrees of all joint points of the target object in the video frame.
Wherein, the confidence corresponding to the two-dimensional coordinates of a joint point represents: a probability that the joint point is located at the position represented by the two-dimensional coordinates in the video frame.
Accordingly, step S1032 may include the steps of:
s10321: and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the confidence degree corresponding to the two-dimensional coordinates of the joint point in each video frame in the video frame group and the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
And for each video frame in the video frame group, the electronic equipment inputs the video frame into a pre-trained gesture recognition model to obtain two-dimensional coordinates and corresponding confidence coefficients of all joint points of the target object in the video frame. Confidence corresponding to the two-dimensional coordinates of one joint point represents: a probability that the joint point is located at the position represented by the two-dimensional coordinates in the video frame.
If the confidence corresponding to the two-dimensional coordinates of a joint point of the target object in a video frame is low, the accuracy of those two-dimensional coordinates is low, and consequently the accuracy of the target three-dimensional coordinates of the joint point determined based on those two-dimensional coordinates is also low.
In order to improve the accuracy of the target three-dimensional coordinates of the joint point of the determined target object, the electronic device may calculate the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene in the following manner.
In manner 1,
and calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the joint point in each video frame in the video frame group as a third confidence degree aiming at each joint point of the target object. If the third confidence is not less than the preset threshold, it indicates that the accuracy of the initial three-dimensional coordinate of the joint point corresponding to the video frame group is higher, and the electronic device may directly use the initial three-dimensional coordinate of the joint point corresponding to the video frame group as the target three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene.
The preset threshold may be set by a technician according to experience, for example, the preset threshold may be 0.2, or the preset threshold may be 0.3, but is not limited thereto.
If the third confidence is smaller than the preset threshold, it indicates that the accuracy of the initial three-dimensional coordinates of the joint point corresponding to the video frame group is low. In this case, the electronic device may calculate the mean value of the target three-dimensional coordinates of the joint point corresponding to the first video frame group and the target three-dimensional coordinates of the joint point corresponding to the second video frame group, as the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
The first video frame group includes: the video frames in each original video that are located before the video frames in the video frame group and occupy the same position in their respective original videos. That is, for each video frame in the video frame group, the video frame in the first video frame group that belongs to the same original video as the video frame is located before that video frame in the original video to which it belongs. Also, the timestamps of the video frames in the first video frame group are the same.

The second video frame group includes: the video frames in each original video that are located after the video frames in the video frame group and occupy the same position in their respective original videos. That is, for each video frame in the video frame group, the video frame in the second video frame group that belongs to the same original video as the video frame is located after that video frame in the original video to which it belongs. Also, the timestamps of the video frames in the second video frame group are the same.
Illustratively, each video frame group includes: a video frame group 1 containing each 1 st frame in each original video, a video frame group 2 containing each 2 nd frame in each original video, a video frame group 3 containing each 3 rd frame in each original video, a video frame group 4 containing each 4 th frame in each original video, and a video frame group 5 containing each 5 th frame in each original video.
For each joint point of the target object, when the third confidence corresponding to the joint point corresponding to the video frame group 3 is smaller than a preset threshold, if the target three-dimensional coordinate of the joint point corresponding to the video frame group 2 is determined, the video frame group 2 is determined to be the first video frame group, and if the target three-dimensional coordinate of the joint point corresponding to the video frame group 2 is not determined, and the target three-dimensional coordinate of the joint point corresponding to the video frame group 1 is determined, the video frame group 1 is determined to be the first video frame group.
If the target three-dimensional coordinate of the joint point corresponding to the video frame group 4 is determined, the video frame group 4 is determined to be a second video frame group, and if the target three-dimensional coordinate of the joint point corresponding to the video frame group 4 is not determined, and the target three-dimensional coordinate of the joint point corresponding to the video frame group 5 is determined, the video frame group 5 is determined to be a second video frame group.
And further, calculating the mean value of the target three-dimensional coordinates of the joint point corresponding to the first video frame group and the target three-dimensional coordinates of the joint point corresponding to the second video frame group, to obtain the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
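A hedged sketch of manner 1 (the names and the threshold value are assumptions): for each joint point, the initial three-dimensional coordinates are kept when the third confidence reaches the preset threshold; otherwise the mean of the target coordinates of the first and second video frame groups is used:

```python
import numpy as np

PRESET_THRESHOLD = 0.2  # assumed value; set empirically

def target_coord_manner1(initial_xyz, third_confidence, prev_xyz, next_xyz):
    """initial_xyz: initial 3D coordinates of the joint for this video frame group.
    third_confidence: mean 2D confidence of the joint over the group's frames.
    prev_xyz / next_xyz: target 3D coordinates of the same joint for the first
    (earlier) and second (later) video frame groups.
    """
    if third_confidence >= PRESET_THRESHOLD:
        return np.asarray(initial_xyz)
    # Low confidence: interpolate from the neighbouring groups' coordinates.
    return (np.asarray(prev_xyz) + np.asarray(next_xyz)) / 2.0
```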
In some embodiments, if each video frame in each video frame group includes a plurality of target objects, for each two target objects in two adjacent video frame groups, the electronic device may determine, based on the initial three-dimensional coordinates of the joint points of the two target objects corresponding to the two adjacent video frame groups, whether the two target objects are the same target object, so as to implement object tracking between different video frame groups.
The electronic device may calculate the difference values between the initial three-dimensional coordinates of corresponding joint points of the two target objects in the two adjacent video frame groups, and calculate the mean value of these difference values to obtain the joint point error between the two target objects. Accordingly, for each target object, the target object in the adjacent video frame group having the smallest joint point error relative to it is determined to be the same target object.
Illustratively, video frame group 1 corresponds to object 1, object 2, object 3, and video frame group 2 corresponds to object a, object B, and object C. The electronic device calculates a difference value between the initial three-dimensional coordinate of the 1 st joint of the object 1 corresponding to the video frame group 1 and the initial three-dimensional coordinate of the 1 st joint of the object a corresponding to the video frame group 2, then calculates a difference value between the initial three-dimensional coordinate of the 2 nd joint of the object 1 corresponding to the video frame group 1 and the initial three-dimensional coordinate of the 2 nd joint of the object a corresponding to the video frame group 2, and so on, and a plurality of difference values can be obtained.
Further, the average value of the difference values is calculated to obtain the joint point error between the object 1 and the object a. Similarly, the joint point error between the object 1 and the object B and the joint point error between the object 1 and the object C can be calculated, and further, if the joint point error between the object a and the object 1 is the smallest among the object a, the object B, and the object C, it can be determined that the object 1 and the object a are the same object.
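For illustration, one concrete reading of the joint point error (taking the per-joint difference value as the Euclidean distance between corresponding initial three-dimensional coordinates, which is an assumption) and the resulting tracking between adjacent video frame groups:

```python
import numpy as np

def joint_point_error(obj_a: np.ndarray, obj_b: np.ndarray) -> float:
    """Mean distance between corresponding joint points of two target objects,
    each given as a (num_joints, 3) array of initial 3D coordinates."""
    return float(np.linalg.norm(obj_a - obj_b, axis=1).mean())

def track_between_groups(group1, group2) -> dict:
    """For each object in group1, pick the object in group2 with the smallest
    joint point error. Returns {index_in_group1: index_in_group2}."""
    return {
        i: int(np.argmin([joint_point_error(a, b) for b in group2]))
        for i, a in enumerate(group1)
    }
```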
In manner 2,
if the difference between the target three-dimensional coordinates of the joint point corresponding to the video frame group, the target three-dimensional coordinates of the joint point corresponding to the first video frame group and the target three-dimensional coordinates of the joint point corresponding to the second video frame group is large, calculating the mean value of the target three-dimensional coordinates of the joint point corresponding to the first video frame group and the target three-dimensional coordinates of the joint point corresponding to the second video frame group, and obtaining the target three-dimensional coordinates of the joint point corresponding to the video frame group with low accuracy.
In order to improve the accuracy of the determined target three-dimensional coordinates of the joint point corresponding to the video frame group, on the basis of fig. 9, referring to fig. 10, step S10321 may include the following steps:
s103211: one joint point is selected from the joint points of the target object as a parent joint point.
S103212: and calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as a first mean confidence degree aiming at each child joint point of the parent joint point.
Wherein, the child joint of the father joint comprises: and the joint point connected with the parent joint point in the joint points of the target object.
S103213: and if the first average confidence coefficient is smaller than a preset threshold value, calculating an offset value of the target three-dimensional coordinate of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the first video frame group as a first offset value.
Wherein the first video frame group includes: the video frames in each original video that are located before the video frames in the video frame group and occupy the same position in their respective original videos; the target three-dimensional coordinates of a joint point corresponding to one video frame group are determined based on the two-dimensional coordinates of the joint point in each video frame in that video frame group.
S103214: and calculating an offset value of the target three-dimensional coordinate of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the second video frame group as a second offset value.
Wherein the second video frame group includes: the video frames in each original video that are located after the video frames in the video frame group and occupy the same position in their respective original videos.
S103215: and calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group.
S103216: and if the first average confidence coefficient is not less than the preset threshold value, acquiring the initial three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene to obtain the target three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene.
In one implementation, the electronic device may select any one of the joint points of the target object as a current parent joint point, and determine each joint point connected to the current parent joint point to obtain a child joint point of the current parent joint point.
For example, for the embodiment of fig. 3 (a), the electronic device may select joint point No. 1 as the current parent joint point, and the child joint points of the current parent joint point include: joint point No. 2, joint point No. 0, joint point No. 5, and joint point No. 8. Alternatively, the electronic device may select joint point No. 0 as the current parent joint point, and the child joint points of the current parent joint point include: joint point No. 1, joint point No. 15, and joint point No. 16.
In another implementation, before step S103211, the method may further include the steps of:
step 1, determining a designated central joint point from all joint points of a target object.
And 2, calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the central joint point in each video frame in the video frame group, as a second average confidence.
And 3, if the second average confidence is smaller than a preset threshold, calculating the mean value of the target three-dimensional coordinates of the central joint point corresponding to the third video frame group and the target three-dimensional coordinates of the central joint point corresponding to the fourth video frame group, to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene.
Wherein the third video frame group includes: the video frames in each original video that are located before the video frames contained in the video frame group and occupy the same position in their respective original videos; the fourth video frame group includes: the video frames in each original video that are located after the video frames contained in the video frame group and occupy the same position in their respective original videos.
And 4, if the second average confidence is not less than the preset threshold, acquiring the initial three-dimensional coordinates of the central joint point corresponding to the video frame group as the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene.
Accordingly, step S103211 may comprise the steps of: from among the joint points of the target object, a center joint point is determined as a parent joint point.
The third video frame group includes: the video frames in each original video that are located before the video frames included in the video frame group and occupy the same position in their respective original videos. That is, for each video frame in the video frame group, the video frame in the third video frame group that belongs to the same original video as the video frame is located before that video frame in the original video to which it belongs. Also, the timestamps of the video frames in the third video frame group are the same.

The fourth video frame group includes: the video frames in each original video that are located after the video frames included in the video frame group and occupy the same position in their respective original videos. That is, for each video frame in the video frame group, the video frame in the fourth video frame group that belongs to the same original video as the video frame is located after that video frame in the original video to which it belongs. Also, the timestamps of the video frames in the fourth video frame group are the same.
The electronic device determines a designated center joint from among the joints of the target object, e.g., for the embodiment of fig. 3 (a), the electronic device may determine a hip joint (i.e., joint No. 8) as the center joint. And then, the electronic equipment calculates the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the central joint point in each video frame in the video frame group to obtain a second mean confidence degree. The second average confidence may represent a probability that the central joint point is located at the location of the initial three-dimensional coordinate representation in the target scene.
If the second average confidence is smaller than the preset threshold, it indicates that the accuracy of the initial three-dimensional coordinates of the central joint point is low. In this case, the electronic device obtains the target three-dimensional coordinates of the central joint point corresponding to the third video frame group and the target three-dimensional coordinates of the central joint point corresponding to the fourth video frame group, and calculates the mean value of the two to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene.
If the second average confidence is not less than the preset threshold, indicating that the accuracy of the initial three-dimensional coordinates of the central joint point is higher, the electronic device may directly acquire the initial three-dimensional coordinates of the central joint point corresponding to the video frame group as the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene.
Accordingly, the electronic device may select a central joint point from the joint points of the target object as a current parent joint point, and determine each joint point connected to the current parent joint point to obtain a child joint point of the current parent joint point.
And aiming at each sub-joint point of the current parent joint point, the electronic equipment acquires the confidence degrees corresponding to the two-dimensional coordinates of the sub-joint point in each video frame in the video frame group, and calculates the average value of the acquired confidence degrees to obtain a first average confidence degree. The first average confidence may represent a probability that the joint point is located at the location of the initial three-dimensional coordinate representation in the target scene.
If the first average confidence is smaller than the preset threshold, the accuracy of the initial three-dimensional coordinates of the joint point corresponding to the video frame group is low, the electronic device calculates the difference value between the target three-dimensional coordinates of the child joint point corresponding to the first video frame group and the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group, and obtains the offset value of the target three-dimensional coordinates of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group as a first offset value.
And the electronic equipment calculates the difference value between the target three-dimensional coordinates of the child joint point corresponding to the second video frame group and the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group to obtain the offset value of the target three-dimensional coordinates of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group, and the offset value is used as a second offset value.
In one implementation, step S103215 may include the steps of: calculating the mean value of the first offset value and the second offset value as an average offset value; and calculating the sum of the average offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
In another implementation, the electronic device may calculate a weighted sum of the first offset value and the second offset value, and calculate the sum of this weighted sum and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
If the first average confidence is not less than the preset threshold, indicating that the accuracy of the initial three-dimensional coordinates of the joint point corresponding to the video frame group is higher, the electronic device may directly use the initial three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene.
In some embodiments, after step S103215, the method may further comprise the steps of:
and taking the child joint point of the target object as a parent joint point, returning and executing each child joint point aiming at the parent joint point, calculating the average value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as a first average confidence degree until the target three-dimensional coordinates of each joint point of the target object corresponding to the video frame group in the target scene are obtained.
After the target three-dimensional coordinates of the child joint point of the current parent joint point corresponding to the video frame group in the target scene are obtained through calculation, for each child joint point of the current parent joint point, the electronic device may use the child joint point as the current parent joint point, calculate again an average value of confidence degrees corresponding to the two-dimensional coordinates of each child joint point of the current parent joint point in each video frame in the video frame group as a first average confidence degree, calculate the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene according to the first average confidence degree, and so on until the target three-dimensional coordinates of each joint point of the target object corresponding to the video frame group in the target scene are obtained.
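The recursion of steps S103211 to S103216 can be pictured as a breadth-first traversal of the skeleton starting at the central joint point; the sketch below (data layout, names, and threshold value are assumptions) resolves each child joint point once its parent joint point is known:

```python
import numpy as np
from collections import deque

PRESET_THRESHOLD = 0.2  # assumed value; set empirically

def fill_child_joint(child, parent, initial, confidence, target,
                     prev_target, next_target):
    """Resolve the target 3D coordinates of `child` once `parent` is resolved.

    initial, target, prev_target, next_target: dict joint -> np.ndarray(3,)
    for this group and the first (earlier) / second (later) video frame groups;
    confidence: dict joint -> first average confidence over the group's frames.
    """
    if confidence[child] >= PRESET_THRESHOLD:
        target[child] = initial[child]          # S103216: keep the initial coords
        return
    # S103213 / S103214: offsets of the child relative to the parent in the
    # first and second video frame groups.
    first_offset = prev_target[child] - prev_target[parent]
    second_offset = next_target[child] - next_target[parent]
    # S103215: apply the average offset to this group's parent coordinates.
    target[child] = target[parent] + (first_offset + second_offset) / 2.0

def propagate_from_center(skeleton, center, initial, confidence, target,
                          prev_target, next_target):
    """Breadth-first traversal: each resolved joint point becomes the parent
    joint point of its connected, not-yet-resolved child joint points.
    Assumes target[center] has already been set as described in steps 1 to 4."""
    queue, seen = deque([center]), {center}
    while queue:
        parent = queue.popleft()
        for child in skeleton[parent]:          # joint points connected to parent
            if child not in seen:
                fill_child_joint(child, parent, initial, confidence,
                                 target, prev_target, next_target)
                seen.add(child)
                queue.append(child)
```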
Exemplarily, referring to fig. 11 (a), fig. 11 (a) is a schematic diagram of the joint points of a target object according to an embodiment of the present invention. The images of fig. 11 (a) are, from left to right, schematic diagrams of the joint points of the target object corresponding to successive video frame groups. In the 2nd, 3rd, and 4th images from left to right in fig. 11 (a), some target three-dimensional coordinates of the joint points of the target object are absent; these target three-dimensional coordinates can be determined according to the method provided by the embodiment of the present invention, yielding the schematic diagram of the joint points of the target object shown in fig. 11 (d).
For example, for the image in the middle of fig. 11 (a), a current parent joint point may be selected from the joint points of the target object, and the target three-dimensional coordinates of the child joint points of the current parent joint point are determined according to the method provided in the embodiment of the present invention, so as to obtain the schematic diagram of the joint points of the target object as shown in fig. 11 (b).
Then, the child joint point of the current parent joint point is taken as the current parent joint point, and the target three-dimensional coordinates of each child joint point of the current parent joint point are determined according to the method provided by the embodiment of the present invention, so as to obtain the schematic diagram of the joint point of the target object as shown in fig. 11 (c), and so on until the target three-dimensional coordinates of each joint point of the target object are obtained, so as to obtain the schematic diagram of the joint point of the target object as shown in fig. 11 (d).
In some embodiments, after the target three-dimensional coordinates of the joint points of the target object in the target scene corresponding to each video frame group are calculated, the electronic device may further perform smoothing processing on the determined target three-dimensional coordinates of the joint points of the target object by using Savitzky-Golay (Savgol) filtering, so as to improve the accuracy of the determined target three-dimensional coordinates.
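For example, a sketch of such smoothing with SciPy's Savitzky-Golay filter (the window length and polynomial order are assumed values to be tuned in practice):

```python
import numpy as np
from scipy.signal import savgol_filter

def smooth_trajectories(coords: np.ndarray) -> np.ndarray:
    """Smooth per-joint coordinate tracks over time.

    coords: array of shape (num_groups, num_joints, 3) holding the target
    three-dimensional coordinates of each joint point per video frame group.
    Requires num_groups >= window_length.
    """
    return savgol_filter(coords, window_length=9, polyorder=2, axis=0)
```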
Based on the above processing, the first offset value is the offset value of the target three-dimensional coordinates of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group, and the second offset value is the offset value of the target three-dimensional coordinates of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group. Since the parent joint point of the target object is connected to the child joint point, the offset value of the child joint point relative to the parent joint point is relatively fixed. Correspondingly, calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value, and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group can improve the accuracy of the calculated target three-dimensional coordinates.
In step S104, a preset image is stored in the electronic device, and the preset image includes a virtual object that does not make any motion. After the target three-dimensional coordinates of the joint points of the target object corresponding to each video frame group in the target scene are obtained through calculation, the electronic device may adjust the three-dimensional coordinates of the joint points of the virtual object in the preset image according to the target three-dimensional coordinates of the joint points of the target object corresponding to each video frame group, so as to obtain a target video including the virtual object having the same motion as the target object.
In some embodiments, after obtaining the target three-dimensional coordinates of each joint point of the target object in the target scene corresponding to each video frame group through calculation, the electronic device may further generate a 3D (three-dimensional) person mesh model in the SMPLX (Skinned Multi-Person Linear eXpressive) format based on the target three-dimensional coordinates of the joint points of the target object. Then, the electronic device can convert the 3D person mesh model into a file in the bvh (a universal human body feature animation file) format with the blender tool and save the generated file.
Referring to fig. 12, fig. 12 is a flowchart of a method for determining target three-dimensional coordinates of a target object according to an embodiment of the present invention.
S1201: and calibrating the multiple cameras.
The electronic equipment calibrates a plurality of cameras at different positions in a target scene to determine internal parameters and external parameters of the plurality of cameras, and acquires original videos of the target scene at different viewing angles through the plurality of cameras.
S1202: 2D (two-dimensional) gesture recognition.
The electronic equipment acquires a plurality of video frames with the same timestamp from each original video of a target scene under a plurality of visual angles to obtain a plurality of video frame groups. And aiming at each video frame group, the electronic equipment carries out 2D gesture recognition on each video frame in the video frame group to obtain the two-dimensional coordinates of each joint point of the target object in the video frame.
S1203: and selecting the multi-view person.
For each video frame group, if each video frame in the video frame group contains a plurality of target objects, the electronic device performs multi-view character selection, that is, the electronic device matches the plurality of target objects in each video frame to obtain the same target object in each video frame in the video frame group.
S1204: 3D (three-dimensional) pose generation.
For each target object, the electronic device performs 3D pose generation on the target object, that is, based on two-dimensional coordinates of each joint point of the target object in each video frame in the video frame group and each conversion relationship between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of a target scene, initial three-dimensional coordinates of each joint point of the target object in the target scene corresponding to the video frame group are calculated.
S1205: and (5) frame supplementing and smoothing.
After determining the initial three-dimensional coordinates of each joint point of the target object corresponding to each video frame group in the target scene, the electronic equipment determines the target three-dimensional coordinates of each joint point of the target object based on the initial three-dimensional coordinates of each joint point of the target object. And then, the electronic equipment performs smoothing processing on the determined target three-dimensional coordinates of each joint point of the target object by using savgol filtering to obtain the final target three-dimensional coordinates of each joint point of the target object.
S1206: and (5) saving the action.
After determining the target three-dimensional coordinates of each joint point of the target object corresponding to each video frame group in the target scene, the electronic device may further generate a 3D character mesh model in the SMPLX format based on the target three-dimensional coordinates of each joint point of the target object. Then, the electronic device can convert the 3D character mesh model into a file in bvh format by a blender tool, and save the generated file.
Based on the above processing, the target three-dimensional coordinates of each joint point of the target object in the target scene can be determined by the two-dimensional coordinates of the target object in each video frame under multiple viewing angles and each conversion relationship between the image coordinate system of each video frame and the three-dimensional coordinate system of the target scene. The target three-dimensional coordinates of the joints of a target object in the target scene may represent the three-dimensional pose of the target object, i.e. the three-dimensional pose of the target object may be determined without the target object wearing a device for motion capture. Furthermore, the target video is generated based on the three-dimensional posture of the target object, so that the time cost and the labor cost for generating the video can be reduced, and the video generation efficiency is improved.
Referring to fig. 13, fig. 13 is a flowchart of another method for determining target three-dimensional coordinates of a target object according to an embodiment of the present invention.
S1301: and (4) average interpolation of the central nodes.
The electronic device performs central joint point average interpolation; that is, the electronic device determines a specified central joint point from the joint points of the target object, and, for each video frame group, obtains the second average confidence corresponding to the central joint point corresponding to the video frame group. If the second average confidence is smaller than a preset threshold, the electronic device calculates the mean value of the target three-dimensional coordinates of the central joint point corresponding to the third video frame group and the target three-dimensional coordinates of the central joint point corresponding to the fourth video frame group, to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene. If the second average confidence is not less than the preset threshold, the electronic device obtains the initial three-dimensional coordinates of the central joint point corresponding to the video frame group as the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene.
S1302: and calculating the relative position offset of the unknown node and the parent node.
The electronic device determines a central joint point as a current parent joint point from among the joint points of the target object. The unknown node is a child joint point of the current parent joint point, and for each child joint point of the current parent joint point, the electronic device calculates a first average confidence corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group. And if the first average confidence coefficient is smaller than a preset threshold value, calculating a first offset value of the target three-dimensional coordinate of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the first video frame group, and calculating a second offset value of the target three-dimensional coordinate of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinate of the parent joint point corresponding to the second video frame group.
S1303: and calculating the current node position.
And the electronic device calculates the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group.
Then, the electronic device takes the current node as a parent joint point, and continues to calculate the relationship between the child joint point and the current node, that is, the electronic device takes the child joint point of the target object as the current parent joint point, and calculates the first average confidence degree corresponding to the two-dimensional coordinates of each child joint point of the current parent joint point in each video frame in the video frame group again, and calculates the target three-dimensional coordinates of each child joint point of the current parent joint point based on the current first average confidence degree until the target three-dimensional coordinates of each joint point of the target object corresponding to the video frame group in the target scene are obtained.
S1304: savgol (Savgol) filtering.
After the target three-dimensional coordinates of the joint points of the target object corresponding to each video frame group in the target scene are obtained through calculation, the electronic equipment performs smoothing processing on the target three-dimensional coordinates of the joint points of the determined target object by means of Savgol filtering, so that the accuracy of the determined target three-dimensional coordinates is improved.
S1305: and calculating a result.
And the electronic equipment performs smoothing processing on the target three-dimensional coordinates of each joint point of the determined target object by using Savgol filtering to obtain a final calculation result, wherein the calculation result is the target three-dimensional coordinates of each joint point of the target object corresponding to each video frame group.
Based on the above processing, the first offset value is the offset value of the target three-dimensional coordinates of the child joint point corresponding to the first video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group, and the second offset value is the offset value of the target three-dimensional coordinates of the child joint point corresponding to the second video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group. Since the parent joint point of the target object is connected to the child joint point, the offset value of the child joint point relative to the parent joint point is relatively fixed. Correspondingly, calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value, and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group can improve the accuracy of the calculated target three-dimensional coordinates.
Corresponding to the embodiment of the method in fig. 1, referring to fig. 14, fig. 14 is a structural diagram of an image generating apparatus according to an embodiment of the present invention, where the apparatus includes:
an obtaining module 1401, configured to obtain original videos of a target scene at multiple viewing angles, and obtain multiple video frames with the same timestamp from each original video, respectively, to obtain a video frame group; wherein the timestamp of one video frame represents the position of the video frame in the original video;
a recognition module 1402, configured to perform gesture recognition on each video frame in the video frame group to obtain the two-dimensional coordinates of each joint point of the target object in the video frame;
a first determining module 1403, configured to calculate, for each joint point of the target object, three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relationship between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene;
a generating module 1404, configured to adjust the three-dimensional coordinates of each joint point of the virtual object in each preset image according to the target three-dimensional coordinates of each joint point of the target object in the target scene corresponding to each video frame group, to obtain a target video including a virtual object having the same motion as the target object.
Optionally, the apparatus further comprises:
a matching module, configured to, before the first determining module 1403 calculates, for each joint point of the target object, the three-dimensional coordinates of the joint point in the target scene corresponding to the video frame group as target three-dimensional coordinates based on the two-dimensional coordinates of the joint point in each video frame in the video frame group and the conversion relationships between the image coordinate system of each video frame in the video frame group and the three-dimensional coordinate system of the target scene, match the multiple target objects based on the two-dimensional coordinates of the joint points of the multiple target objects in each video frame in the video frame group, to determine the same target object in each video frame in the video frame group.
Optionally, the matching module is specifically configured to calculate, for each target object, an epipolar line distance from each joint point of the target object to the corresponding epipolar line plane, and calculate a mean value of each epipolar line distance corresponding to each joint point of the target object, to obtain a distance mean value corresponding to the target object; wherein, an epipolar line plane corresponding to one joint point represents a plane where the joint point is located in the target scene;
calculating the similarity of two target objects in the two video frames based on the distance mean value corresponding to each two target objects in the two video frames to obtain a first similarity matrix; wherein one element in the first similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
inputting the two video frames into a pre-trained object matching model to obtain the similarity of every two target objects in the two video frames and obtain a second similarity matrix; wherein one element in the second similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
fusing the first similarity matrix and the second similarity matrix to obtain a target similarity matrix;
and determining the same target object in each video frame in the video frame group based on the target similarity matrix.
Optionally, the first determining module 1403 is specifically configured to, for each joint point of the target object, calculate, based on two-dimensional coordinates of the joint point in each two video frames in the video frame group and each conversion relationship between the image coordinate systems of the two video frames and the three-dimensional coordinate system of the target scene, a three-dimensional coordinate of the joint point in the target scene as a to-be-processed three-dimensional coordinate; calculating the average value of each three-dimensional coordinate to be processed as the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene;
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
Optionally, the recognition module 1402 is specifically configured to, for each video frame in the video frame group, input the video frame into a pre-trained gesture recognition model to obtain two-dimensional coordinates and a corresponding confidence of each joint point of the target object in the video frame; wherein, the confidence corresponding to the two-dimensional coordinates of a joint point represents: a probability that the joint point is located at a position represented by the two-dimensional coordinates in the video frame;
the first determining module 1403 is specifically configured to calculate a target three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene based on the confidence degree corresponding to the two-dimensional coordinate of the joint point in each video frame in the video frame group and the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene.
Optionally, the first determining module 1403 is specifically configured to select one joint from the joints of the target object as a parent joint;
calculating, for each child joint point of the parent joint point, the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group, as a first average confidence; wherein the child joint points of the parent joint point include: the joint points connected with the parent joint point among the joint points of the target object;
if the first average confidence is smaller than a preset threshold, calculating an offset value of the target three-dimensional coordinates of the child joint point corresponding to a first video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group, as a first offset value; wherein the first video frame group includes: the video frames in each original video that are located before the video frames in the video frame group and occupy the same position in their respective original videos; the target three-dimensional coordinates of a joint point corresponding to one video frame group are determined based on the two-dimensional coordinates of the joint point in each video frame in that video frame group;
calculating an offset value of the target three-dimensional coordinates of the child joint point corresponding to a second video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group, as a second offset value; wherein the second video frame group includes: the video frames in each original video that are located after the video frames in the video frame group and occupy the same position in their respective original videos;
calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group;
and if the first average confidence coefficient is not less than the preset threshold value, acquiring the initial three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene to obtain the target three-dimensional coordinates of the sub-joint point corresponding to the video frame group in the target scene.
Optionally, the apparatus further comprises:
a processing module, configured to, after the first determining module 1403 calculates the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value, and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, take the child joint point of the target object as the parent joint point and trigger the first determining module to execute again, for each child joint point of the parent joint point, the step of calculating the mean value of the confidence degrees corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as the first average confidence, until the target three-dimensional coordinates of each joint point of the target object corresponding to the video frame group in the target scene are obtained.
Optionally, the first determining module 1403 is specifically configured to calculate a mean value of the first offset value and the second offset value as an average offset value;
and calculate the sum of the average offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group, to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
Optionally, the apparatus further comprises:
a second determination module, configured to determine a specified central joint point from the joint points of the target object before the first determination module 1403 performs selecting one joint point from the joint points of the target object as a parent joint point;
a third determining module, configured to calculate an average value of confidence degrees corresponding to two-dimensional coordinates of the central joint point in each video frame in the video frame group, as a second average confidence degree;
a fourth determining module, configured to, if the second average confidence is smaller than a preset threshold, calculate the mean value of the target three-dimensional coordinates of the central joint point corresponding to a third video frame group and the target three-dimensional coordinates of the central joint point corresponding to a fourth video frame group, to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene; wherein the third video frame group includes: the video frames in each original video that are located before the video frames contained in the video frame group and occupy the same position in their respective original videos; the fourth video frame group includes: the video frames in each original video that are located after the video frames contained in the video frame group and occupy the same position in their respective original videos;
a fifth determining module, configured to, if the second average confidence is not smaller than the preset threshold, obtain an initial three-dimensional coordinate of the central joint point corresponding to the video frame group, and obtain a target three-dimensional coordinate of the central joint point corresponding to the video frame group in the target scene;
and the first determining module is specifically configured to determine the central joint point, from among the joint points of the target object, as the parent joint point.
Based on the image generation device provided by the embodiment of the invention, the target three-dimensional coordinates of each joint point of the target object in the target scene can be determined through the two-dimensional coordinates of the target object in each video frame under multiple viewing angles and each conversion relation between the image coordinate system of each video frame and the three-dimensional coordinate system of the target scene. The target three-dimensional coordinates of the joint points of one target object in the target scene can represent the three-dimensional posture of the target object, namely the three-dimensional posture of the target object can be determined without the target object wearing a device for motion capture. Furthermore, the target video is generated based on the three-dimensional posture of the target object, so that the time cost and the labor cost for generating the video can be reduced, and the video generation efficiency is improved.
An embodiment of the present invention further provides an electronic device, as shown in fig. 15, including a processor 1501, a communication interface 1502, a memory 1503, and a communication bus 1504, where the processor 1501, the communication interface 1502, and the memory 1503 communicate with one another via the communication bus 1504;
the memory 1503 is configured to store a computer program;
the processor 1501 is configured to implement the steps of the image generation method of any of the above embodiments when executing the program stored in the memory 1503.
The communication bus of the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only one thick line is drawn, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The memory may include a random access memory (RAM) or a non-volatile memory, for example, at least one disk memory. Optionally, the memory may also be at least one storage device located remotely from the foregoing processor.
The processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
In yet another embodiment of the present invention, a computer-readable storage medium is further provided, in which a computer program is stored, and the computer program, when executed by a processor, implements the image generation method described in any of the above embodiments.
In yet another embodiment provided by the present invention, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the image generation method described in any of the above embodiments.
In the above embodiments, the implementation may be realized wholly or partially by software, hardware, firmware, or any combination thereof. When implemented in software, it may be realized wholly or partially in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in the embodiments of the present invention are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, they may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium may be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The available medium may be a magnetic medium (e.g., a floppy disk, a hard disk, a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium (e.g., a solid state disk (SSD)), among others.
It should be noted that, in this document, relational terms such as first and second are used solely to distinguish one entity or action from another, and do not necessarily require or imply any actual such relationship or order between those entities or actions. Moreover, the terms "comprises", "comprising", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element introduced by the phrase "comprising a …" does not exclude the presence of additional identical elements in the process, method, article, or apparatus that comprises the element.
All the embodiments in this specification are described in a related manner; for identical or similar parts among the embodiments, reference may be made to one another, and each embodiment focuses on its differences from the others. In particular, for the apparatus, electronic device, computer-readable storage medium, and computer program product embodiments, since they are substantially similar to the method embodiments, the description is relatively brief; for relevant details, reference may be made to the description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (12)

1. An image generation method, characterized in that the method comprises:
acquiring original videos of a target scene from multiple viewing angles, and acquiring a plurality of video frames with the same timestamp from the original videos to obtain a video frame group; wherein the timestamp of one video frame represents the position of the video frame in the original video;
performing posture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame;
for each joint point of the target object, calculating three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relation between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene;
and adjusting the three-dimensional coordinates of all joint points of the virtual object in each preset image according to the target three-dimensional coordinates of all joint points of the target object in the target scene corresponding to all video frame groups to obtain the target video containing the virtual object with the same action as the target object.
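By way of illustration, the flow of claim 1 can be strung together in Python as below. This is a minimal sketch under assumptions: detect_pose_2d, triangulate, and retarget are hypothetical stand-ins for the posture recognition, triangulation, and virtual-object adjustment steps, which the claim does not bind to any particular implementation.

import numpy as np


def generate_target_video(original_videos, conversions, detect_pose_2d, triangulate, retarget):
    # original_videos: one frame sequence per viewing angle, aligned by timestamp
    # conversions: per-view conversion relations between image and scene coordinates
    target_frames = []
    for frame_group in zip(*original_videos):  # video frames sharing one timestamp
        poses_2d = [detect_pose_2d(frame) for frame in frame_group]  # joints x 2 per view
        joints_3d = np.asarray([
            triangulate([pose[j] for pose in poses_2d], conversions)  # target 3D coordinates
            for j in range(len(poses_2d[0]))
        ])
        target_frames.append(retarget(joints_3d))  # pose the virtual object accordingly
    return target_frames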
2. The method according to claim 1, wherein before calculating, for each joint point of the target object, three-dimensional coordinates of the joint point in the target scene corresponding to the video frame group as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame in the video frame group and conversion relationships between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene, the method further comprises:
the same target object in each video frame in the video frame group is determined based on the two-dimensional coordinates of the joint points of the plurality of target objects in each video frame in the video frame group.
3. The method of claim 2, wherein determining the same target object in each video frame of the set of video frames based on two-dimensional coordinates of joint points of the plurality of target objects in each video frame of the set of video frames comprises:
for each target object, calculating the epipolar distance from each joint point of the target object to the corresponding epipolar plane, and calculating the mean of the epipolar distances of the joint points to obtain the distance mean corresponding to the target object; wherein the epipolar plane corresponding to one joint point represents the plane in which the joint point lies in the target scene;
calculating the similarity of two target objects in the two video frames based on the distance mean value corresponding to each two target objects in the two video frames to obtain a first similarity matrix; wherein one element in the first similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
inputting the two video frames into a pre-trained object matching model to obtain the similarity of every two target objects in the two video frames and obtain a second similarity matrix; wherein an element in the second similarity matrix represents: the probability that the corresponding two target objects in the two video frames are the same;
fusing the first similarity matrix and the second similarity matrix to obtain a target similarity matrix;
and determining the same target object in each video frame in the video frame group based on the target similarity matrix.
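A compact sketch of the fusion-and-matching step of claim 3 is given below. The equal-weight fusion and the use of the Hungarian algorithm (scipy.optimize.linear_sum_assignment) are assumptions made for illustration; the claim fixes neither the fusion rule nor the matching solver.

import numpy as np
from scipy.optimize import linear_sum_assignment


def match_target_objects(first_sim, second_sim, weight=0.5):
    # first_sim: similarity matrix derived from the epipolar distance means
    # second_sim: similarity matrix output by the object matching model
    # weight: hypothetical fusion weight between the two matrices
    target_sim = weight * np.asarray(first_sim) + (1.0 - weight) * np.asarray(second_sim)
    # higher similarity means more likely the same object; the solver minimizes
    # total cost, so negate the matrix to maximize total similarity instead
    rows, cols = linear_sum_assignment(-target_sim)
    return list(zip(rows.tolist(), cols.tolist()))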
4. The method according to claim 1, wherein said calculating, for each joint point of the target object, three-dimensional coordinates of the joint point corresponding to the video frame set in the target scene as target three-dimensional coordinates based on two-dimensional coordinates of the joint point in each video frame of the video frame set and each transformation relationship between an image coordinate system of each video frame of the video frame set and a three-dimensional coordinate system of the target scene comprises:
for each joint point of the target object, calculating a three-dimensional coordinate of the joint point in the target scene as a three-dimensional coordinate to be processed based on two-dimensional coordinates of the joint point in each two video frames in the video frame group and each conversion relation between an image coordinate system of the two video frames and a three-dimensional coordinate system of the target scene; calculating the average value of each three-dimensional coordinate to be processed as the initial three-dimensional coordinate of the joint point corresponding to the video frame group in the target scene;
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
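Assuming the conversion relations are expressed as 3x4 projection matrices (the claim leaves their exact form open), the pairwise triangulation and averaging of claim 4 can be sketched with the standard direct linear transform (DLT):

import numpy as np
from itertools import combinations


def triangulate_pair(pt_a, pt_b, proj_a, proj_b):
    # pt_a, pt_b: 2D coordinates (x, y) of one joint point in two video frames
    # proj_a, proj_b: 3x4 projection matrices for the two views (assumed form)
    A = np.stack([
        pt_a[0] * proj_a[2] - proj_a[0],
        pt_a[1] * proj_a[2] - proj_a[1],
        pt_b[0] * proj_b[2] - proj_b[0],
        pt_b[1] * proj_b[2] - proj_b[1],
    ])
    _, _, vt = np.linalg.svd(A)  # least-squares solution of A x = 0
    xyzw = vt[-1]
    return xyzw[:3] / xyzw[3]  # homogeneous -> one 3D coordinate to be processed


def initial_coordinates(points_2d, projections):
    # mean over all view pairs gives the initial 3D coordinates of the joint point
    candidates = [
        triangulate_pair(points_2d[i], points_2d[j], projections[i], projections[j])
        for i, j in combinations(range(len(points_2d)), 2)
    ]
    return np.mean(candidates, axis=0)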
5. The method of claim 4, wherein the performing posture recognition on each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame comprises:
for each video frame in the video frame group, inputting the video frame into a pre-trained posture recognition model to obtain the two-dimensional coordinates of each joint point of the target object in the video frame and the corresponding confidences; wherein the confidence corresponding to the two-dimensional coordinates of a joint point represents: the probability that the joint point is located, in the video frame, at the position represented by the two-dimensional coordinates;
the calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene includes:
and calculating the target three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene based on the confidence degree corresponding to the two-dimensional coordinates of the joint point in each video frame in the video frame group and the initial three-dimensional coordinates of the joint point corresponding to the video frame group in the target scene.
6. The method of claim 5, wherein calculating the target three-dimensional coordinates of the joint point corresponding to the video frame set in the target scene based on the confidence level corresponding to the two-dimensional coordinates of the joint point in each video frame of the video frame set and the initial three-dimensional coordinates of the joint point corresponding to the video frame set in the target scene comprises:
selecting a joint point from all joint points of the target object as a father joint point;
for each child joint point of the parent joint point, calculating the mean of the confidences corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as a first average confidence; wherein the child joint points of the parent joint point include: the joint points connected to the parent joint point among the joint points of the target object;
if the first average confidence is smaller than a preset threshold, calculating the offset of the target three-dimensional coordinates of the child joint point corresponding to a first video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the first video frame group as a first offset value; wherein the first video frame group includes: the video frames that are located, in each original video, before the video frames in the video frame group and at the same position as one another; and the target three-dimensional coordinates of a joint point corresponding to one video frame group are: the coordinates determined based on the two-dimensional coordinates of the joint point in each video frame in that video frame group;
calculating the offset of the target three-dimensional coordinates of the child joint point corresponding to a second video frame group relative to the target three-dimensional coordinates of the parent joint point corresponding to the second video frame group as a second offset value; wherein the second video frame group includes: the video frames that are located, in each original video, after the video frames in the video frame group and at the same position as one another;
calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene based on the first offset value, the second offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group;
and if the first average confidence is not smaller than the preset threshold, taking the initial three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene as the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
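For illustration, the branches of claim 6, together with the offset averaging detailed in claim 8 below, could look like the following Python sketch; the threshold value and the function and argument names are hypothetical.

import numpy as np

CONF_THRESHOLD = 0.5  # hypothetical preset threshold


def child_joint_target(confidences, initial_xyz, parent_target_xyz,
                       child_prev, parent_prev, child_next, parent_next):
    # confidences: per-view confidences of the child joint point's 2D coordinates
    # initial_xyz: the child's initial 3D coordinates for this video frame group
    # parent_target_xyz: the parent's target 3D coordinates for this frame group
    # *_prev / *_next: target 3D coordinates from the first and second video
    # frame groups (the preceding and following frame groups)
    first_average_confidence = float(np.mean(confidences))
    if first_average_confidence >= CONF_THRESHOLD:
        return np.asarray(initial_xyz)  # keep the triangulated coordinates
    first_offset = np.asarray(child_prev) - np.asarray(parent_prev)
    second_offset = np.asarray(child_next) - np.asarray(parent_next)
    average_offset = (first_offset + second_offset) / 2.0  # as in claim 8
    return np.asarray(parent_target_xyz) + average_offset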
7. The method of claim 6, wherein after calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame set in the target scene based on the first offset value, the second offset value, and the target three-dimensional coordinates of the parent joint point corresponding to the video frame set, the method further comprises:
and taking the child joint point of the target object as a new parent joint point, and returning to the step of calculating, for each child joint point of the parent joint point, the mean of the confidences corresponding to the two-dimensional coordinates of the child joint point in each video frame in the video frame group as the first average confidence, until the target three-dimensional coordinates of each joint point of the target object corresponding to the video frame group in the target scene are obtained.
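The iteration described in claim 7 amounts to walking the skeleton outward from the initial parent joint point; a minimal sketch is given below. The breadth-first order and the children_of mapping are assumptions, since the claim only requires that every joint point is eventually processed.

from collections import deque


def propagate_targets(root_joint, children_of, process_child):
    # children_of: mapping from a joint point to the joint points connected to it
    # process_child: computes one child's target 3D coordinates (e.g. per claim 6)
    queue = deque([root_joint])
    while queue:
        parent = queue.popleft()
        for child in children_of.get(parent, []):
            process_child(parent, child)  # first average confidence, gating, ...
            queue.append(child)  # the child becomes the next parent joint point
    # the loop ends once target coordinates of every joint point are obtained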
8. The method of claim 6, wherein calculating the target three-dimensional coordinates of the child joint point corresponding to the video frame set in the target scene based on the first offset value, the second offset value, and the target three-dimensional coordinates of the parent joint point corresponding to the video frame set comprises:
calculating a mean value of the first offset value and the second offset value as an average offset value;
and calculating the sum of the average offset value and the target three-dimensional coordinates of the parent joint point corresponding to the video frame group to obtain the target three-dimensional coordinates of the child joint point corresponding to the video frame group in the target scene.
9. The method of claim 6, wherein prior to said selecting one of the joint points of the target object as a parent joint point, the method further comprises:
determining a designated central joint point from the joint points of the target object;
calculating the mean of the confidences corresponding to the two-dimensional coordinates of the central joint point in each video frame in the video frame group as a second average confidence;
if the second average confidence is smaller than a preset threshold, calculating the average of the target three-dimensional coordinates of the central joint point corresponding to a third video frame group and the target three-dimensional coordinates of the central joint point corresponding to a fourth video frame group to obtain the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene; wherein the third video frame group includes: the video frames that are located, in each original video, before the video frames contained in the video frame group and at the same position as one another; and the fourth video frame group includes: the video frames that are located, in each original video, after the video frames contained in the video frame group and at the same position as one another;
if the second average confidence is not smaller than the preset threshold, taking the initial three-dimensional coordinates of the central joint point corresponding to the video frame group as the target three-dimensional coordinates of the central joint point corresponding to the video frame group in the target scene;
the selecting one joint point from the joint points of the target object as a parent joint point comprises the following steps:
and determining the central joint point as a parent joint point from all joint points of the target object.
10. An image generation apparatus, characterized in that the apparatus comprises:
the system comprises an acquisition module, a processing module and a display module, wherein the acquisition module is used for acquiring original videos of a target scene under multiple visual angles, and acquiring multiple video frames with the same timestamp from each original video to obtain a video frame group; wherein; the time stamp of one video frame represents the position of the video frame in the original video;
the recognition module is used for recognizing the gesture of each video frame in the video frame group to obtain two-dimensional coordinates of each joint point of the target object in the video frame;
a first determining module, configured to calculate, for each joint point of the target object, a three-dimensional coordinate of the joint point in the target scene corresponding to the video frame group as a target three-dimensional coordinate based on two-dimensional coordinates of the joint point in each video frame in the video frame group and each conversion relationship between an image coordinate system of each video frame in the video frame group and a three-dimensional coordinate system of the target scene;
and a generating module, configured to adjust the three-dimensional coordinates of the joint points of the virtual object in each preset image according to the target three-dimensional coordinates of the joint points of the target object in the target scene corresponding to the video frame groups, to obtain a target video containing a virtual object with the same action as the target object.
11. An electronic device, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another via the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any one of claims 1 to 9 when executing the program stored in the memory.
12. A computer-readable storage medium, characterized in that a computer program is stored in the computer-readable storage medium, which computer program, when being executed by a processor, carries out the method steps of any one of the claims 1-9.
CN202211168096.3A 2022-09-23 2022-09-23 Image generation method and device, electronic equipment and storage medium Pending CN115457176A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211168096.3A CN115457176A (en) 2022-09-23 2022-09-23 Image generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211168096.3A CN115457176A (en) 2022-09-23 2022-09-23 Image generation method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115457176A true CN115457176A (en) 2022-12-09

Family

ID=84306665

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211168096.3A Pending CN115457176A (en) 2022-09-23 2022-09-23 Image generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115457176A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115797851A (en) * 2023-02-09 2023-03-14 安徽米娱科技有限公司 Animation video processing method and system
CN116503958A (en) * 2023-06-27 2023-07-28 江西师范大学 Human body posture recognition method, system, storage medium and computer equipment
CN116503958B (en) * 2023-06-27 2023-10-03 江西师范大学 Human body posture recognition method, system, storage medium and computer equipment

Similar Documents

Publication Publication Date Title
US11210804B2 (en) Methods, devices and computer program products for global bundle adjustment of 3D images
US10789765B2 (en) Three-dimensional reconstruction method
US10334168B2 (en) Threshold determination in a RANSAC algorithm
CN115457176A (en) Image generation method and device, electronic equipment and storage medium
JP7227969B2 (en) Three-dimensional reconstruction method and three-dimensional reconstruction apparatus
WO2019035155A1 (en) Image processing system, image processing method, and program
JP2013524593A (en) Methods and configurations for multi-camera calibration
US20110187703A1 (en) Method and system for object tracking using appearance model
WO2020215283A1 (en) Facial recognition method, processing chip and electronic device
CN107809610B (en) Camera parameter set calculation device, camera parameter set calculation method, and recording medium
JP6985897B2 (en) Information processing equipment and its control method, program
CN110598590A (en) Close interaction human body posture estimation method and device based on multi-view camera
CN112200157A (en) Human body 3D posture recognition method and system for reducing image background interference
CN111144349A (en) Indoor visual relocation method and system
CN114022560A (en) Calibration method and related device and equipment
CN113132717A (en) Data processing method, terminal and server
CN112270736A (en) Augmented reality processing method and device, storage medium and electronic equipment
Kogler et al. Enhancement of sparse silicon retina-based stereo matching using belief propagation and two-stage postfiltering
CN113643366B (en) Multi-view three-dimensional object attitude estimation method and device
JP5987584B2 (en) Image processing apparatus, video projection system, and program
CN116580169B (en) Digital man driving method and device, electronic equipment and storage medium
CN117058183A (en) Image processing method and device based on double cameras, electronic equipment and storage medium
KR101673144B1 (en) Stereoscopic image registration method based on a partial linear method
CN116051736A (en) Three-dimensional reconstruction method, device, edge equipment and storage medium
CN115830217A (en) Method, device and system for generating point cloud of three-dimensional model of object to be modeled

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination