CN109948689B - Video generation method and device, electronic equipment and storage medium


Info

Publication number
CN109948689B
CN109948689B
Authority
CN
China
Prior art keywords
feature map
feature
image
processed
pixel point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910190595.4A
Other languages
Chinese (zh)
Other versions
CN109948689A (en)
Inventor
安世杰
张渊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd filed Critical Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910190595.4A
Publication of CN109948689A
Application granted
Publication of CN109948689B

Landscapes

  • Image Processing (AREA)

Abstract

The application relates to a video generation method, a video generation device, an electronic device and a storage medium, wherein the method comprises the following steps: acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed; calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point; acquiring a plurality of camera coordinate change values preset for an image to be processed; reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point; video corresponding to the plurality of images is generated from the plurality of images reconstructed in accordance with the plurality of camera coordinate change values. Therefore, the generated video has a three-dimensional effect, and the appreciation of the generated video is improved.

Description

Video generation method and device, electronic equipment and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to a video generation method and apparatus, an electronic device, and a storage medium.
Background
With the popularization of smart terminal devices and the wide use of their photographing functions, the owner of a smart terminal device can take photos anytime and anywhere, and then process the photos with photo processing software before publishing them on the Internet. For example, the owner of a smart terminal may use video processing software in the related art to turn a photo into a video and then distribute the video on the Internet, which can make the content more attractive.
In the related art, when a photo is processed into a video, the photo is usually cropped into pictures of different sizes, and the video is then generated according to a user-defined order of the cropped pictures. In this way, a video is generated from the photo.
However, since the related art only crops the photo, the generated video has no 3D stereoscopic effect, which reduces the appreciation of the generated video.
Disclosure of Invention
Embodiments of the present application provide a video generation method, an apparatus, an electronic device, and a storage medium, so that a generated video has a stereoscopic effect, and the generated video is more enjoyable. The specific technical scheme is as follows:
according to a first aspect of embodiments of the present application, there is provided a video generation method, including:
acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
acquiring a plurality of camera coordinate change values preset for an image to be processed;
reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point;
video corresponding to the plurality of images is generated from the plurality of images reconstructed in accordance with the plurality of camera coordinate change values.
Optionally, before obtaining the depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and the pixel coordinate value of each pixel point in the image to be processed, the video generation method further includes:
inputting an image to be processed to a feature extraction layer of a pre-trained depth estimation neural network, and extracting a low-level feature map, a middle-level feature map and a high-level feature map of the image to be processed, wherein the pre-trained depth estimation neural network further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting a multi-scale feature map of the image to be processed;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into a feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and inputting the feature map after feature fusion into a depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
Optionally, the depth estimation neural network obtained by pre-training is obtained by training a pre-selected training sample; pre-selecting training samples, comprising:
counting the number of samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected;
calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and selecting the samples to be selected from the sample set to be selected as training samples according to the selection probability of each sample set to be selected.
Optionally, the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, and the feature maps extracted by each feature map extraction sublayer have different scales;
optionally, inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting the multi-scale feature map of the image to be processed, including:
and simultaneously inputting the high-level feature map of the image to be processed into a plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
Optionally, the feature fusion layer includes: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain the feature map after feature fusion includes:
inputting the multi-scale feature map and the high-level feature map into a first feature fusion sublayer to obtain a first feature map after feature fusion;
inputting the first feature map and the middle-layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion;
and inputting the second feature map and the low-level feature map after feature fusion into a third feature fusion sublayer to obtain a feature map after feature fusion.
Optionally, obtaining a plurality of camera coordinate change values preset for the image to be processed includes:
acquiring a plurality of camera coordinate change values which are set in advance for an image to be processed and have a sequence;
optionally, generating a video corresponding to the plurality of images according to the plurality of images changed according to the plurality of camera coordinate change values includes:
sequencing the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sequenced images;
and generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
According to a second aspect of embodiments of the present application, there is provided a video generation apparatus, including:
the depth distance value acquisition module is used for acquiring a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
the camera coordinate calculation module is used for calculating the camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
the camera coordinate change value acquisition module is used for acquiring a plurality of camera coordinate change values which are preset for an image to be processed;
the image reconstruction module is used for reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point;
and the video generation module is used for generating videos corresponding to the plurality of images according to the plurality of images reconstructed according to the plurality of camera coordinate change values.
Optionally, in the video generating apparatus, the depth distance value acquisition module includes:
the feature extraction sub-module is used for inputting the image to be processed to a feature extraction layer of the pre-trained depth estimation neural network and extracting a low-level feature map, a middle-level feature map and a high-level feature map of the image to be processed, wherein the pre-trained depth estimation neural network further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
the multi-scale feature map acquisition submodule is used for inputting the high-level feature map into the multi-scale feature map extraction layer and extracting the multi-scale feature map of the image to be processed;
the feature fusion submodule is used for inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and the depth estimation submodule is used for inputting the feature map after feature fusion into the depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
Optionally, the pre-trained depth estimation neural network is obtained by training with pre-selected training samples; the video generating apparatus provided by the embodiment of the present application further includes a training sample selection module, and the training sample selection module includes:
the sample number counting submodule is used for counting the number of the samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of the samples to be selected in the plurality of sample sets to be selected;
the selection weight calculation submodule is used for calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
the selection probability calculation submodule is used for calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and the training sample selection submodule is used for selecting the samples to be selected from the sample set to be selected as the training samples according to the selection probability of each sample set to be selected.
Optionally, the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, the feature maps extracted by each feature map extraction sublayer have different scales, and the feature map acquisition submodule is specifically configured to:
and simultaneously inputting the high-level feature map into a plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
Optionally, the feature fusion layer includes: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer; a feature fusion submodule comprising:
the first feature fusion unit is used for inputting the multi-scale feature map and the high-level feature map into a first feature fusion sublayer to obtain a first feature map after feature fusion;
the second feature fusion unit is used for inputting the first feature map and the middle-layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion;
and the third feature fusion unit is used for inputting the second feature map and the low-layer feature map after feature fusion into a third feature fusion sublayer to obtain a feature map after feature fusion.
Optionally, the camera coordinate change value obtaining module is specifically configured to:
acquiring a plurality of camera coordinate change values which are set in advance for an image to be processed and have a sequence;
a video generation module comprising:
the image sorting submodule is used for sorting the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sorted images;
and the video generation submodule is used for generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
According to a third aspect of embodiments herein, there is provided an electronic device comprising a processor and a memory for storing processor-executable instructions;
wherein the processor is configured to:
acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
acquiring a plurality of camera coordinate change values preset for an image to be processed;
reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point;
video corresponding to the plurality of images is generated from the plurality of images reconstructed in accordance with the plurality of camera coordinate change values.
Optionally, the processor is further configured to:
inputting an image to be processed to a feature extraction layer of a pre-trained depth estimation neural network, and extracting a low-level feature map, a middle-level feature map and a high-level feature map of the image to be processed, wherein the pre-trained depth estimation neural network further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting a multi-scale feature map of the image to be processed;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into a feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and inputting the feature map after feature fusion into a depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
Optionally, the processor is further configured to:
counting the number of samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected;
calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and selecting the samples to be selected from the sample set to be selected as training samples according to the selection probability of each sample set to be selected.
Optionally, the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, and the feature maps extracted by each feature map extraction sublayer have different scales, and the processor is further configured to:
and simultaneously inputting the high-level feature map of the image to be processed into a plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
Optionally, the feature fusion layer includes: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer;
a processor further configured to:
inputting the multi-scale feature map and the high-level feature map into a first feature fusion sublayer to obtain a first feature map after feature fusion;
inputting the first feature map and the middle-layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion;
and inputting the second feature map and the low-level feature map after feature fusion into a third feature fusion sublayer to obtain a feature map after feature fusion.
Optionally, the processor is further configured to:
acquiring a plurality of camera coordinate change values which are set in advance for an image to be processed and have a sequence;
optionally, the processor is further configured to:
sequencing the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sequenced images;
and generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
According to a fourth aspect of embodiments herein, there is provided a non-transitory computer readable storage medium having instructions stored thereon that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform a video generation method, the method comprising:
acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
acquiring a plurality of camera coordinate change values preset for an image to be processed;
reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point;
a video corresponding to the plurality of images is generated from the plurality of images reconstructed in accordance with the plurality of camera coordinate change values.
Optionally, before obtaining the depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and the pixel coordinate value of each pixel point in the image to be processed, the video generation method further includes:
inputting an image to be processed to a feature extraction layer of a pre-trained depth estimation neural network, and extracting a low-level feature map, a middle-level feature map and a high-level feature map of the image to be processed, wherein the pre-trained depth estimation neural network further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting a multi-scale feature map of the image to be processed;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into a feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and inputting the feature map after feature fusion into a depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
Optionally, the depth estimation neural network obtained by pre-training is obtained by training a pre-selected training sample; pre-selecting training samples, comprising:
counting the number of samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected;
calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and selecting the samples to be selected from the sample set to be selected as training samples according to the selection probability of each sample set to be selected.
Optionally, the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, and the feature maps extracted by each feature map extraction sublayer have different scales;
optionally, inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting the multi-scale feature map of the image to be processed, including:
and simultaneously inputting the high-level feature map of the image to be processed into a plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
Optionally, the feature fusion layer includes: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain the feature map after feature fusion includes:
inputting the multi-scale feature map and the high-level feature map into a first feature fusion sublayer to obtain a first feature map after feature fusion;
inputting the first feature map and the middle layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion;
and inputting the second feature map and the low-level feature map after feature fusion into a third feature fusion sublayer to obtain a feature map after feature fusion.
Optionally, obtaining a plurality of camera coordinate change values preset for the image to be processed includes:
acquiring a plurality of camera coordinate change values which are set in advance for an image to be processed and have a sequence;
optionally, generating a video corresponding to the plurality of images according to the plurality of images changed according to the plurality of camera coordinate change values includes:
sequencing the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sequenced images;
and generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
According to a fifth aspect of embodiments of the present application, there is also provided a program product containing instructions, which when run on an electronic device, causes the electronic device to perform the steps of the video generation method provided in the first aspect.
According to a sixth aspect of embodiments of the present application, there is also provided a computer program, which, when run on an electronic device, causes the electronic device to perform the steps of the video generation method provided in the first aspect.
According to the video generation method, the video generation apparatus, the electronic device and the storage medium provided by the embodiments of the present application, when a video is generated based on an image to be processed, a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed can be obtained first; a camera coordinate value of each pixel point is calculated according to the pixel coordinate value of the pixel point and the depth distance value of the pixel point; then, a plurality of camera coordinate change values preset for the image to be processed are acquired; for each camera coordinate change value, an image changed according to that camera coordinate change value is reconstructed from the camera coordinate value of each pixel point; finally, a video corresponding to the plurality of images is generated from the plurality of images reconstructed according to the plurality of camera coordinate change values. Thus, when a video corresponding to an image is generated from the image to be processed, the image to be processed is processed by combining the depth distance value of each pixel point in the image to be processed with each camera coordinate change value, so that the change of the pixel coordinates of each pixel point in a changed image is based on the depth distance value of that pixel point and the camera coordinate change value corresponding to that changed image. Therefore, the generated video has a three-dimensional effect, the appreciation of the generated video is improved, and diversified application of the image to be processed can be realized. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the application, as all of the above advantages may not necessarily be achieved in any one product or method of practicing the application.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present application, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and, together with the description, serve to explain the principles of the application.
FIG. 1 is a flow chart illustrating a first implementation of a video generation method in accordance with an exemplary embodiment;
FIG. 2 is a schematic block diagram illustrating a first implementation of a video generation apparatus in accordance with an illustrative embodiment;
FIG. 3 is a flow chart illustrating a second implementation of a video generation method in accordance with an exemplary embodiment;
FIG. 4 is a schematic diagram of a depth estimation neural network in one video generation method shown in FIG. 1;
FIG. 5 is a schematic diagram illustrating a second implementation of a video generation apparatus in accordance with an illustrative embodiment;
FIG. 6 is a block diagram illustrating a mobile terminal in accordance with an exemplary embodiment;
FIG. 7 is a schematic diagram illustrating a configuration of a server according to an exemplary embodiment.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
In order to solve the problems in the related art, embodiments of the present application provide a video generation method, an apparatus, an electronic device, and a storage medium, so that a generated video has a stereoscopic effect, and the appreciation of the generated video is improved. Next, a video generation method according to an embodiment of the present application will be described first.
Example one
As shown in fig. 1, a flow chart of a first implementation of a video generation method according to an exemplary embodiment is shown, where the method may include:
s110, obtaining a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed.
In some examples, in order to enable the generated video to have a stereoscopic effect and improve the appreciation of the generated video, a video generation method of an embodiment of the present application may generate the video in combination with depth information of an image to be processed.
For this, the image to be processed may be first input into a depth estimation neural network obtained by pre-training, so that the depth estimation neural network obtained by pre-training estimates a depth distance value of each pixel point in the image to be processed, and thus depth information of the image to be processed, that is, the depth distance value of each pixel point in the image to be processed, may be obtained.
In some examples, the pre-trained depth estimation neural network is trained from pre-selected training samples, where each training sample includes an original image sample and a depth map sample corresponding to the original image sample.
And S120, calculating the camera coordinate of each pixel point in the image to be processed according to the pixel coordinate of each pixel point in the pixel coordinate system and the depth distance value of the pixel point.
In some examples, when calculating the camera coordinates of each pixel point in the image to be processed, the following steps may be adopted for calculation:
step A, converting the pixel coordinate of each pixel point in the image to be processed in the pixel coordinate system into the image coordinate of the pixel point in the image coordinate system.
In some examples, each pixel point in the image to be processed has a pixel coordinate that is a coordinate in a pixel coordinate system, which may be a coordinate system with the upper left corner of the image to be processed as the origin of coordinates in some examples.
In still other examples, for an image coordinate system, the origin of coordinates of the image coordinate system is the center point of the image to be processed. In order to calculate the coordinates of the photographed object corresponding to each pixel point in the camera coordinate system, the pixel coordinates of each pixel point in the image to be processed in the pixel coordinate system may be converted into the image coordinates of the pixel point in the image coordinate system.
And step B, calculating the camera coordinates of the pixel points in a camera coordinate system according to the image coordinates of the pixel points, the depth distance values of the pixel points and the focal length of the image to be processed, and obtaining the camera coordinates of the pixel points in the image to be processed.
After the image coordinates of each pixel point in the image to be processed are obtained, for each pixel point, the camera coordinates of the pixel point in the camera coordinate system can be calculated according to the image coordinates (x, y) of the pixel point, the depth distance value Zc of the pixel point, and the focal length of the image to be processed, so that the camera coordinates of each pixel point in the image to be processed are obtained.
In some examples, the camera coordinates (Xc, Yc, Zc) of the pixel point may be calculated according to the following formula:
Xc = x · Zc / f,  Yc = y · Zc / f, where Zc is the depth distance value of the pixel point and f is the focal length of the image to be processed.
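As an illustration of this back-projection step, below is a minimal NumPy sketch, assuming the image coordinate system is centered on the image and the focal length f is expressed in pixels; the function and array names are illustrative only.

```python
import numpy as np

def pixels_to_camera_coords(depth, f):
    """Back-project every pixel of a depth map into camera coordinates.

    depth: (H, W) array of depth distance values Zc (one per pixel).
    f:     focal length expressed in pixels.
    Returns an (H, W, 3) array of camera coordinates (Xc, Yc, Zc).
    """
    h, w = depth.shape
    # Pixel coordinates with the origin at the top-left corner.
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    # Convert to image coordinates whose origin is the image center.
    x = u - (w - 1) / 2.0
    y = v - (h - 1) / 2.0
    # Pinhole model: Xc = x * Zc / f, Yc = y * Zc / f.
    xc = x * depth / f
    yc = y * depth / f
    return np.stack([xc, yc, depth], axis=-1)
```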
S130, a plurality of camera coordinate change values set in advance for the image to be processed are acquired.
And S140, reconstructing the image changed according to the camera coordinate change value and the camera coordinate value of each pixel point.
In some examples, to enable generation of a video, a plurality of camera coordinate change values may be set for the image to be processed. For each camera coordinate change value (ΔXc, ΔYc, ΔZc), together with the camera coordinates (Xc, Yc, Zc) of each pixel point in the image to be processed and the focal length f of the image to be processed, the following formula may be adopted:
u = f · (Xc + ΔXc) / (Zc + ΔZc),  v = f · (Yc + ΔYc) / (Zc + ΔZc)
to calculate the changed image coordinates (u, v) of each pixel point of the image to be processed in the image coordinate system, after which the changed image coordinates of each pixel point are converted into changed pixel coordinates.
In some examples, a camera coordinate change value may be set to be positive or negative. For example, when the camera coordinate changes to the left, ΔXc is positive, and when it changes to the right, ΔXc is negative; when the camera coordinate changes upward, ΔYc is positive, and when it changes downward, ΔYc is negative; when the camera coordinate moves away from the camera, ΔZc is positive, and when it moves toward the camera, ΔZc is negative.
In some examples, the above description is merely exemplary, and the plus or minus of the camera coordinate change value may also be set in other manners.
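To make the reprojection concrete, below is a minimal NumPy sketch that applies one camera coordinate change value to the camera coordinates computed above and projects the result back to pixel coordinates; it assumes the change value is added to the camera coordinates as in the formula above, and all names are illustrative.

```python
import numpy as np

def reproject_with_camera_change(cam_coords, delta, f, image_shape):
    """Project camera coordinates, shifted by a camera coordinate change
    value, back to pixel coordinates.

    cam_coords:  (H, W, 3) array of (Xc, Yc, Zc) from the previous step.
    delta:       (dXc, dYc, dZc) camera coordinate change value.
    f:           focal length in pixels.
    image_shape: (H, W) of the image to be processed.
    Returns an (H, W, 2) array of changed pixel coordinates (col, row).
    """
    h, w = image_shape
    dx, dy, dz = delta
    xc = cam_coords[..., 0] + dx
    yc = cam_coords[..., 1] + dy
    zc = cam_coords[..., 2] + dz  # assumed nonzero after the change
    # Changed image coordinates (u, v), then back to pixel coordinates
    # whose origin is the top-left corner.
    u = f * xc / zc
    v = f * yc / zc
    cols = u + (w - 1) / 2.0
    rows = v + (h - 1) / 2.0
    return np.stack([cols, rows], axis=-1)
```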
After the changed pixel coordinates of each pixel point in the image to be processed corresponding to a camera coordinate change value are obtained, the image to be processed may be reconstructed according to the changed pixel coordinates of each pixel point, so as to form an image changed according to that camera coordinate change value. In this way, the original image to be processed can be reconstructed, and images similar to the image to be processed can be presented on the basis of a single image.
Since there are a plurality of camera coordinate change values, a plurality of images similar to the image to be processed can be obtained. That is, a plurality of images changed in accordance with a plurality of camera coordinate change values can be obtained.
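A simple way to form each changed image is to forward-warp every source pixel to its changed pixel coordinate. The sketch below does this with nearest-pixel rounding and leaves unmapped pixels empty; hole filling and occlusion handling are omitted, and the names are illustrative.

```python
import numpy as np

def reconstruct_image(image, changed_coords):
    """Forward-warp the image to be processed onto the changed pixel grid.

    image:          (H, W, 3) source image.
    changed_coords: (H, W, 2) changed pixel coordinates (col, row) per pixel.
    Returns the reconstructed (H, W, 3) image; unmapped pixels stay zero.
    """
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    cols = np.rint(changed_coords[..., 0]).astype(int)
    rows = np.rint(changed_coords[..., 1]).astype(int)
    valid = (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    out[rows[valid], cols[valid]] = image[valid]
    return out
```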
S150, a video corresponding to the plurality of images is generated from the plurality of images reconstructed according to the plurality of camera coordinate change values.
After obtaining a plurality of images changed according to the plurality of camera coordinate change values, the plurality of images may be generated into a video.
In some examples, the plurality of images may also be combined with the image to be processed to generate a video.
In some examples, the method of generating a video using multiple images in the related art may be adopted, and the multiple images and the image to be processed are used together to generate the video.
In still other examples, the plurality of camera coordinate change values may be preset with a precedence order. To this end, the following steps may be taken to generate a video:
step A, sequencing a plurality of images according to the sequence of the coordinate change values of a plurality of cameras to obtain a plurality of sequenced images;
and B, generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
After obtaining the plurality of images changed according to the plurality of camera coordinate change values, in order to make an order of the respective image frames in the generated video identical to an order of the plurality of camera coordinate change values, the plurality of images may be sorted according to the order of the plurality of camera coordinate change values, and then a video corresponding to the plurality of sorted images may be generated according to the sorted plurality of images.
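As an illustration, the sorted frames can be written out with OpenCV as sketched below; the codec, frame rate, output path and whether to prepend the original image are assumptions, not requirements of the method.

```python
import cv2

def write_video(frames_in_order, path="generated.mp4", fps=25):
    """Write the frames, already sorted by the order of the camera
    coordinate change values, into a video file."""
    h, w = frames_in_order[0].shape[:2]
    fourcc = cv2.VideoWriter_fourcc(*"mp4v")
    writer = cv2.VideoWriter(path, fourcc, fps, (w, h))
    for frame in frames_in_order:
        writer.write(frame)  # frames are expected as BGR uint8 arrays
    writer.release()
```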
According to the video generation method, when a video is generated based on an image to be processed, a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed can be obtained; calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point; then, acquiring a plurality of camera coordinate change values preset for an image to be processed; reconstructing an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel point; finally, a video corresponding to the plurality of images is generated from the plurality of images reconstructed in accordance with the plurality of camera coordinate change values. Therefore, when a video corresponding to an image is generated according to the image to be processed, the image to be processed is processed by combining the depth distance value of each pixel point in the image to be processed and the coordinate change value of each camera, so that the change of the pixel coordinate of each pixel point in the changed image is changed based on the depth distance value of the pixel point and the coordinate change value of the camera corresponding to the changed image. Therefore, the generated video has a three-dimensional effect, the appreciation of the generated video is improved, and diversified application of the to-be-processed image can be realized.
Example two
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides a video generating apparatus, as shown in fig. 2, which is a schematic structural diagram of a first implementation of a video generating apparatus according to an exemplary embodiment, and the apparatus may include:
a depth distance value obtaining module 210, configured to obtain a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
the camera coordinate calculation module 220 is configured to calculate a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
a camera coordinate change value acquisition module 230 configured to acquire a plurality of camera coordinate change values set in advance for the image to be processed;
an image reconstruction module 240, configured to reconstruct an image changed according to the camera coordinate change value according to each camera coordinate change value and the camera coordinate value of each pixel;
a video generating module 250, configured to generate a video corresponding to the plurality of images according to the plurality of images reconstructed according to the plurality of camera coordinate change values.
In some examples, the camera coordinate change value acquisition module 230 is specifically configured to:
acquiring a plurality of camera coordinate change values which are set in advance for an image to be processed and have a sequence;
in some examples, the video generation module 250 may include:
the image sorting submodule is used for sorting the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sorted images;
and the video generation submodule is used for generating a video corresponding to the plurality of sequenced images according to the plurality of sequenced images.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
With the video generation apparatus provided by the embodiment of the present application, when a video is generated based on an image to be processed, a depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and a pixel coordinate value of each pixel point in the image to be processed can be obtained first; a camera coordinate value of each pixel point is calculated according to the pixel coordinate value of the pixel point and the depth distance value of the pixel point; then, a plurality of camera coordinate change values preset for the image to be processed are acquired; for each camera coordinate change value, an image changed according to that camera coordinate change value is reconstructed from the camera coordinate value of each pixel point; finally, a video corresponding to the plurality of images is generated from the plurality of images reconstructed according to the plurality of camera coordinate change values. Thus, when a video corresponding to an image is generated from the image to be processed, the image to be processed is processed by combining the depth distance value of each pixel point in the image to be processed with each camera coordinate change value, so that the change of the pixel coordinates of each pixel point in a changed image is based on the depth distance value of that pixel point and the camera coordinate change value corresponding to that changed image. Therefore, the generated video has a three-dimensional effect, the appreciation of the generated video is improved, and diversified application of the image to be processed can be realized.
Example three
On the basis of a video generation method shown in fig. 1, a possible implementation manner is further provided in the embodiments of the present application to improve accuracy of obtaining a depth distance value, as shown in fig. 3, which is a flowchart according to a second implementation manner of a video generation method shown in an exemplary embodiment, and before obtaining a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed in S110, the method may further include:
and S111, inputting the image to be processed into a feature extraction layer of the depth estimation neural network obtained by pre-training, and extracting a low-layer feature map, a middle-layer feature map and a high-layer feature map of the image to be processed.
The low-level feature map includes edge information, corner information, texture information and color information of the image to be processed; the middle-level feature map includes geometric information of the image to be processed; and the high-level feature map includes semantic information of the image to be processed. The pre-trained depth estimation neural network is obtained by training with pre-selected training samples, and each training sample includes an original image sample and a depth map sample corresponding to the original image sample.
In some examples, the pre-trained depth estimation neural network further comprises: the system comprises a multi-scale feature extraction layer, a feature fusion layer and a depth estimation layer.
In some examples, the geometric information may include: triangular, circular, rectangular, diamond, etc. The semantic information may include: characters, buildings, sky, etc.
For a more clear explanation of the embodiment of the present application, the following description is made with reference to a schematic structural diagram of the depth estimation neural network shown in fig. 4, and as shown in fig. 4, the depth estimation neural network obtained by pre-training may include: a feature extraction layer 410, a multi-scale feature map extraction layer 420, a feature fusion layer 430, and a depth estimation layer 440.
After the image to be processed is input into the pre-trained feature extraction layer 410, the feature extraction layer 410 may output a low-level feature map, a middle-level feature map, and a high-level feature map of the image to be processed.
In some examples, the feature extraction layer may include: a first feature extraction sublayer, a second feature extraction sublayer, a third feature extraction sublayer, and a fourth feature extraction sublayer.
In some examples, the number of input channels and the number of output channels of the first feature extraction sublayer are 3 and 16, the number of input channels and the number of output channels of the second feature extraction sublayer are 16 and 32, the number of input channels and the number of output channels of the third feature extraction sublayer are 32 and 64, and the number of input channels and the number of output channels of the fourth feature extraction sublayer are 64 and 128.
In still other examples, the image to be processed may be input into a first feature extraction sublayer obtained through pre-training to obtain a 16-channel feature map, then the 16-channel feature map may be input into a second feature extraction sublayer obtained through pre-training to obtain a 32-channel low-level feature map, the 32-channel low-level feature map may be input into a third feature extraction sublayer obtained through pre-training to obtain a 64-channel middle-level feature map, and finally, the 64-channel middle-level feature map may be input into a fourth feature extraction sublayer obtained through pre-training to obtain a 128-channel high-level feature map.
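For illustration, a PyTorch sketch of a feature extraction layer with the channel counts described above (3 → 16 → 32 → 64 → 128) is given below; the kernel sizes, strides and normalization are assumptions, since the text only specifies the channel numbers.

```python
import torch.nn as nn

class FeatureExtractionLayer(nn.Module):
    """Four feature extraction sublayers with channels 3->16->32->64->128."""

    def __init__(self):
        super().__init__()

        def sublayer(c_in, c_out):
            # Assumed structure: stride-2 convolution + BN + ReLU per sublayer.
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(c_out),
                nn.ReLU(inplace=True),
            )

        self.sub1 = sublayer(3, 16)
        self.sub2 = sublayer(16, 32)
        self.sub3 = sublayer(32, 64)
        self.sub4 = sublayer(64, 128)

    def forward(self, x):
        f16 = self.sub1(x)     # 16-channel feature map
        low = self.sub2(f16)   # 32-channel low-level feature map
        mid = self.sub3(low)   # 64-channel middle-level feature map
        high = self.sub4(mid)  # 128-channel high-level feature map
        return low, mid, high
```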
In some examples, when the depth estimation neural network is trained in advance, a plurality of training samples may be selected in advance for training, for example, a training sample in an indoor scene may be selected, a training sample in an outdoor scene may be selected, a training sample in a manned scene may be selected, or a training sample in an unmanned scene may be selected.
In some examples, one neural network may be trained for each scenario, or one neural network may be trained for all scenarios.
When a neural network is trained for all scenes, the number of training samples in each scene in all scenes may be the same or different, and in order to enable the depth estimation neural network obtained after training to be applicable to various scenes and to obtain a better estimation result for each scene in the various scenes, the training samples may be selected according to the following steps:
step A, counting the number of samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected.
In some examples, one sample set may be set for each scene, e.g., one indoor scene sample set may be set for an indoor scene, one outdoor scene sample set may be set for an outdoor scene, one unmanned scene sample set may be set for an unmanned scene, and one manned scene sample set may be set for a manned scene.
In some examples, an unmanned scene means that there is no person in either the original image or the depth map of the samples in the unmanned scene sample set.
When training samples are selected from the plurality of sets of samples to be selected, the number of samples to be selected in each set of samples to be selected and the total number of all samples to be selected in the plurality of sets of samples to be selected may be counted first.
For example, if the number of the to-be-selected samples in the indoor scene sample set is 5000, the number of the to-be-selected samples in the outdoor scene sample set is 4000, the number of the to-be-selected samples in the unmanned scene sample set is 3000, and the number of the to-be-selected samples in the manned scene sample set is 2000, the total number of the to-be-selected samples in the plurality of to-be-selected sample sets is 14000.
And B, calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected.
After the total number of all samples to be selected in the plurality of sample sets to be selected is obtained through statistics, for each sample set to be selected, the selection weight of that sample set can be calculated according to the number of samples to be selected in it and the total number of all samples to be selected.
For example, according to the number of samples to be selected in each sample set and the total number of 14000 samples to be selected, the selection weight of the indoor scene sample set may be calculated as 14000/5000 = 2.8, the selection weight of the outdoor scene sample set as 14000/4000 = 3.5, the selection weight of the unmanned scene sample set as 14000/3000 = 14/3, and the selection weight of the manned scene sample set as 14000/2000 = 7.
And step C, calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected.
After the selection weight of each sample set to be selected is obtained through calculation, the selection probability of each sample set to be selected can be calculated according to the selection weight of that sample set and the number of samples to be selected in it, so that training samples can then be selected from the sample sets according to these selection probabilities.
For example, according to the selection weight 2.8 of the indoor scene sample set and its 5000 samples to be selected, the selection weight 3.5 of the outdoor scene sample set and its 4000 samples to be selected, the selection weight 14/3 of the unmanned scene sample set and its 3000 samples to be selected, and the selection weight 7 of the manned scene sample set and its 2000 samples to be selected, the selection probability of the indoor scene sample set may be calculated as: 2.8/(5000 × 2.8 + 4000 × 3.5 + 3000 × 14/3 + 2000 × 7) = 2.8/56000 = 5 × 10⁻⁵; the selection probability of the outdoor scene sample set is 3.5/56000 = 6.25 × 10⁻⁵; the selection probability of the unmanned scene sample set is (14/3)/56000 ≈ 8.33 × 10⁻⁵; and the selection probability of the manned scene sample set is 7/56000 = 1.25 × 10⁻⁴.
And D, selecting the samples to be selected from the sample set to be selected as training samples according to the selection probability of each sample set to be selected.
After the selection probability of each sample set to be selected is obtained through calculation, samples to be selected can be selected from each sample set as training samples according to the selection probability of that sample set.
For example, samples to be selected may be selected from the indoor scene sample set with a selection probability of 5 × 10⁻⁵, from the outdoor scene sample set with a selection probability of 6.25 × 10⁻⁵, from the unmanned scene sample set with a selection probability of 8.33 × 10⁻⁵, and from the manned scene sample set with a selection probability of 1.25 × 10⁻⁴.
In some examples, when training samples are selected in this way, the overall probability of drawing from each sample set is the same. For example, the probability of selecting the indoor scene sample set is 5 × 10⁻⁵ × 5000 = 0.25, the probability of selecting the outdoor scene sample set is 6.25 × 10⁻⁵ × 4000 = 0.25, the probability of selecting the unmanned scene sample set is 8.33 × 10⁻⁵ × 3000 ≈ 0.25, and the probability of selecting the manned scene sample set is 1.25 × 10⁻⁴ × 2000 = 0.25.
In this way, when training samples are selected from the plurality of sample sets, the number of training samples drawn from each sample set can be kept approximately the same, so that after the depth estimation neural network is trained with these training samples, the trained network can achieve the same or similar estimation accuracy for images to be processed from the various scenes.
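The selection procedure in steps A to D can be sketched as follows, assuming the selection weight of a sample set is the total number of candidates divided by the number of candidates in that set, and the per-sample selection probability is the weight normalized by the sum of weight × count over all sets, as in the worked example above; the function and variable names are illustrative.

```python
import random

def select_training_samples(sample_sets, num_samples):
    """Select training samples from several candidate sample sets.

    sample_sets: dict mapping a scene name to its list of candidate samples.
    Weight of a set            = total_count / count_of_that_set.
    Per-sample probability     = weight / sum(weight_k * count_k).
    """
    counts = {name: len(samples) for name, samples in sample_sets.items()}
    total = sum(counts.values())
    weights = {name: total / c for name, c in counts.items()}
    denom = sum(weights[name] * counts[name] for name in counts)
    probs = {name: weights[name] / denom for name in counts}

    # Draw a set in proportion to (per-sample probability * set size),
    # which is the same for every set, then draw a sample uniformly.
    names = list(sample_sets)
    set_probs = [probs[n] * counts[n] for n in names]
    chosen = []
    for _ in range(num_samples):
        name = random.choices(names, weights=set_probs, k=1)[0]
        chosen.append(random.choice(sample_sets[name]))
    return chosen
```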
In some examples, when training the depth estimation neural network, a stochastic gradient descent method may be employed to update the parameters of the depth estimation neural network.
In some examples, when updating the parameters of the depth estimation neural network using stochastic gradient descent, the parameters may be updated using the following loss functions:
[Formula images: the global loss function li, the plane (gradient) loss function lgrad, and the normal loss function lnormal.]
where x̄ is the average predicted depth over all pixel points of the original image sample, ȳ is the average actual depth over all pixel points of the depth map sample corresponding to the image to be processed, xi is the predicted depth value of the i-th pixel point, ci is the actual depth value of the i-th pixel point in the depth map sample, li is the global loss function, lgrad is the plane (gradient) loss function, lnormal is the normal loss function, n is the total number of pixel points in the image to be processed, and α is a preset adjustment parameter. The normal may be the direction perpendicular to the plane of the image to be processed.
In still other examples, when the values of the three loss functions differ greatly, different weights may be set for the three loss functions to balance them; for example, assuming the values of the three loss functions are 0.2, 0.01 and 0.006, respectively, they may be given weights of 1, 10 and 100, respectively, to reduce the difference between their contributions. In this way, the prediction of depth at boundaries can be properly promoted during training, and the hierarchy of the depths predicted by the model is ensured.
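As a rough illustration of this loss balancing (a hypothetical sketch; the weights 1, 10 and 100 follow the example above, and l_depth, l_grad and l_normal stand for loss values computed elsewhere):

```python
import torch

def total_loss(l_depth, l_grad, l_normal, w_depth=1.0, w_grad=10.0, w_normal=100.0):
    # Weighted sum of the three loss terms; the weights compensate for their
    # different magnitudes so that no single term dominates training.
    return w_depth * l_depth + w_grad * l_grad + w_normal * l_normal

# Terms of magnitude 0.2, 0.01 and 0.006 are brought onto a comparable scale.
print(total_loss(torch.tensor(0.2), torch.tensor(0.01), torch.tensor(0.006)))  # tensor(0.9000)
```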
In still other examples, the depth estimation results generated during the training process may also be evaluated, for example by computing the root mean square error RMS and the relative error REL of the depth estimation results [formula image in the original publication].
By evaluating the depth estimation results, it can be determined when to stop training to reduce training time costs.
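A minimal sketch of such an evaluation, assuming the standard definitions of RMS and REL over the predicted depths x_i and the actual depths c_i (the exact formula in the original filing is given only as an image):

```python
import numpy as np

def rms(pred: np.ndarray, gt: np.ndarray) -> float:
    # Root mean square error between predicted and actual depth values.
    return float(np.sqrt(np.mean((pred - gt) ** 2)))

def rel(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean relative error, normalised by the actual depth of each pixel.
    return float(np.mean(np.abs(pred - gt) / gt))

# Training can be stopped once RMS/REL on a validation set stop improving.
```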
And S112, inputting the high-level feature map into a multi-scale feature map extraction layer, and extracting the multi-scale feature map of the image to be processed.
In some examples, after obtaining the low-level feature map, the middle-level feature map, and the high-level feature map of the image to be processed, the high-level feature map of the image to be processed may be input to the multi-scale feature map extraction layer of the depth estimation neural network obtained through pre-training, so as to extract the multi-scale feature map of the image to be processed.
In some examples, after obtaining the multi-scale feature map of the image to be processed, in order to enable the multi-scale feature map to be input into the feature fusion layer, scale fusion may be performed on the multi-scale feature map, so that feature maps of different scales are fused into feature maps of the same scale.
In some examples, the high-level feature maps of the image to be processed may be sequentially input into the multi-scale feature map extraction layer 420 shown in fig. 4. For example, a high-level feature map may be input into a first multi-scale feature map extraction sublayer in the multi-scale feature map extraction layer 420, then the high-level feature map may be input into a second multi-scale feature map extraction sublayer in the multi-scale feature map extraction layer 420, the high-level feature map may be input into a third multi-scale feature map extraction sublayer in the multi-scale feature map extraction layer 420, and finally the high-level feature map may be input into a fourth multi-scale feature map extraction sublayer in the multi-scale feature map extraction layer 420.
In some examples, the first, second, third, and fourth multi-scale feature map extraction sublayers may output 16 × 16, 8 × 8, 4 × 4, and 1 × 1 feature maps, respectively.
In some examples, the number of input channels and the number of output channels of the feature map extraction sublayers of the four scales may be the same, for example, the number of input channels and the number of output channels of the feature map extraction sublayers of the four scales may be 128.
By providing feature map extraction sublayers of different scales, the multi-scale feature map extraction layer 420 can extract both local features and global features of the image to be processed: the sublayers with scales of 16 × 16, 8 × 8 and 4 × 4 extract local features within their receptive fields, while the sublayer with the scale of 1 × 1 extracts global features of the image to be processed. The local features may reflect the relative positions between individual features in the image to be processed, and the global features may reflect the relative positions between the local features.
After feature maps of different scales are obtained, in order to reduce the number of output channels of the multi-scale feature map extraction layer 420, the feature maps of the multiple scales may be input into the multi-scale feature map fusion sublayer, which outputs a scale-fused feature map with a reduced number of channels.
In some examples, when the high-level feature map of the image to be processed is input into the multi-scale feature map extraction layer of the depth estimation neural network obtained through pre-training, the high-level feature map of the image to be processed may be simultaneously input into a plurality of feature map extraction sublayers, so that a feature map of a scale corresponding to each feature map extraction sublayer may be obtained.
For example, assuming that the feature maps with scales of 16 × 16, 8 × 8, 4 × 4 and 1 × 1 each have 128 channels, the multi-scale feature map fusion sublayer may have 512 input channels and 128 output channels, so that the four feature maps are fused along the channel dimension and the number of output channels of the multi-scale feature map is reduced.
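One common way to realise such a layer is an adaptive-pooling pyramid; the following PyTorch sketch is only an illustration under that assumption (not the patented implementation), with the 16 × 16 / 8 × 8 / 4 × 4 / 1 × 1 scales and the 128/512 channel counts taken from the example above:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiScaleExtraction(nn.Module):
    def __init__(self, channels=128, scales=(16, 8, 4, 1)):
        super().__init__()
        # One extraction sublayer per scale: pool to the target resolution, then a
        # 1x1 convolution with equal input and output channel counts (128 -> 128).
        self.branches = nn.ModuleList(
            nn.Sequential(nn.AdaptiveAvgPool2d(s),
                          nn.Conv2d(channels, channels, kernel_size=1),
                          nn.ReLU(inplace=True))
            for s in scales
        )
        # Fusion sublayer: 4 * 128 = 512 input channels fused down to 128.
        self.fuse = nn.Conv2d(channels * len(scales), channels, kernel_size=1)

    def forward(self, high_level):
        h, w = high_level.shape[-2:]
        # Resize every branch back to the high-level map size so the feature maps
        # of different scales can be concatenated and fused into a single scale.
        feats = [F.interpolate(branch(high_level), size=(h, w),
                               mode="bilinear", align_corners=False)
                 for branch in self.branches]
        return self.fuse(torch.cat(feats, dim=1))

out = MultiScaleExtraction()(torch.randn(1, 128, 16, 16))  # a 128-channel high-level map
print(out.shape)  # torch.Size([1, 128, 16, 16])
```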
And S113, inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion.
After the multi-scale feature map is obtained, in order to improve the accuracy of depth estimation of the image to be processed, the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map may be input to the feature fusion layer for fusion.
In some examples, in the process of deconvolving the multi-scale feature map, each deconvolution halves the number of channels of the feature map and doubles its spatial size.
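For example, one such deconvolution step could look as follows (a hypothetical sketch; the kernel size, stride and padding are assumptions chosen to halve the channels and double the spatial size):

```python
import torch
import torch.nn as nn

# Transposed convolution that halves the channel count and doubles the spatial size.
deconv = nn.ConvTranspose2d(in_channels=128, out_channels=64,
                            kernel_size=4, stride=2, padding=1)

x = torch.randn(1, 128, 16, 16)   # e.g. the scale-fused multi-scale feature map
print(deconv(x).shape)            # torch.Size([1, 64, 32, 32])
```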
In order to improve the accuracy of depth estimation, a multi-scale feature map, a low-level feature map, a middle-level feature map and a high-level feature map can be input into a feature fusion layer of a depth estimation neural network obtained through pre-training for feature fusion, so that a feature map after feature fusion is obtained.
By inputting the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer, the low-level feature map, the middle-level feature map and the high-level feature map can be fused when the feature fusion layer performs deconvolution, so that the feature information of the image to be processed can be kept in the feature map after deconvolution, and the accuracy of depth estimation can be improved when the depth estimation is performed through subsequent steps.
In some examples, the feature fusion layer 430 shown in fig. 4 may include: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer; in step S113, inputting the multi-scale feature map, the low-level feature map, the middle-level feature map, and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion, which may include:
and step A, inputting the multi-scale feature map and the high-level feature map into a first feature fusion sublayer to obtain a first feature map after feature fusion.
And step B, inputting the first feature map and the middle layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion.
And step C, inputting the second feature map and the low-level feature map after feature fusion into a third feature fusion sublayer to obtain a feature map after feature fusion.
In some examples, when the multi-scale feature map, the low-level feature map, the middle-level feature map, and the high-level feature map are input into a feature fusion layer of the depth estimation neural network obtained by pre-training, the multi-scale feature map and the high-level feature map may be input into a first feature fusion sublayer in the feature fusion layer 430 to obtain a first feature map after feature fusion, and then the first feature map after feature fusion and the middle-level feature map are input into a second feature fusion sublayer, so that a second feature map after feature fusion may be obtained, and finally the second feature map after feature fusion and the low-level feature map are input into a third feature fusion sublayer to obtain a feature map after feature fusion.
In some examples, after the multi-scale feature map is subjected to scale fusion, the scale-fused feature map and the high-level feature map may be input into the first feature fusion sublayer, so as to obtain the feature-fused first feature map.
In some examples, the number of channels and the size of the multi-scale feature map and the high-level feature map may be the same, the number of channels and the size of the first feature map and the middle-level feature map may be the same, and the number of channels and the size of the second feature map and the low-level feature map may be the same.
In some examples, the first feature fusion sublayer, the second feature fusion sublayer and the third feature fusion sublayer may adopt an additive fusion manner, that is, the multi-scale feature map and the high-level feature map are added, the feature-fused first feature map and the middle-level feature map are added, and the feature-fused second feature map and the low-level feature map are added, so that the amount of calculation of the feature fusion layer 430 may be reduced.
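A minimal sketch of such additive fusion sublayers (hypothetical PyTorch code; the channel counts and the placement of the deconvolution are assumptions chosen so that the summed feature maps match in channels and size, as described above):

```python
import torch
import torch.nn as nn

class FusionSublayer(nn.Module):
    """Adds two feature maps of equal shape, then deconvolves the sum so that the
    result matches the next (finer) skip feature map in channels and size."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.deconv = nn.ConvTranspose2d(in_channels, out_channels,
                                         kernel_size=4, stride=2, padding=1)

    def forward(self, x, skip):
        return self.deconv(x + skip)   # additive fusion instead of concatenation

# Assumed shapes: high 128ch/16x16, middle 64ch/32x32, low 32ch/64x64.
multi_scale = torch.randn(1, 128, 16, 16)
high = torch.randn(1, 128, 16, 16)
middle = torch.randn(1, 64, 32, 32)
low = torch.randn(1, 32, 64, 64)

first = FusionSublayer(128, 64)(multi_scale, high)   # first feature map, matches middle
second = FusionSublayer(64, 32)(first, middle)       # second feature map, matches low
fused = second + low                                  # third fusion sublayer
print(fused.shape)  # torch.Size([1, 32, 64, 64])
```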
And S114, inputting the feature map after feature fusion into a depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
After the feature map after feature fusion is obtained, the feature map after feature fusion may be input to the depth estimation layer 440 of the depth estimation neural network obtained by pre-training shown in fig. 4 for depth estimation, and the depth estimation layer 440 may output a depth distance value of each pixel point in the image to be processed.
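As a sketch, the depth estimation layer could be as simple as a small convolutional head that maps the fused feature map to a one-channel depth map (hypothetical code under assumed channel counts; not the original design):

```python
import torch
import torch.nn as nn

depth_head = nn.Sequential(
    nn.Conv2d(32, 32, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.Conv2d(32, 1, kernel_size=3, padding=1),   # one depth distance value per pixel
)

fused = torch.randn(1, 32, 64, 64)   # feature map after feature fusion
print(depth_head(fused).shape)       # torch.Size([1, 1, 64, 64])
```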
According to this embodiment of the application, when depth estimation is performed on the image to be processed, the image to be processed can be input into the feature extraction layer of the depth estimation neural network obtained through pre-training to obtain the low-level, middle-level and high-level feature maps of the image to be processed; the high-level feature map is then input into the multi-scale feature map extraction layer to extract the multi-scale feature map of the image to be processed; and after the multi-scale feature map is obtained, the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map can be input into the feature fusion layer for feature fusion, so that feature information of the image to be processed is not lost during deconvolution and is retained in the deconvolved feature map. Finally, the feature map after feature fusion is input into the depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed, so that the accuracy of the depth estimation of the image to be processed can be improved.
Example four
Corresponding to the foregoing method embodiment, an embodiment of the present application further provides a video generating apparatus, as shown in fig. 5, which is a schematic structural diagram of a second implementation of the video generating apparatus according to an exemplary embodiment, and in fig. 5, the depth distance value obtaining module 210 may include:
the feature extraction submodule 211 is configured to input the image to be processed to a feature extraction layer of a depth estimation neural network obtained through pre-training, and extract a low-layer feature map, a middle-layer feature map, and a high-layer feature map of the image to be processed, where the depth estimation neural network obtained through pre-training further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
the multi-scale feature map acquisition submodule 212 is used for inputting the high-level feature map into a multi-scale feature map extraction layer and extracting the multi-scale feature map of the image to be processed;
a feature fusion sub-module 213, configured to input the multi-scale feature map, the low-level feature map, the middle-level feature map, and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and the depth estimation submodule 214 is configured to input the feature map after feature fusion into a depth estimation layer, so as to obtain a depth distance value of each pixel point in the image to be processed.
In some examples, the video generation apparatus may further include a training sample selection module, where the training sample selection module includes:
the sample number counting submodule is used for counting the number of the samples to be selected in each sample set to be selected in the plurality of sample sets to be selected and the total number of the samples to be selected in the plurality of sample sets to be selected;
the selection weight calculation submodule is used for calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
the selection probability calculation submodule is used for calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and the training sample selection submodule is used for selecting the samples to be selected from the sample sets to be selected as the training samples according to the selection probability of each sample set to be selected.
In some examples, the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, where dimensions of feature maps extracted by each feature map extraction sublayer are different, and the feature map obtaining sub-module 212 is specifically configured to:
and simultaneously inputting the high-level feature map into a plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
In some examples, the feature fusion layer includes: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer; the feature fusion submodule 213 may include:
the first feature fusion sub-module is used for inputting the multi-scale feature map and the high-level feature map into a first feature fusion sub-layer to obtain a first feature map after feature fusion;
the second feature fusion submodule is used for inputting the first feature map and the middle-layer feature map after feature fusion into a second feature fusion sublayer to obtain a second feature map after feature fusion;
and the third feature fusion sub-module is used for inputting the second feature map and the low-layer feature map after feature fusion into a third feature fusion sub-layer to obtain a feature map after feature fusion.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
When performing depth estimation on an image to be processed, the video generation apparatus according to this embodiment of the application may input the image to be processed into the feature extraction layer of the depth estimation neural network obtained through pre-training to obtain a low-level feature map, a middle-level feature map and a high-level feature map of the image to be processed, then input the high-level feature map into the multi-scale feature map extraction layer to extract the multi-scale feature map of the image to be processed, and, after obtaining the multi-scale feature map, input the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion. Finally, the feature map after feature fusion is input into the depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed, so that the accuracy of the depth estimation of the image to be processed can be improved.
Fig. 6 is a schematic diagram illustrating a structure of a mobile terminal according to an exemplary embodiment. For example, the mobile terminal 600 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, and the like.
Referring to fig. 6, the mobile terminal 600 may include one or more of the following components: processing component 602, memory 604, power component 606, multimedia component 608, audio component 610, input/output (I/O) interface 612, sensor component 614, and communication component 616.
The processing component 602 generally controls overall operation of the mobile terminal 600, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing component 602 may include one or more processors 620 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 602 can include one or more modules that facilitate interaction between the processing component 602 and other components. For example, the processing component 602 can include a multimedia module to facilitate interaction between the multimedia component 608 and the processing component 602.
The memory 604 is configured to store various types of data to support operation at the device 600. Examples of such data include instructions for any application or method operating on the mobile terminal 600, contact data, phonebook data, messages, pictures, videos, and the like. The memory 604 may be implemented by any type or combination of volatile or non-volatile memory devices such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
A power supply component 606 provides power to the various components of the mobile terminal 600. The power components 606 may include a power management system, one or more power sources, and other components associated with generating, managing, and distributing power for the mobile terminal 600.
The multimedia component 608 includes a screen that provides an output interface between the mobile terminal 600 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 608 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 600 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 610 is configured to output and/or input audio signals. For example, the audio component 610 may include a Microphone (MIC) configured to receive external audio signals when the mobile terminal 600 is in an operational mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signal may further be stored in the memory 604 or transmitted via the communication component 616. In some embodiments, audio component 610 further includes a speaker for outputting audio signals.
The I/O interface 612 provides an interface between the processing component 602 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor component 614 includes one or more sensors for providing various aspects of state assessment for the mobile terminal 600. For example, the sensor component 614 may detect an open/closed state of the device 600, the relative positioning of components, such as a display and keypad of the mobile terminal 600, the sensor component 614 may detect a change in the position of the mobile terminal 600 or a component of the mobile terminal 600, the presence or absence of user contact with the mobile terminal 600, the orientation or acceleration/deceleration of the mobile terminal 600, and a change in the temperature of the mobile terminal 600. The sensor assembly 614 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 614 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 614 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 616 is configured to facilitate communications between the mobile terminal 600 and other devices in a wired or wireless manner. The mobile terminal 600 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 616 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 616 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the mobile terminal 600 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing any of the video generation methods described above.
In an exemplary embodiment, a non-transitory computer readable storage medium comprising instructions, such as the memory 604 comprising instructions, executable by the processor 620 of the mobile terminal 600 to perform the above-described method is also provided. For example, the non-transitory computer readable storage medium may be a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
According to the mobile terminal provided by this embodiment of the application, when a video is generated based on an image to be processed, the depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and the pixel coordinate value of each pixel point in the image to be processed can be obtained first; a camera coordinate value of each pixel point is calculated according to the pixel coordinate value of the pixel point and its depth distance value; a plurality of camera coordinate change values preset for the image to be processed are then acquired; for each camera coordinate change value, an image changed according to that camera coordinate change value is reconstructed from the camera coordinate values of the pixel points; and finally a video corresponding to the plurality of images is generated from the plurality of images reconstructed according to the plurality of camera coordinate change values. Because the image to be processed is processed by combining the depth distance value of each pixel point with each camera coordinate change value, the pixel coordinates of each pixel point in a changed image are determined by the depth distance value of the pixel point and the camera coordinate change value corresponding to that image. Therefore, the generated video has a three-dimensional effect, the appreciation of the generated video is improved, and diversified application of the image to be processed can be realized.
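The overall pipeline can be illustrated with the following NumPy sketch (purely illustrative; the pinhole intrinsic matrix K and the rigid transform (R, t) used to express a camera coordinate change are assumptions, not the formulas of the original disclosure):

```python
import numpy as np

def pixels_to_camera(depth, K):
    # Back-project every pixel to camera coordinates using its depth distance value.
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T  # homogeneous pixels
    return (np.linalg.inv(K) @ pix) * depth.reshape(1, -1)             # (3, h*w) camera coords

def changed_pixel_coords(cam, K, R, t):
    # Apply a camera coordinate change (R, t) and project back to pixel coordinates.
    moved = R @ cam + t.reshape(3, 1)
    proj = K @ moved
    return proj[:2] / proj[2:3]                                        # changed pixel coordinates

# Each (R, t) in the pre-set, ordered list of camera coordinate changes yields one
# reconstructed image; rendering the reconstructed images in order gives the video.
```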
Fig. 7 is a schematic diagram illustrating a configuration of a server according to an example embodiment. Referring to fig. 7, server 700 includes a processing component 722 that further includes one or more processors and memory resources, represented by memory 732, for storing instructions, such as applications, that are executable by processing component 722. The application programs stored in memory 732 may include one or more modules that each correspond to a set of instructions. Further, the processing component 722 is configured to execute instructions to perform any of the video generation methods described above.
The server 700 may also include a power component 726 configured for power management of the server 700, a wired or wireless network interface 750 configured for connecting the server 700 to a network, and an input/output (I/O) interface 758. The server 700 may operate based on an operating system stored in memory 732, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
When a video is generated based on an image to be processed, the server provided by this embodiment of the application may first obtain the depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and the pixel coordinate value of each pixel point in the image to be processed; calculate a camera coordinate value of each pixel point according to the pixel coordinate value of the pixel point and its depth distance value; then acquire a plurality of camera coordinate change values preset for the image to be processed; reconstruct, for each camera coordinate change value, an image changed according to that camera coordinate change value from the camera coordinate values of the pixel points; and finally generate a video corresponding to the plurality of images from the plurality of images reconstructed according to the plurality of camera coordinate change values. Because the image to be processed is processed by combining the depth distance value of each pixel point with each camera coordinate change value, the pixel coordinates of each pixel point in a changed image are determined by the depth distance value of the pixel point and the camera coordinate change value corresponding to that image. Therefore, the generated video has a three-dimensional effect, the appreciation of the generated video is improved, and diversified application of the image to be processed can be realized.
Embodiments of the present application also provide a program product containing instructions, which when run on an electronic device, cause the electronic device to perform all or part of the steps of the above method.
Embodiments of the present application also provide a computer program, which when run on an electronic device, causes the electronic device to perform all or part of the steps of the above method.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.
The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims (10)

1. A method of video generation, the method comprising:
acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
calculating a camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
acquiring a plurality of camera coordinate change values which are preset for the image to be processed;
calculating the changed pixel coordinate value of each pixel point according to the camera coordinate change value and the camera coordinate value of each pixel point, and reconstructing an image changed according to the camera coordinate change value according to the changed pixel coordinate value of each pixel point;
generating a video corresponding to the plurality of images from the plurality of images reconstructed according to the plurality of camera coordinate change values;
the acquiring a plurality of camera coordinate change values set in advance for the image to be processed includes: acquiring a plurality of camera coordinate change values which are set for the image to be processed in advance and have a sequence;
the generating a video corresponding to the plurality of images based on the plurality of images changed according to the plurality of camera coordinate change values includes: sequencing the plurality of images according to the sequence of the coordinate change values of the plurality of cameras to obtain a plurality of sequenced images; generating a video corresponding to the plurality of images according to the plurality of images after sequencing;
before the obtaining of the depth distance value of each pixel point obtained by performing depth estimation on the image to be processed and the pixel coordinate value of each pixel point in the image to be processed, the method further includes:
inputting the image to be processed to a feature extraction layer of a depth estimation neural network obtained by pre-training, and extracting a low-layer feature map, a middle-layer feature map and a high-layer feature map of the image to be processed, wherein the depth estimation neural network obtained by pre-training further comprises: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
inputting the high-level feature map into the multi-scale feature map extraction layer, and extracting the multi-scale feature map of the image to be processed;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and inputting the feature map after feature fusion into the depth estimation layer to obtain the depth distance value of each pixel point in the image to be processed.
2. The method of claim 1, wherein the pre-trained depth estimation neural network is trained from pre-selected training samples, and pre-selecting the training samples comprises:
counting the number of samples to be selected in each sample set to be selected in a plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected;
calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all samples to be selected;
calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and selecting the samples to be selected from the sample set to be selected as training samples according to the selection probability of each sample set to be selected.
3. The method according to claim 1, wherein the multi-scale feature map extraction layer comprises a plurality of feature map extraction sublayers, the feature maps extracted by each feature map extraction sublayer are different in scale, and the inputting the high-level feature map into the multi-scale feature map extraction layer to extract the multi-scale feature map of the image to be processed comprises:
and simultaneously inputting the high-level feature map into the plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
4. The method of claim 1, wherein the feature fusion layer comprises: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer;
inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion, wherein the feature map comprises:
inputting the multi-scale feature map and the high-level feature map into the first feature fusion sublayer to obtain a feature-fused first feature map;
inputting the first feature map and the middle-layer feature map after feature fusion into the second feature fusion sublayer to obtain a second feature map after feature fusion;
and inputting the second feature map after feature fusion and the low-level feature map into the third feature fusion sublayer to obtain the feature map after feature fusion.
5. A video generation apparatus, characterized in that the apparatus comprises:
the depth distance value acquisition module is used for acquiring a depth distance value of each pixel point obtained by performing depth estimation on an image to be processed and a pixel coordinate value of each pixel point in the image to be processed;
the camera coordinate calculation module is used for calculating the camera coordinate value of each pixel point according to the pixel coordinate value of each pixel point and the depth distance value of the pixel point;
the camera coordinate change value acquisition module is used for acquiring a plurality of camera coordinate change values which are preset for the image to be processed;
the image reconstruction module is used for calculating the changed pixel coordinate value of each pixel point according to the camera coordinate change value and the camera coordinate value of each pixel point, and reconstructing an image changed according to the camera coordinate change value according to the changed pixel coordinate value of each pixel point;
a video generation module for generating a video corresponding to the plurality of images based on the plurality of images reconstructed according to the plurality of camera coordinate change values;
the camera coordinate change value acquisition module is specifically configured to: acquiring a plurality of camera coordinate change values which are set for the image to be processed in advance and have a sequence;
a video generation module comprising: the image sorting submodule is used for sorting the images according to the sequence of the coordinate change values of the cameras to obtain a plurality of sorted images; the video generation sub-module is used for generating videos corresponding to the sequenced images according to the sequenced images;
the depth distance value obtaining module includes:
a feature extraction submodule, configured to input the image to be processed to a feature extraction layer of a depth estimation neural network obtained through pre-training, and extract a low-level feature map, a middle-level feature map, and a high-level feature map of the image to be processed, where the depth estimation neural network obtained through pre-training further includes: a multi-scale feature map extraction layer, a feature fusion layer and a depth estimation layer;
the multi-scale feature map acquisition submodule is used for inputting the high-level feature map into the multi-scale feature map extraction layer and extracting the multi-scale feature map of the image to be processed;
the feature fusion submodule is used for inputting the multi-scale feature map, the low-level feature map, the middle-level feature map and the high-level feature map into the feature fusion layer for feature fusion to obtain a feature map after feature fusion;
and the depth estimation submodule is used for inputting the feature map after the feature fusion into the depth estimation layer to obtain a depth distance value of each pixel point in the image to be processed.
6. The apparatus of claim 5, wherein the pre-trained depth estimation neural network is trained from pre-selected training samples; the apparatus further comprises a training sample selection module, the training sample selection module comprising:
the sample number counting submodule is used for counting the number of samples to be selected in each sample set to be selected in a plurality of sample sets to be selected and the total number of all samples to be selected in the plurality of sample sets to be selected;
the selection weight calculation submodule is used for calculating the selection weight of each sample set to be selected according to the number of the samples to be selected in each sample set to be selected and the total number of all the samples to be selected;
the selection probability calculation submodule is used for calculating the selection probability of each sample set to be selected according to the selection weight of each sample set to be selected and the number of samples to be selected in each sample set to be selected;
and the training sample selection submodule is used for selecting the samples to be selected from the sample set to be selected as the training samples according to the selection probability of each sample set to be selected.
7. The apparatus according to claim 5, wherein the multi-scale feature map extraction layer includes a plurality of feature map extraction sublayers, the feature maps extracted by each of the feature map extraction sublayers have different scales, and the multi-scale feature map acquisition submodule is specifically configured to:
and simultaneously inputting the high-level feature map into the plurality of feature map extraction sublayers to obtain a feature map with a scale corresponding to each feature map extraction sublayer.
8. The apparatus of claim 5, wherein the feature fusion layer comprises: a first feature fusion sublayer, a second feature fusion sublayer, and a third feature fusion sublayer; the feature fusion submodule includes:
the first feature fusion unit is used for inputting the multi-scale feature map and the high-level feature map into the first feature fusion sublayer to obtain a feature-fused first feature map;
the second feature fusion unit is used for inputting the first feature map and the middle-layer feature map after feature fusion into the second feature fusion sublayer to obtain a second feature map after feature fusion;
and the third feature fusion unit is used for inputting the second feature map after feature fusion and the low-layer feature map into the third feature fusion sublayer to obtain the feature map after feature fusion.
9. An electronic device comprising a processor and a memory for storing processor-executable instructions;
wherein the processor is configured to execute the executable instructions stored on the memory to perform the method of any one of claims 1-4.
10. A non-transitory computer readable storage medium having instructions stored thereon that, when executed by a processor of a mobile terminal, enable the mobile terminal to perform the method of any of claims 1-4.
CN201910190595.4A 2019-03-13 2019-03-13 Video generation method and device, electronic equipment and storage medium Active CN109948689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910190595.4A CN109948689B (en) 2019-03-13 2019-03-13 Video generation method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910190595.4A CN109948689B (en) 2019-03-13 2019-03-13 Video generation method and device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN109948689A CN109948689A (en) 2019-06-28
CN109948689B true CN109948689B (en) 2022-06-03

Family

ID=67009708

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910190595.4A Active CN109948689B (en) 2019-03-13 2019-03-13 Video generation method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN109948689B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101454837A (en) * 2006-01-06 2009-06-10 惠普开发有限公司 Converting a still image in a slide show to a plurality of video frame images
CN107018400A (en) * 2017-04-07 2017-08-04 华中科技大学 It is a kind of by 2D Video Quality Metrics into 3D videos method
CN107578436A (en) * 2017-08-02 2018-01-12 南京邮电大学 A kind of monocular image depth estimation method based on full convolutional neural networks FCN
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN108765479A (en) * 2018-04-04 2018-11-06 上海工程技术大学 Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN108932734A (en) * 2018-05-23 2018-12-04 浙江商汤科技开发有限公司 Depth recovery method and device, the computer equipment of monocular image
CN108986161A (en) * 2018-06-19 2018-12-11 亮风台(上海)信息科技有限公司 A kind of three dimensional space coordinate estimation method, device, terminal and storage medium
CN109410261A (en) * 2018-10-08 2019-03-01 浙江科技学院 Monocular image depth estimation method based on pyramid pond module

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102421007B (en) * 2011-11-28 2013-09-04 浙江大学 Image quality evaluating method based on multi-scale structure similarity weighted aggregate
KR102128319B1 (en) * 2014-10-24 2020-07-09 에스케이 텔레콤주식회사 Method and Apparatus for Playing Video by Using Pan-Tilt-Zoom Camera
CN108009563B (en) * 2017-10-25 2020-06-09 北京达佳互联信息技术有限公司 Image processing method and device and terminal
CN107945282B (en) * 2017-12-05 2021-01-29 洛阳中科信息产业研究院(中科院计算技术研究所洛阳分所) Rapid multi-view three-dimensional synthesis and display method and device based on countermeasure network
CN108154518B (en) * 2017-12-11 2020-09-08 广州华多网络科技有限公司 Image processing method and device, storage medium and electronic equipment
CN108594997B (en) * 2018-04-16 2020-04-21 腾讯科技(深圳)有限公司 Gesture skeleton construction method, device, equipment and storage medium
CN109377511B (en) * 2018-08-30 2021-09-03 西安电子科技大学 Moving target tracking method based on sample combination and depth detection network
CN109460734B (en) * 2018-11-08 2020-07-31 山东大学 Video behavior identification method and system based on hierarchical dynamic depth projection difference image representation
CN109377530B (en) * 2018-11-30 2021-07-27 天津大学 Binocular depth estimation method based on depth neural network

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101454837A (en) * 2006-01-06 2009-06-10 惠普开发有限公司 Converting a still image in a slide show to a plurality of video frame images
CN107018400A (en) * 2017-04-07 2017-08-04 华中科技大学 It is a kind of by 2D Video Quality Metrics into 3D videos method
US10089556B1 (en) * 2017-06-12 2018-10-02 Konica Minolta Laboratory U.S.A., Inc. Self-attention deep neural network for action recognition in surveillance videos
CN107578436A (en) * 2017-08-02 2018-01-12 南京邮电大学 A kind of monocular image depth estimation method based on full convolutional neural networks FCN
CN108765479A (en) * 2018-04-04 2018-11-06 上海工程技术大学 Using deep learning to monocular view estimation of Depth optimization method in video sequence
CN108932734A (en) * 2018-05-23 2018-12-04 浙江商汤科技开发有限公司 Depth recovery method and device, the computer equipment of monocular image
CN108986161A (en) * 2018-06-19 2018-12-11 亮风台(上海)信息科技有限公司 A kind of three dimensional space coordinate estimation method, device, terminal and storage medium
CN109410261A (en) * 2018-10-08 2019-03-01 浙江科技学院 Monocular image depth estimation method based on pyramid pond module

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Revisiting Single Image Depth Estimation: Toward Higher Resolution Maps With Accurate Object Boundaries; Junjie Hu et al.; 2019 IEEE Winter Conference on Applications of Computer Vision; 20190111; 1043-1051 *
Monocular Image Depth Estimation Based on DenseNet; He Tongneng et al.; Computer Measurement & Control; 20190228; Vol. 27, No. 2; 233-236 *

Also Published As

Publication number Publication date
CN109948689A (en) 2019-06-28

Similar Documents

Publication Publication Date Title
CN108764091B (en) Living body detection method and apparatus, electronic device, and storage medium
CN107798669B (en) Image defogging method and device and computer readable storage medium
CN106651955B (en) Method and device for positioning target object in picture
CN109889724B (en) Image blurring method and device, electronic equipment and readable storage medium
EP3179408B1 (en) Picture processing method and apparatus, computer program and recording medium
EP3125158B1 (en) Method and device for displaying images
EP2977956B1 (en) Method, apparatus and device for segmenting an image
CN109379572B (en) Image conversion method, image conversion device, electronic equipment and storage medium
US10534972B2 (en) Image processing method, device and medium
CN106331504B (en) Shooting method and device
CN110853095B (en) Camera positioning method and device, electronic equipment and storage medium
CN109784164B (en) Foreground identification method and device, electronic equipment and storage medium
CN109509195B (en) Foreground processing method and device, electronic equipment and storage medium
CN113194254A (en) Image shooting method and device, electronic equipment and storage medium
CN112634160A (en) Photographing method and device, terminal and storage medium
CN110415258B (en) Image processing method and device, electronic equipment and storage medium
CN110796012B (en) Image processing method and device, electronic equipment and readable storage medium
CN110941727A (en) Resource recommendation method and device, electronic equipment and storage medium
CN107609513B (en) Video type determination method and device
CN107730443B (en) Image processing method and device and user equipment
CN109948689B (en) Video generation method and device, electronic equipment and storage medium
CN111507131A (en) Living body detection method and apparatus, electronic device, and storage medium
CN111586296B (en) Image capturing method, image capturing apparatus, and storage medium
CN113870195A (en) Target map detection model training and map detection method and device
CN112866505B (en) Image processing method, device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant