CN112651881B - Image synthesizing method, apparatus, device, storage medium, and program product - Google Patents

Image synthesizing method, apparatus, device, storage medium, and program product

Info

Publication number
CN112651881B
Authority
CN
China
Prior art keywords
image
target object
dimensional model
dimensional
view angle
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011619097.6A
Other languages
Chinese (zh)
Other versions
CN112651881A (en)
Inventor
卢飞翔
刘宗岱
张良俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu USA LLC
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Baidu USA LLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd, Baidu USA LLC filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202011619097.6A priority Critical patent/CN112651881B/en
Publication of CN112651881A publication Critical patent/CN112651881A/en
Application granted granted Critical
Publication of CN112651881B publication Critical patent/CN112651881B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 3/00 Geometric image transformations in the plane of the image
    • G06T 3/06 Topological mapping of higher dimensional structures onto lower dimensional surfaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 15/00 3D [Three Dimensional] image rendering
    • G06T 15/04 Texture mapping
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 17/00 Three dimensional [3D] modelling, e.g. data description of 3D objects
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/52 Surveillance or monitoring of activities, e.g. for recognising suspicious objects
    • G06V 20/54 Surveillance or monitoring of activities, e.g. for recognising suspicious objects of traffic, e.g. cars on the road, trains or boats
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 20/00 Scenes; Scene-specific elements
    • G06V 20/50 Context or environment of the image
    • G06V 20/56 Context or environment of the image exterior to a vehicle by using sensors mounted on the vehicle
    • G06V 20/58 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads
    • G06V 20/584 Recognition of moving objects or obstacles, e.g. vehicles or pedestrians; Recognition of traffic objects, e.g. traffic signs, traffic lights or roads of vehicle lights or traffic lights
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 2201/00 Indexing scheme relating to image or video recognition or understanding
    • G06V 2201/08 Detecting or categorising vehicles

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Software Systems (AREA)
  • Computer Graphics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Geometry (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)
  • Processing Or Creating Images (AREA)

Abstract

The present disclosure provides an image synthesis method, apparatus, device, storage medium, and program product, and relates to the technical field of image processing. The specific implementation scheme is as follows: performing texture complement processing on an image comprising a first view angle of a first target object to obtain a texture map of the first target object; generating a three-dimensional model of the first target object by using the texture map; projecting the three-dimensional model of the first target object according to azimuth information of the scene image of the second view angle to obtain a two-dimensional image of the first target object; and superimposing the two-dimensional image of the first target object onto the scene image to obtain a composite image of the second view angle. Embodiments of the present disclosure can significantly reduce the cost of data synthesis, provide a large amount of training data for training deep neural networks, and greatly reduce the consumption of manpower, material resources, and financial resources.

Description

Image synthesizing method, apparatus, device, storage medium, and program product
Technical Field
The present disclosure relates to the field of computer technologies, and in particular, to the field of image processing technologies.
Background
Machine learning model training typically requires a large number of annotated multi-view images as a training set. Taking the application scenario of vehicle-road cooperation as an example, vision sensors can be mounted on the tops of vehicles and on utility poles and traffic lights at intersections, so that vehicles on the road can be subjected to multi-view detection, segmentation, and pose estimation. Vehicle-road cooperation is an important way to achieve automatic driving. It can effectively address the problem of vehicle occlusion and greatly improve the safety of automatic driving technology. However, the conventional approach requires a large number of annotated multi-view images as a training set before network model training can be performed. In traffic scenes, such multi-view training data is difficult to obtain and difficult to annotate.
Disclosure of Invention
The present disclosure provides an image synthesizing method, apparatus, device, storage medium, and program product.
According to an aspect of the present disclosure, there is provided an image synthesizing method including:
performing texture complement processing on an image comprising a first view angle of a first target object to obtain a texture map of the first target object;
generating a three-dimensional model of the first target object by using the texture map;
according to azimuth information of the scene image of the second view angle, projecting the three-dimensional model of the first target object to obtain a two-dimensional image of the first target object;
and superposing the two-dimensional image of the first target object into the scene image to obtain a composite image of the second visual angle.
According to another aspect of the present disclosure, there is provided an image synthesizing apparatus including:
the processing unit is used for carrying out texture complement processing on the image comprising the first view angle of the first target object to obtain a texture map of the first target object;
a generation unit for generating a three-dimensional model of the first target object using the texture map;
the projection unit is used for projecting the three-dimensional model of the first target object according to the azimuth information of the scene image of the second view angle to obtain a two-dimensional image of the first target object;
and the superposition unit is used for superposing the two-dimensional image of the first target object into the scene image to obtain a composite image of the second visual angle.
According to still another aspect of the present disclosure, there is provided an electronic apparatus including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the methods provided by any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method provided by any one of the embodiments of the present disclosure.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements the method provided by any of the embodiments of the present disclosure.
Embodiments of the above application have the following advantages or benefits: the cost of data synthesis can be significantly reduced, a large amount of training data can be provided for training deep neural networks, and the consumption of manpower, material resources, and financial resources can be greatly reduced. Taking a vehicle as the target object as an example, embodiments of the present disclosure can provide a large number of annotated multi-view images for network model training, improve the accuracy of vehicle-road cooperation tasks, improve the performance of environment perception, and effectively improve the safety of automatic driving vehicles.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a flow chart of an image compositing method according to an embodiment of the disclosure;
FIG. 2 is a flow chart of texture completion for an image synthesis method according to another embodiment of the present disclosure;
FIG. 3 is a flow chart of three-dimensional model reconstruction for an image synthesis method according to another embodiment of the present disclosure;
FIG. 4 is a flow chart of image projection of an image compositing method according to another embodiment of the disclosure;
FIG. 5 is a flow chart of image restoration of an image synthesis method according to another embodiment of the present disclosure;
FIG. 6 is a flow chart of an image compositing method according to another embodiment of the disclosure;
fig. 7 is a schematic view of data diversity effect of an image synthesizing method according to another embodiment of the present disclosure;
FIG. 8 is a schematic diagram of an image compositing apparatus according to an embodiment of the disclosure;
fig. 9 is a schematic diagram of an image synthesizing apparatus according to another embodiment of the present disclosure;
fig. 10 is a block diagram of an electronic device for implementing an image compositing method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Machine learning model training typically requires a large number of annotated multi-view images as a training set. Taking the application scenario of vehicle-road cooperation as an example, vehicle-road cooperation can be used to detect vehicles on the road from multiple view angles, so that the problem of vehicle occlusion is effectively addressed and the safety of automatic driving technology is greatly improved. However, the conventional approach requires a large number of annotated multi-view images as a training set before network model training can be performed. In traffic scenes, such multi-view training data is difficult to obtain and difficult to annotate.
Taking an application scene of vehicle-road cooperation as an example, the method for generating the multi-angle image in the related technology mainly comprises the following technical schemes:
(1) Three-dimensional model rendering. This approach requires the construction of a large number of three-dimensional models of vehicles and cities. Data such as the texture maps, scene illumination, and rendering parameters of the models need to be adjusted, and image rendering is performed using rendering software such as 3ds Max. This scheme is costly and inefficient, the effect is difficult to guarantee, and it is difficult to use the resulting image data for network training.
(2) Predicting the image at a new view angle using pixel-point correspondences learned from multi-view images. This approach requires a large number of annotated multi-view images as a training set. In traffic scenes, such training data is difficult to obtain and difficult to annotate.
(3) Image synthesis by means of a generative adversarial network (GAN). This scheme requires two or more paired images (image pairs) as training data, which are difficult to acquire. In addition, GANs are difficult to train and their results are difficult to control. The greatest drawback of this scheme is that corresponding annotation results cannot be generated automatically.
To address this, the present disclosure provides a multi-view image synthesis method oriented to the vehicle-road cooperation task. Fig. 1 is a flowchart of an image synthesizing method according to an embodiment of the present disclosure. Referring to fig. 1, the image synthesizing method includes:
step S110, performing texture complement processing on an image comprising a first view angle of a first target object to obtain a texture map of the first target object;
step S120, generating a three-dimensional model of the first target object by using the texture map;
step S130, according to azimuth information of the scene image of the second view angle, projecting the three-dimensional model of the first target object to obtain a two-dimensional image of the first target object;
step S140, the two-dimensional image of the first target object is superimposed on the scene image, so as to obtain a composite image of the second view angle.
Wherein, in step S110 and step S120, three-dimensional model reconstruction of the first target object is performed, and in step S130 and step S140, projection of the three-dimensional model of the first target object is superimposed into the scene image, resulting in a composite image of a new view angle.
In the task of reconstructing a three-dimensional model of the first target object, it is often necessary to reconstruct the texture map of the three-dimensional model from a monocular image. Because a monocular image is captured from a single viewpoint, a complete texture map of the first target object cannot be obtained from it. Taking a vehicle as the first target object as an example, if the vehicle is photographed from the front, its tail lights cannot be captured. In addition, because of the single shooting view angle, the image textures of some parts of the first target object may be incomplete. Therefore, the missing parts of the first target object need to be complemented in order to reconstruct the three-dimensional model of the first target object.
In step S110, a captured image including a first perspective of a first target object may be first acquired. For example, the image of the first angle of view may be a front view taken from the front. The texture complement processing can be performed on the image including the first view angle of the first target object by utilizing a pre-trained deep neural network, so as to obtain a texture map of the first target object. In step S120, three-dimensional model reconstruction is performed using the texture map obtained in step S110, and a three-dimensional model of the first target object is generated.
In step S130, the photographed scene image of the second view angle and the azimuth information of the scene image may first be acquired. For example, the scene image of the second view angle may be a top view of a road scene taken from a high position looking down. In one example, the azimuth information of the scene image may be obtained from camera parameters. The azimuth information may include three-dimensional geometric information of the road scene, such as plane equations and normal directions. According to the azimuth information, a projection operation can be performed on the three-dimensional model of the first target object to obtain a two-dimensional image of the first target object. For example, the three-dimensional model of the first target object may be projected onto the plane determined by the plane equation of the road scene, resulting in a two-dimensional image of the first target object. Through the projection operation, the placement of the three-dimensional model in the road scene is made consistent with the three-dimensional geometric information of the road scene.
In step S140, the two-dimensional image of the first target object obtained in step S130 is superimposed on the scene image, resulting in a composite image at the second view angle.
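A minimal sketch of this superposition step, assuming the projection in step S130 also yields a binary mask of the first target object and that the object fits inside the scene image at the chosen position; the helper below and its names are illustrative rather than the exact implementation of the disclosure.

```python
import numpy as np

def composite_object(scene_img, object_img, object_mask, top_left):
    """Paste the rendered 2D object into the scene image at a (row, col)
    position, keeping scene pixels wherever the object mask is zero."""
    out = scene_img.copy()
    h, w = object_img.shape[:2]
    r, c = top_left
    region = out[r:r + h, c:c + w]          # view into 'out'
    visible = object_mask.astype(bool)
    region[visible] = object_img[visible]   # copy only the object pixels
    return out
```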
In the application scene of the vehicle-road coordination, most of the images shot by various visual sensors may be images of a first visual angle, and the number of images of a second visual angle is smaller. Embodiments of the present disclosure may synthesize an image of a second perspective using an image of a first perspective of a first target object and a scene image of the second perspective. By using the embodiment of the invention to generate the image, the cost of data synthesis can be obviously reduced, a large amount of training data is provided for training the deep neural network, and the consumption of manpower, material resources and financial resources is greatly reduced. Taking a vehicle as a target object as an example, the embodiment of the disclosure can provide a plurality of marked multi-view images for network model training, can improve the accuracy of vehicle-road cooperative tasks, improve the performance of environmental perception, and can effectively improve the safety of an automatic driving vehicle.
Fig. 2 is a flow chart of texture completion for an image synthesis method according to another embodiment of the present disclosure. The image synthesizing method of this embodiment may include the steps of the above-described embodiments. In addition, as shown in fig. 2, in an embodiment, step S110 in fig. 1, performing texture complement processing on an image including a first view angle of a first target object to obtain a texture map of the first target object may specifically include:
step S210, dividing an image of a first view angle comprising a first target object to obtain a divided image comprising at least one component of the first target object;
step S220, marking the pose of the first target object in the image comprising the first view angle of the first target object to obtain pose marking information;
step S230, projecting the segmented image according to pose labeling information to obtain a to-be-processed image of the first target object;
and step S240, performing texture complement processing on the image to be processed by using the deep neural network to obtain a texture map of the first target object.
In step S210, an image including a first view angle of a first target object is first segmented to obtain a segmented image including at least one component of the first target object.
Taking a vehicle as a first target object as an example, a model object to be reconstructed is divided into a plurality of parts. For example, the vehicle may be divided into a plurality of parts such as 4 wheels, a front cover, a rear cover, and a tail lamp. In one example, if the captured image of the vehicle is taken from the front, there may be only a front cover and 2 front wheels in the image, and no rear cover and tail lights. That is, some parts may be visible in the captured image and some parts may not be visible in the captured image. In addition, due to the limitation of the shooting angle, the image textures of the front cover and the 2 front wheels in the image may also be incomplete. The captured image of the vehicle may be segmented to obtain a segmented image that includes the various components in the image.
In one example, the segmented image may be taken as a to-be-processed image of at least one component of the first target object.
In another example, in step S220, the pose of the first target object may also be labeled in the image including the first target object to obtain pose labeling information. Even for the same first target object, the pose presented in the image, and the appearance of its individual components, may differ depending on the shooting angle. Therefore, the pose of the first target object can be identified using a recognition algorithm to obtain the pose labeling information. The pose labeling information can also be obtained by manual labeling.
In one embodiment, the pose annotation information may include a six-degree-of-freedom spatial pose. The six degrees of freedom of an object in space include translation along the three orthogonal coordinate axes x, y, and z, and rotation about these three axes. Thus, the position and orientation of the object can be determined using the six-degree-of-freedom spatial pose.
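A six-degree-of-freedom spatial pose of this kind is commonly encoded as a 4x4 rigid transform built from three rotation angles and three translations. The sketch below shows one such encoding; the Euler-angle convention and function name are illustrative assumptions.

```python
import numpy as np
from scipy.spatial.transform import Rotation

def pose_to_matrix(rx, ry, rz, tx, ty, tz):
    """Build a 4x4 rigid transform from three rotations (radians, about the
    x, y and z axes) and three translations - one encoding of a 6-DOF pose."""
    T = np.eye(4)
    T[:3, :3] = Rotation.from_euler("xyz", [rx, ry, rz]).as_matrix()
    T[:3, 3] = [tx, ty, tz]
    return T
```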
In step S230, the segmented image is projected according to the pose labeling information, and the image projection algorithm may be used to perform a projection operation on the segmented image, so as to correct the deviation of the segmented image caused by different poses of the first target object, and obtain the to-be-processed image of at least one component of the first target object after projection.
In step S240, a texture complement process is performed on the image to be processed by using the pre-trained deep neural network, so as to obtain a texture map of the first target object. In one example, a texture completion process may be performed on an image to be processed using a graph neural network model. Specifically, the data structure of the association graphs of all the components of the first target object may be constructed in advance. In the data structure of the association graph, each node element in the association graph is used to represent a component of the first target object. In an example where a vehicle is the first target object, n nodes may be included in the association graph, each node representing a component of the vehicle, such as a wheel, a front cover, a tail light, and the like. When the image including the first target object is segmented in step S210, the image segmentation is also performed according to the nodes defined in the data structure of the association graph. Each part in the segmented image to be processed can find the node corresponding to the part in the association graph.
For a component visible in the captured image comprising the first target object, the node corresponding to the component can be found in the association graph. The images of each component in the image to be processed can be assigned to the corresponding node elements in the association graph. For a component that is not visible in the captured image that includes the first target object, that is, a component that is not captured in the image, the node corresponding to the component is assigned as a null node in the association graph. And finally, constructing a correlation diagram of all the components of the first target object by using node elements corresponding to all the assigned components.
The constructed association graph of the first target object is input into the graph neural network model. In the input association graph, the nodes represent images of the components of the first target object; the image textures of some components may be incomplete, and the image textures of other components may be completely absent. The graph neural network model completes the incomplete or completely absent image textures in the input association graph and outputs texture-completed images of all components of the first target object.
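The disclosure does not specify the graph neural network architecture. The sketch below only illustrates the association-graph data structure described above (one node per component, with zero-valued "empty" nodes for components not captured in the image) together with a generic graph-convolution step; the part list, adjacency, and feature size are illustrative assumptions.

```python
import torch
import torch.nn as nn

PARTS = ["front_cover", "rear_cover", "tail_light",
         "wheel_fl", "wheel_fr", "wheel_rl", "wheel_rr"]   # assumed part list
EDGES = [(0, 2), (1, 2), (0, 3), (0, 4), (1, 5), (1, 6)]   # assumed adjacency

def build_association_graph(part_textures, feat_dim=1024):
    """One node per component: visible parts get their flattened texture patch
    as the node feature, unphotographed parts get a zero ('empty') node."""
    x = torch.zeros(len(PARTS), feat_dim)
    for i, name in enumerate(PARTS):
        patch = part_textures.get(name)          # tensor or None
        if patch is not None:
            flat = patch.flatten()
            n = min(feat_dim, flat.numel())
            x[i, :n] = flat[:n]
    adj = torch.eye(len(PARTS))
    for i, j in EDGES:
        adj[i, j] = adj[j, i] = 1.0
    return x, adj

class GraphConvLayer(nn.Module):
    """A generic graph-convolution step: average neighbour features through
    the row-normalised adjacency matrix, then apply a learned linear map."""
    def __init__(self, feat_dim):
        super().__init__()
        self.lin = nn.Linear(feat_dim, feat_dim)

    def forward(self, x, adj):
        adj = adj / adj.sum(dim=1, keepdim=True)
        return torch.relu(self.lin(adj @ x))
```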
According to the embodiments of the present disclosure, a high-quality, complete three-dimensional texture map can be generated for the first target object, the cost of three-dimensional texture reconstruction can be significantly reduced, and all-round simulated rendering of the target object can be realized. Taking a vehicle as the first target object as an example, three-dimensional model reconstruction of vehicles can greatly enrich automatic driving simulation databases and provide abundant resources for perception system training.
Fig. 3 is a flow chart of three-dimensional model reconstruction for an image synthesis method according to another embodiment of the present disclosure. The image synthesizing method of this embodiment may include the steps of the above-described embodiments. In addition, as shown in fig. 3, in an embodiment, step S120 in fig. 1, generating a three-dimensional model of the first target object using the texture map may specifically include:
step S310, obtaining deformation parameters of a deformable template of a first target object, wherein the deformation parameters correspond to the appearance shape of the first target object;
step S320, a three-dimensional model of the first target object is generated according to the deformation parameters and the texture map of the deformable template.
Taking a vehicle as the first target object as an example, the deformable template is used to generate vehicles with different appearance shapes. The deformation parameters of the deformable template correspond to different vehicle appearance shapes. The overall exterior shape may vary from vehicle to vehicle, and the shapes of the various components that make up the vehicle may also vary. Corresponding deformable templates can be created according to the component shapes of different vehicle models. The texture map in the deformable template is a predefined texture contour, and the texture inside the contour can be filled in during image texture completion. By adjusting the deformation parameters of the deformable template and combining them with the completed texture map, a three-dimensional model of the vehicle can be generated.
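The exact parametrization of the deformable template is not given in the disclosure. One common choice, sketched below, treats each deformation parameter as the weight of a per-vertex shape basis added to the template mesh; the array shapes and names are assumptions.

```python
import numpy as np

def deform_template(template_vertices, shape_bases, deform_params):
    """Apply deformation parameters to a deformable template mesh.
    template_vertices: (N, 3) base mesh, shape_bases: (K, N, 3) per-vertex
    offset bases, deform_params: (K,) weights - one per deformation parameter."""
    offsets = np.tensordot(deform_params, shape_bases, axes=1)   # -> (N, 3)
    return template_vertices + offsets
```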
In the embodiment of the disclosure, the three-dimensional model reconstruction of the vehicle is realized through the deformable template and the texture complement, so that an automatic driving simulation database can be greatly enriched, and abundant resources are provided for the training of a perception system.
Fig. 4 is a flowchart of image projection of an image composition method according to another embodiment of the present disclosure. The image synthesizing method of this embodiment may include the steps of the above-described embodiments. In addition, as shown in fig. 4, in an embodiment, step S130 in fig. 1, projecting the three-dimensional model of the first target object to obtain the two-dimensional image of the first target object according to the azimuth information of the scene image at the second viewing angle may specifically include:
step S410, according to shooting parameters of the scene image of the second view angle, azimuth information of the scene image is obtained, wherein the azimuth information comprises a plane equation of the scene image;
step S420, adjusting the pose of the three-dimensional model of the first target object, and putting the three-dimensional model of the first target object on a plane determined by a plane equation;
step S430, projecting the placed three-dimensional model of the first target object to obtain a two-dimensional image of the first target object.
Wherein the photographing parameters of the scene image of the second view angle may include camera parameters. The camera parameters may include at least one of internal and external parameters of the camera. The internal parameters of the camera may include the focal length. The external parameters of the camera may include the camera position. In step S410, when capturing the captured scene image at the second angle of view, the capturing parameters may be simultaneously acquired. And obtaining the azimuth information of the scene image according to the shooting parameters. The azimuth information may include three-dimensional geometric information of the road scene. The three-dimensional geometric information may include plane equations, normal, and the like.
Taking the vehicle as the first target object as an example, in step S420, the pose of the three-dimensional model of the vehicle is adjusted according to the azimuth information of the scene image, and the three-dimensional model of the vehicle is placed on the plane determined by the plane equation. Adjusting the pose of the three-dimensional model of the vehicle according to the azimuth information of the scene image makes the placement position of the three-dimensional model in the road scene consistent with the three-dimensional geometric information of the road scene. In step S430, the placed three-dimensional model of the vehicle is projected to obtain a two-dimensional image of the vehicle.
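A minimal sketch of the placement and projection described above, assuming the plane equation is given as n·X + d = 0 with a unit normal n and the camera is a pinhole model with intrinsics K and extrinsics R, t; the helper names are illustrative.

```python
import numpy as np

def place_on_plane(vertices, plane_n, plane_d):
    """Translate the model along the plane normal so that its lowest point
    rests on the plane n.X + d = 0 (plane_n is assumed to be a unit vector)."""
    heights = vertices @ plane_n + plane_d       # signed distances to the plane
    return vertices - heights.min() * plane_n

def project_points(vertices, K, R, t):
    """Pinhole projection of (N, 3) world-space vertices into pixel coordinates."""
    cam = vertices @ R.T + t                     # world -> camera coordinates
    uv = cam @ K.T                               # camera -> homogeneous image
    return uv[:, :2] / uv[:, 2:3]                # divide by depth
```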
By projecting the three-dimensional model of the first target object according to the azimuth information of the scene image at the second view angle to obtain the two-dimensional image, the placement position of the three-dimensional model in the road scene is kept consistent with the three-dimensional geometric information of the road scene, making the synthesized image more realistic.
Fig. 5 is a flowchart of image restoration of an image synthesis method according to another embodiment of the present disclosure. The image synthesizing method of this embodiment may include the steps of the above-described embodiments. Furthermore, as shown in fig. 5, in one embodiment, the method further includes:
step S510, removing a second target object in the photographed standby image at the second viewing angle by using an image restoration method;
in step S520, the standby image from which the second target object is removed is taken as the scene image at the second viewing angle.
In this embodiment, after photographing based on the second angle of view, the photographed image is used as the standby image. And after the standby image is subjected to restoration processing, taking the standby image after the restoration processing as a scene image of the second visual angle. Pedestrians and vehicles may be present in the road scene of the photographed standby image of the second view angle. Taking the vehicle reconstructed by the three-dimensional model as a first target object as an example, pedestrians and vehicles in the standby image of the second view angle are taken as second target objects. The second target object may be removed from the standby image using an image restoration method, and the standby image from which the second target object is removed may be used as the scene image at the second view angle.
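One readily available image restoration method for this step is OpenCV inpainting, as sketched below; the sketch assumes a binary mask covering the second target objects (for example, produced by a detector or segmentation model) is already available.

```python
import cv2

def remove_second_targets(standby_img, object_mask):
    """Fill the masked regions (e.g. detected vehicles and pedestrians)
    from the surrounding background pixels."""
    # Arguments: source image, 8-bit mask, inpainting radius, algorithm flag.
    return cv2.inpaint(standby_img, object_mask, 3, cv2.INPAINT_TELEA)
```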
In the application scene of the vehicle-road coordination, the number of possible images of the second visual angle in the images shot by the visual sensor is small. By using the method, a large number of images at the second view angle can be generated, a large number of multi-view images can be provided for training the network model, and the robustness of the model is improved.
In one embodiment, the method further comprises:
and obtaining the annotation information of the composite image according to the position information of the first target object in the composite image.
The annotation information may include two-dimensional annotation information and three-dimensional annotation information. The two-dimensional annotation information may include at least one of a "two-dimensional bounding box" and an "instance-level segmentation". The "two-dimensional bounding box" includes annotation of the overall position of the vehicle. The "instance-level segmentation" includes segmenting the vehicle into components and marking the location of each component. The three-dimensional annotation information includes at least one of a "three-dimensional bounding box" and a "six-degree-of-freedom spatial pose".
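Because the first target object is placed into the scene synthetically, its mask and placed vertices are already known, so this annotation information can be derived directly instead of being labeled by hand. A minimal sketch under that assumption:

```python
import numpy as np

def bbox_2d_from_mask(mask):
    """Axis-aligned 2D bounding box (x_min, y_min, x_max, y_max) of a binary mask."""
    ys, xs = np.nonzero(mask)
    return int(xs.min()), int(ys.min()), int(xs.max()), int(ys.max())

def bbox_3d_from_vertices(vertices):
    """Axis-aligned 3D bounding box (min corner, max corner) of the placed
    model's (N, 3) vertices."""
    return vertices.min(axis=0), vertices.max(axis=0)
```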
The image synthesis method disclosed by the embodiment of the invention can synthesize images with multiple visual angles, and automatically generate corresponding two-dimensional annotation information and three-dimensional annotation information, thereby greatly reducing the cost of acquiring training data and effectively improving the robustness of the deep learning model.
Fig. 6 is a flowchart of an image synthesizing method according to another embodiment of the present disclosure. The various reference numerals in fig. 6 are as follows:
reference numeral 1 denotes a Source image (Source), which is a Front View;
reference numeral 2 denotes a Target image (Target), which is a Top View;
the reference numeral (a) denotes a deformable Vehicle Template and a six-degree-of-freedom space Pose notation (Vehicle Template & supported 6-DOF Pose);
reference numeral (b) denotes a component-based texture map completion (Part based Texture Inpainting);
reference numeral (c) denotes model-based view synthesis (Model based View Synthesis);
reference numeral (d) denotes a background image (Background Images with Camera Calibration) with camera calibration;
reference numeral (e) denotes a background image restoration (Background Inpainting);
the reference numeral (f) denotes the three-dimensional structure of the background image (3D Structure of Background);
the reference numeral (g) denotes the synthesis results at new view angles with ground-truth annotations (Novel-view Results with Ground-Truth Annotations).
Referring to fig. 1 to 6, as indicated by reference numeral (a) in fig. 6, for a vehicle object, the input information of the three-dimensional reconstruction task may include a single traffic scene image, the annotated six-degree-of-freedom spatial pose of each vehicle in the image, and a deformable template of the three-dimensional vehicle. The deformable template may contain texture maps. As indicated by reference numeral (b), image pixels are projected onto the texture map according to the annotated six-degree-of-freedom spatial pose. A deep neural network is then trained to fill in the missing areas of the texture map. As indicated by reference numeral (c), the deformation parameters of the deformable template of the three-dimensional vehicle model are then adjusted to produce a number of different three-dimensional vehicle models. The models are rendered in combination with the generated texture maps to obtain two-dimensional images of the vehicles.
As indicated by reference numeral (d), an image of an intersection can be acquired as the background image portion. The background image portion may be a scene image of the second view angle that serves as the background for the three-dimensional vehicle model. The internal and external parameters of the camera used to capture the image are calibrated in advance. As indicated by reference numeral (e), vehicles in the background image portion are removed using an existing image restoration (image inpainting) method. As indicated by reference numeral (f), the three-dimensional geometric information of the intersection is recovered using the internal and external parameters of the camera. The three-dimensional geometric information includes plane equations, normal directions, and the like. Finally, as indicated by reference numeral (g), the textured vehicles generated at reference numeral (c) are placed at random positions in the background image, that is, on the background road surface, and images of multiple view angles are synthesized. At the same time, the two-dimensional and three-dimensional annotation information corresponding to the synthesized images is obtained.
Fig. 7 is a schematic view of data diversity effect of an image synthesizing method according to another embodiment of the present disclosure.
The various reference numerals in fig. 7 are as follows:
The reference numeral (a1) denotes a real image (Input Real Images in AD) input in the automated driving system;
The reference numeral (b1) denotes texture map completion (Inpainted Texture Maps);
The reference numeral (c1) denotes a three-dimensional deformable template (3D Deformed Vehicle Models) of the vehicle;
The reference numeral (d1) denotes an output image (Output Images with various params) containing rich parameters.
As indicated by the reference numeral (a1), a real image input in the automated driving system is taken as the image including the first view angle of the first target object. As indicated by reference numeral (b1), texture complement processing is performed on the image including the first view angle of the first target object to obtain a texture map of the first target object. As indicated by reference numeral (c1), a three-dimensional model of the first target object may be generated from the deformation parameters of the deformable template and the texture map. Different appearance shapes of the vehicle produced by the deformable template can be randomly combined with the texture maps to generate a large number of three-dimensional vehicle models with different appearance shapes and textures. The output images are shown at (d1).
The image synthesis method of the embodiments of the present disclosure can ensure both the diversity and the realism of the generated data. As shown in figs. 6 and 7, embodiments of the present disclosure recover the texture map of a vehicle from images of a real traffic scene. The deformation parameters of the three-dimensional model are then adjusted to obtain a large number of three-dimensional vehicles with different shapes. The texture maps are then randomly combined with the three-dimensional vehicles of different shapes, and multi-view rendering is performed. During rendering, different camera parameters (internal and external), scene illumination, and the resolution of the generated images can also be adjusted. This method increases the diversity of the data as much as possible while ensuring image quality.
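A sketch of how such random combinations might be sampled for a single rendering pass; every parameter name and range below is an illustrative assumption rather than a value taken from the disclosure.

```python
import random

def sample_render_config(texture_maps, num_shape_params=10):
    """Randomly combine a texture map with shape, camera, lighting and
    resolution parameters for one synthetic rendering."""
    return {
        "texture": random.choice(texture_maps),
        "shape_params": [random.uniform(-1.0, 1.0) for _ in range(num_shape_params)],
        "focal_length_px": random.uniform(800, 1600),      # camera intrinsics
        "camera_height_m": random.uniform(4.0, 8.0),       # roadside-pole extrinsics
        "camera_pitch_deg": random.uniform(-60.0, -20.0),
        "light_intensity": random.uniform(0.5, 1.5),       # scene illumination
        "resolution": random.choice([(1280, 720), (1920, 1080)]),
    }
```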
Fig. 8 is a schematic diagram of an image synthesizing apparatus according to an embodiment of the present disclosure. Referring to fig. 8, the image synthesizing apparatus includes:
a processing unit 100, configured to perform texture complement processing on an image including a first view angle of a first target object, to obtain a texture map of the first target object;
a generating unit 200 for generating a three-dimensional model of the first target object using the texture map;
a projection unit 300, configured to project the three-dimensional model of the first target object according to the azimuth information of the scene image at the second view angle to obtain a two-dimensional image of the first target object;
and a superposition unit 400, configured to superimpose the two-dimensional image of the first target object onto the scene image, so as to obtain a composite image of the second viewing angle.
In one embodiment, the processing unit 100 is configured to:
segmenting an image comprising a first perspective of a first target object to obtain a segmented image comprising at least one component of the first target object;
marking the pose of the first target object in an image of a first visual angle comprising the first target object to obtain pose marking information;
projecting the segmented image according to the pose labeling information to obtain an image to be processed of the first target object;
and performing texture complement processing on the image to be processed by using the deep neural network to obtain a texture map of the first target object.
In one embodiment, the generating unit 200 is configured to:
obtaining deformation parameters of a deformable template of the first target object, wherein the deformation parameters correspond to the appearance shape of the first target object;
and generating a three-dimensional model of the first target object according to the deformation parameters of the deformable template and the texture map.
In one embodiment, the projection unit 300 is configured to:
obtaining azimuth information of the scene image according to shooting parameters of the scene image at the second view angle, wherein the azimuth information comprises a plane equation of the scene image;
adjusting the pose of the three-dimensional model of the first target object, and putting the three-dimensional model of the first target object on a plane determined by a plane equation;
and projecting the placed three-dimensional model of the first target object to obtain a two-dimensional image of the first target object.
Fig. 9 is a schematic diagram of an image synthesizing apparatus according to another embodiment of the present disclosure. As shown in fig. 9, in one embodiment, the apparatus further includes a repairing unit 220, where the repairing unit 220 is configured to:
removing a second target object in the photographed standby image at the second view angle by using an image restoration method;
and taking the standby image from which the second target object is removed as a scene image of the second visual angle.
In one embodiment, the apparatus further includes an labeling unit 500, where the labeling unit 500 is configured to:
and obtaining the annotation information of the composite image according to the position information of the first target object in the composite image.
The functions of each unit in the image synthesizing apparatus according to the embodiments of the present disclosure may be referred to the corresponding descriptions in the above methods, and are not described herein.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 10 illustrates a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 10, the apparatus 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the device 800 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the respective methods and processes described above, such as the image synthesizing method. For example, in some embodiments, the image synthesis method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the image synthesis method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the image synthesis method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (12)

1. An image synthesis method, comprising:
performing texture complement processing on an image comprising a first view angle of a first target object to obtain a texture map of the first target object;
generating a three-dimensional model of the first target object by using the texture map;
according to azimuth information of the scene image of the second view angle, projecting the three-dimensional model of the first target object to obtain a two-dimensional image of the first target object;
overlapping the two-dimensional image of the first target object into the scene image to obtain a composite image of a second view angle;
the performing texture complement processing on the image including the first view angle of the first target object to obtain a texture map of the first target object includes: dividing the image comprising the first view angle of the first target object to obtain a divided image comprising at least one component of the first target object; marking the pose of the first target object in the image comprising the first visual angle of the first target object to obtain pose marking information; projecting the segmented image according to the pose labeling information to obtain a to-be-processed image of the first target object; and performing texture complement processing on the image to be processed by using the deep neural network to obtain a texture map of the first target object.
2. The method of claim 1, wherein the generating a three-dimensional model of a first target object using the texture map comprises:
obtaining deformation parameters of a deformable template of a first target object, wherein the deformation parameters correspond to the appearance shape of the first target object;
and generating a three-dimensional model of the first target object according to the deformation parameters of the deformable template and the texture map.
3. The method according to any one of claims 1 to 2, wherein projecting the three-dimensional model of the first target object from the orientation information of the scene image at the second perspective to obtain a two-dimensional image of the first target object comprises:
obtaining azimuth information of a scene image according to shooting parameters of the scene image at a second view angle, wherein the azimuth information comprises a plane equation of the scene image;
adjusting the pose of the three-dimensional model of the first target object, and putting the three-dimensional model of the first target object on a plane determined by the plane equation;
and projecting the placed three-dimensional model of the first target object to obtain a two-dimensional image of the first target object.
4. The method of any one of claims 1 to 2, the method further comprising:
removing a second target object in the photographed standby image at the second view angle by using an image restoration method;
and taking the standby image from which the second target object is removed as the scene image of the second visual angle.
5. The method of any one of claims 1 to 2, the method further comprising:
and obtaining the annotation information of the composite image according to the position information of the first target object in the composite image.
6. An image synthesizing apparatus comprising:
a processing unit, configured to perform texture completion processing on a first-view-angle image comprising a first target object to obtain a texture map of the first target object;
a generating unit, configured to generate a three-dimensional model of the first target object by using the texture map;
a projection unit, configured to project the three-dimensional model of the first target object according to orientation information of a scene image at a second view angle to obtain a two-dimensional image of the first target object; and
a superposition unit, configured to superimpose the two-dimensional image of the first target object onto the scene image to obtain a composite image at the second view angle;
wherein the processing unit is configured to: segment the first-view-angle image comprising the first target object to obtain a segmented image comprising at least one component of the first target object; annotate a pose of the first target object in the first-view-angle image to obtain pose annotation information; project the segmented image according to the pose annotation information to obtain a to-be-processed image of the first target object; and perform texture completion processing on the to-be-processed image by using a deep neural network to obtain the texture map of the first target object.
7. The apparatus of claim 6, wherein the generating unit is configured to:
obtain deformation parameters of a deformable template of the first target object, wherein the deformation parameters correspond to an external shape of the first target object; and
generate the three-dimensional model of the first target object according to the deformation parameters of the deformable template and the texture map.
8. The apparatus of claim 6 or 7, wherein the projection unit is configured to:
obtain the orientation information of the scene image according to shooting parameters of the scene image at the second view angle, wherein the orientation information comprises a plane equation of the scene image;
adjust a pose of the three-dimensional model of the first target object, and place the three-dimensional model of the first target object on a plane determined by the plane equation; and
project the placed three-dimensional model of the first target object to obtain the two-dimensional image of the first target object.
9. The apparatus of claim 6 or 7, further comprising a restoration unit configured to:
remove a second target object from a captured spare image at the second view angle by using an image restoration method; and
take the spare image from which the second target object has been removed as the scene image at the second view angle.
10. The apparatus of claim 6 or 7, further comprising an annotation unit configured to:
obtain annotation information of the composite image according to position information of the first target object in the composite image.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium storing computer instructions for causing a computer to perform the method of any one of claims 1-5.
CN202011619097.6A 2020-12-30 2020-12-30 Image synthesizing method, apparatus, device, storage medium, and program product Active CN112651881B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011619097.6A CN112651881B (en) 2020-12-30 2020-12-30 Image synthesizing method, apparatus, device, storage medium, and program product

Publications (2)

Publication Number Publication Date
CN112651881A CN112651881A (en) 2021-04-13
CN112651881B (en) 2023-08-01

Family

ID=75366650

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011619097.6A Active CN112651881B (en) 2020-12-30 2020-12-30 Image synthesizing method, apparatus, device, storage medium, and program product

Country Status (1)

Country Link
CN (1) CN112651881B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11796670B2 (en) * 2021-05-20 2023-10-24 Beijing Baidu Netcom Science And Technology Co., Ltd. Radar point cloud data processing method and device, apparatus, and storage medium
CN113379763A (en) * 2021-06-01 2021-09-10 北京齐尔布莱特科技有限公司 Image data processing method, model generating method and image segmentation processing method
CN113610968B (en) * 2021-08-17 2024-09-20 北京京东乾石科技有限公司 Updating method and device of target detection model
CN114359312B (en) * 2022-03-17 2022-08-23 荣耀终端有限公司 Image processing method and device
CN117078509B (en) * 2023-10-18 2024-04-09 荣耀终端有限公司 Model training method, photo generation method and related equipment

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014071850A (en) * 2012-10-02 2014-04-21 Osaka Prefecture Univ Image processing apparatus, terminal device, image processing method, and program
CN104599243A (en) * 2014-12-11 2015-05-06 北京航空航天大学 Virtual and actual reality integration method of multiple video streams and three-dimensional scene
CN106803286A (en) * 2017-01-17 2017-06-06 湖南优象科技有限公司 Mutual occlusion real-time processing method based on multi-view image
CN107393017A (en) * 2017-08-11 2017-11-24 北京铂石空间科技有限公司 Image processing method, device, electronic equipment and storage medium
CN108765537A (en) * 2018-06-04 2018-11-06 北京旷视科技有限公司 A kind of processing method of image, device, electronic equipment and computer-readable medium
CN109697688A (en) * 2017-10-20 2019-04-30 虹软科技股份有限公司 A kind of method and apparatus for image procossing
CN109767485A (en) * 2019-01-15 2019-05-17 三星电子(中国)研发中心 Image processing method and device
CN109829969A (en) * 2018-12-27 2019-05-31 北京奇艺世纪科技有限公司 A kind of data capture method, device and storage medium
CN110223370A (en) * 2019-05-29 2019-09-10 南京大学 A method of complete human body's texture mapping is generated from single view picture
CN110223380A (en) * 2019-06-11 2019-09-10 中国科学院自动化研究所 Fusion is taken photo by plane and the scene modeling method of ground multi-view image, system, device
CN110490960A (en) * 2019-07-11 2019-11-22 阿里巴巴集团控股有限公司 A kind of composograph generation method and device
CN111783525A (en) * 2020-05-20 2020-10-16 中国人民解放军93114部队 Aerial photographic image target sample generation method based on style migration
CN112150575A (en) * 2020-10-30 2020-12-29 深圳市优必选科技股份有限公司 Scene data acquisition method, model training method, device and computer equipment

Also Published As

Publication number Publication date
CN112651881A (en) 2021-04-13

Similar Documents

Publication Title
CN112651881B (en) Image synthesizing method, apparatus, device, storage medium, and program product
CN111783820B (en) Image labeling method and device
CN109003325B (en) Three-dimensional reconstruction method, medium, device and computing equipment
CN109683699B (en) Method and device for realizing augmented reality based on deep learning and mobile terminal
US9437034B1 (en) Multiview texturing for three-dimensional models
WO2023093739A1 (en) Multi-view three-dimensional reconstruction method
CN111382618B (en) Illumination detection method, device, equipment and storage medium for face image
CN107203962B (en) Method for making pseudo-3D image by using 2D picture and electronic equipment
CN108734773A (en) A kind of three-dimensional rebuilding method and system for mixing picture
CN112330815A (en) Three-dimensional point cloud data processing method, device and equipment based on obstacle fusion
CN114022542A (en) Three-dimensional reconstruction-based 3D database manufacturing method
CN115008454A (en) Robot online hand-eye calibration method based on multi-frame pseudo label data enhancement
CN107203961B (en) Expression migration method and electronic equipment
CN113486941B (en) Live image training sample generation method, model training method and electronic equipment
CN117011474B (en) Fisheye image sample generation method, device, computer equipment and storage medium
US20140306953A1 (en) 3D Rendering for Training Computer Vision Recognition
CN114299230A (en) Data generation method and device, electronic equipment and storage medium
Guo et al. Full-automatic high-precision scene 3D reconstruction method with water-area intelligent complementation and mesh optimization for UAV images
CN115063485B (en) Three-dimensional reconstruction method, device and computer-readable storage medium
CN116468796A (en) Method for generating representation from bird's eye view, vehicle object recognition system, and storage medium
CN109089100B (en) Method for synthesizing binocular stereo video
Fechteler et al. Articulated 3D model tracking with on-the-fly texturing
JP6641313B2 (en) Region extraction device and program
US20240153207A1 (en) Systems, methods, and media for filtering points of a point cloud utilizing visibility factors to generate a model of a scene
Liu et al. Image-based rendering for large-scale outdoor scenes with fusion of monocular and multi-view stereo depth

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant