CN117152352A - Image processing method, deep learning model training method and device - Google Patents

Image processing method, deep learning model training method and device

Info

Publication number
CN117152352A
Authority
CN
China
Prior art keywords
image
sample
target
dimensional image
key point
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311016283.4A
Other languages
Chinese (zh)
Inventor
周航
徐志良
朱家树
梁柏荣
刘经拓
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202311016283.4A priority Critical patent/CN117152352A/en
Publication of CN117152352A publication Critical patent/CN117152352A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T17/00Three dimensional [3D] modelling, e.g. data description of 3D objects
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/0464Convolutional networks [CNN, ConvNet]
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • G06N3/088Non-supervised learning, e.g. competitive learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T15/003D [Three Dimensional] image rendering
    • G06T15/005General purpose rendering architectures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/10Segmentation; Edge detection
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V40/00Recognition of biometric, human-related or animal-related patterns in image or video data
    • G06V40/10Human or animal bodies, e.g. vehicle occupants or pedestrians; Body parts, e.g. hands
    • G06V40/16Human faces, e.g. facial parts, sketches or expressions
    • G06V40/174Facial expression recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20081Training; Learning
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20084Artificial neural networks [ANN]

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Molecular Biology (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Computer Graphics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Geometry (AREA)
  • Multimedia (AREA)
  • Human Computer Interaction (AREA)
  • Oral & Maxillofacial Surgery (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides an image processing method, a deep learning model training method, and a device, relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning, and the like, and can be applied to metaverse scenarios. A specific implementation of the image processing method is as follows: performing a key point alignment operation on an image to be processed based on a predetermined image to obtain a target two-dimensional image; extracting image features of the target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters for three-dimensionally reconstructing the image to be processed; identifying the image features to obtain target parameters for three-dimensionally reconstructing the image to be processed; and three-dimensionally reconstructing the target two-dimensional image based on the target parameters to obtain a target three-dimensional image.

Description

Image processing method, deep learning model training method and device
Technical Field
The disclosure relates to the technical field of artificial intelligence, in particular to the technical fields of computer vision, augmented reality, virtual reality, deep learning and the like, and specifically relates to an image processing method, a deep learning model training method and a device.
Background
Three-dimensional reconstruction based on monocular vision refers to the process of deriving depth information from the features of one or more two-dimensional images and reconstructing a three-dimensional image from that depth information.
With the wide application of three-dimensional reconstruction technology in fields such as film and television special effects, virtual avatars, AR (Augmented Reality), and VR (Virtual Reality), the requirements on the reconstruction precision of three-dimensional images are also increasing.
Disclosure of Invention
The disclosure provides an image processing method, a deep learning model training method and a device.
According to an aspect of the present disclosure, there is provided an image processing method including: performing key point alignment operation on an image to be processed based on a preset image to obtain a target two-dimensional image; extracting image features of a target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters for three-dimensionally reconstructing an image to be processed; identifying the image characteristics to obtain target parameters for three-dimensionally reconstructing the image to be processed; and carrying out three-dimensional reconstruction on the target two-dimensional image based on the target parameters to obtain a target three-dimensional image.
According to another aspect of the present disclosure, there is provided a training method of a deep learning model, including: the following operations are performed on the sample two-dimensional image using the initial model: performing key point alignment operation on the sample two-dimensional image based on the preset image to obtain a target sample two-dimensional image; carrying out local mask processing on the two-dimensional image of the target sample to obtain a mask image; extracting mask image features of the mask image and sample image features of the target sample two-dimensional image; identifying mask image features to obtain target sample parameters for three-dimensionally reconstructing a two-dimensional image of a sample; based on the target sample parameters, carrying out three-dimensional reconstruction on the target sample two-dimensional image to obtain a sample three-dimensional image corresponding to the sample two-dimensional image; obtaining a loss value according to the mask image characteristics, the sample two-dimensional image and the sample three-dimensional image based on the target loss function; and adjusting model parameters of the initial model based on the loss value to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an image processing apparatus including: the device comprises a first alignment module, a first feature extraction module, a first feature identification module and a first feature reconstruction module. The first alignment module is used for executing key point alignment operation on the image to be processed based on the preset image to obtain a target two-dimensional image. The first feature extraction module is used for extracting image features of the target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters used for three-dimensionally reconstructing the image to be processed. The first feature recognition module is used for recognizing the image features to obtain target parameters for three-dimensionally reconstructing the image to be processed. The first feature reconstruction module is used for carrying out three-dimensional reconstruction on the target two-dimensional image based on the target parameters to obtain a target three-dimensional image.
According to another aspect of the present disclosure, there is provided a training apparatus of a deep learning model, including: the device comprises a second alignment module, a mask module, a second feature extraction module, a second feature identification module, a second feature reconstruction module, a loss calculation module and an adjustment module. And the second alignment module is used for executing key point alignment operation on the sample two-dimensional image based on the preset image to obtain the target sample two-dimensional image. And the mask module is used for carrying out local mask processing on the target sample two-dimensional image to obtain a mask image. And the second feature extraction module is used for extracting mask image features of the mask image and sample image features of the target sample two-dimensional image. And the second feature recognition module is used for recognizing the mask image features to obtain target sample parameters for three-dimensionally reconstructing the two-dimensional image of the sample. And the second characteristic reconstruction module is used for carrying out three-dimensional reconstruction on the two-dimensional image of the target sample based on the parameters of the target sample to obtain a sample three-dimensional image corresponding to the two-dimensional image of the sample. And the loss calculation module is used for obtaining a loss value according to the mask image characteristics, the sample two-dimensional image and the sample three-dimensional image based on the target loss function. And the adjusting module is used for adjusting the model parameters of the initial model based on the loss value to obtain a trained deep learning model.
According to another aspect of the present disclosure, there is provided an electronic device including: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing the computer to perform the method as above.
According to another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, implements a method as above.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the disclosure, nor is it intended to be used to limit the scope of the disclosure. Other features of the present disclosure will become apparent from the following specification.
Drawings
The drawings are for a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 schematically illustrates an exemplary system architecture of a training method and apparatus to which an image processing method or a deep learning model may be applied, according to an embodiment of the present disclosure;
FIG. 2 schematically illustrates a flow chart of an image processing method according to an embodiment of the disclosure;
FIG. 3 schematically illustrates a schematic diagram of performing a keypoint alignment process on an image to be processed based on a predetermined image in accordance with an embodiment of the present disclosure;
fig. 4 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure;
FIG. 5 schematically illustrates a flow chart of a training method of a deep learning model according to an embodiment of the present disclosure;
FIG. 6 schematically illustrates a framework diagram of a training method of a deep learning model according to an embodiment of the present disclosure;
fig. 7 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure; and
fig. 9 schematically illustrates a block diagram of an electronic device adapted to implement an image processing method according to an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Three-dimensional reconstruction methods in the related art mainly fall into the following two categories: 1. performing three-dimensional reconstruction based on virtual three-dimensional data and a three-dimensional model; 2. training a model by unsupervised learning based on two-dimensional image data, for example the Detailed Expression Capture and Animation (DECA) model.
However, virtual three-dimensional data is difficult to acquire, and the reconstruction parameters output by a model trained with unsupervised learning on two-dimensional image data deviate considerably from the three-dimensional parameters of the original image; the texture fitting also shows large differences, so the reconstruction effect is poor.
In view of this, embodiments of the present disclosure provide an image processing method that extracts image features matched with the reconstruction parameters used to three-dimensionally reconstruct an image to be processed, and identifies these image features to obtain more accurate reconstruction parameters. Reconstructing the two-dimensional image based on these reconstruction parameters improves the reconstruction accuracy of the image.
Fig. 1 schematically illustrates an exemplary system architecture of a training method and apparatus to which an image processing method or a deep learning model may be applied according to an embodiment of the present disclosure.
It should be noted that fig. 1 is only an example of a system architecture to which embodiments of the present disclosure may be applied to assist those skilled in the art in understanding the technical content of the present disclosure, but does not mean that embodiments of the present disclosure may not be used in other devices, systems, environments, or scenarios. For example, in another embodiment, an exemplary system architecture to which the image processing method and apparatus may be applied may include a terminal device, but the terminal device may implement the image processing method and apparatus provided by the embodiments of the present disclosure without interacting with a server.
As shown in fig. 1, a system architecture 100 according to this embodiment may include a first terminal device 101, a second terminal device 102, a third terminal device 103, a network 104, and a server 105. The network 104 is a medium used to provide a communication link between the first terminal device 101, the second terminal device 102, the third terminal device 103, and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
The user may interact with the server 105 via the network 104 using the first terminal device 101, the second terminal device 102, the third terminal device 103, to receive or send messages etc. Various communication client applications, such as a knowledge reading class application, a web browser application, a search class application, an instant messaging tool, a mailbox client and/or social platform software, etc. (by way of example only) may be installed on the first terminal device 101, the second terminal device 102, the third terminal device 103.
The first terminal device 101, the second terminal device 102, the third terminal device 103 may be various electronic devices having a display screen and supporting web browsing, including but not limited to smartphones, tablets, laptop and desktop computers, and the like.
The server 105 may be a server providing various services, such as a background management server (merely an example) providing support for content browsed by the user with the first terminal apparatus 101, the second terminal apparatus 102, the third terminal apparatus 103. The background management server may analyze and process the received data such as the user request, and feed back the processing result (e.g., the web page, information, or data obtained or generated according to the user request) to the terminal device.
It should be noted that, the image processing method or the training method of the deep learning model provided in the embodiments of the present disclosure may be generally executed by the first terminal device 101, the second terminal device 102, and the third terminal device 103. Accordingly, the image processing apparatus provided by the embodiments of the present disclosure may also be provided in the first terminal device 101, the second terminal device 102, and the third terminal device 103.
Alternatively, the image processing method provided by the embodiments of the present disclosure may be generally performed by the server 105. Accordingly, the image processing apparatus provided by the embodiments of the present disclosure may be generally provided in the server 105. The image processing method provided by the embodiment of the present disclosure may also be performed by a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105. Accordingly, the image processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and is capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103, and/or the server 105.
For example, the first terminal device 101, the second terminal device 102, and the third terminal device 103 may acquire an image to be processed, then send the acquired image to be processed to the server 105, and the server 105 performs a key point alignment operation on the image to be processed based on a predetermined image, so as to obtain a target two-dimensional image; extracting image features of a target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters for three-dimensionally reconstructing an image to be processed; identifying the image characteristics to obtain target parameters for three-dimensionally reconstructing the image to be processed; and carrying out three-dimensional reconstruction on the target two-dimensional image based on the target parameters to obtain a target three-dimensional image. Or by a server or server cluster capable of communicating with the first terminal device 101, the second terminal device 102, the third terminal device 103 and/or the server 105, performing operations such as alignment, image feature extraction, feature recognition, three-dimensional reconstruction and the like on the image to be processed, and finally obtaining a target three-dimensional image.
It should be understood that the number of terminal devices, networks and servers in fig. 1 is merely illustrative. There may be any number of terminal devices, networks, and servers, as desired for implementation.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing, applying and the like of the personal information of the user all conform to the regulations of related laws and regulations, necessary security measures are adopted, and the public order harmony is not violated.
In the technical scheme of the disclosure, the authorization or consent of the user is obtained before the personal information of the user is obtained or acquired.
Fig. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 2, the method includes operations S210 to S240.
In operation S210, a key point alignment operation is performed on an image to be processed based on a predetermined image, resulting in a target two-dimensional image.
In operation S220, image features of a target two-dimensional image are extracted.
In operation S230, the image features are identified, resulting in target parameters for three-dimensional reconstruction of the image to be processed.
In operation S240, the target two-dimensional image is three-dimensionally reconstructed based on the target parameters, to obtain a target three-dimensional image.
According to an embodiment of the present disclosure, the image to be processed may be a two-dimensional image acquired for a target site of a target object. The target site may be any site of the target object, for example a face. The predetermined image may be an image of a predetermined pixel size containing key points of the target site. For example, in the case where the target site is a face, the key points of the target site may be the eyes or nose of the target object. The predetermined pixel size may be, for example, 256×256 pixels.
According to an embodiment of the present disclosure, performing the key point alignment operation on the image to be processed may mean making the key point coordinates of the image to be processed lie on the same horizontal line as, or coincide with, the key point coordinates in the predetermined image.
For example: in the image to be processed, the key point of the target object may be the pupil center point O1(m, y1) of the target object. The predetermined image may include a plurality of key points of different parts of the target object, and a standard point O2(n, y2) corresponding to the pupil center point in the image to be processed may be determined from the plurality of key points. The image to be processed can be adjusted so that the pupil center point O1(m, y1) coincides with the standard point O2(n, y2), thereby obtaining the target two-dimensional image.
According to embodiments of the present disclosure, the target two-dimensional image may be input into a trained visual deformation network (Vision Transformer, ViT for short) to obtain image features. The image features characterize target features that match the reconstruction parameters for three-dimensionally reconstructing the image to be processed.
According to the embodiment of the disclosure, the reconstruction parameters for three-dimensionally reconstructing the image to be processed may be n, and correspondingly, the image features may be identified by n linear layers, each linear layer outputting one three-dimensional reconstruction parameter.
According to embodiments of the present disclosure, a 3DMM (3D Morphable Model, a 3D deformation statistical model) may be utilized to reconstruct the target three-dimensional image based on the target parameters. The target parameters may include expression parameters, texture parameters, illumination parameters, camera parameters, and identity parameters.
According to the embodiment of the disclosure, the more accurate reconstruction parameters are obtained by extracting the image characteristics matched with the reconstruction parameters for reconstructing the image to be processed in three dimensions and identifying the image characteristics. And reconstructing the two-dimensional image based on the reconstruction parameters, so that the reconstruction accuracy of the image can be improved.
According to an embodiment of the present disclosure, performing a key point alignment operation on an image to be processed based on a predetermined image to obtain a target two-dimensional image may include the following operations: performing key point detection on an image to be processed to obtain a first key point group; performing key point detection on the preset image to obtain a second key point group; and performing alignment operation on the image to be processed based on the corresponding relation between the key points in the first key point group and the second key point group to obtain a target two-dimensional image.
For example: and performing key point detection on the image to be processed by using a key point detection algorithm, wherein the obtained first key point group can comprise first coordinates of points of key parts such as eyes, nose, mouth and the like. Similarly, the key point detection algorithm is used for detecting the key points of the preset image, and the obtained second key point group can comprise second coordinates of points of key parts such as eyes, nose, mouth and the like.
According to an embodiment of the present disclosure, the predetermined image may be an image of a predetermined pixel size employed in training the visual deformation network.
According to embodiments of the present disclosure, correspondence between keypoints in the first and second keypoint groups may characterize correspondence between different types of keypoints, for example: the keypoints of the eye parts of the target object in the image to be processed correspond to the keypoints of the eye parts of the predetermined object in the predetermined image. The key points of the nose part of the target object in the image to be processed correspond to the key points of the nose part of the predetermined object in the predetermined image.
According to embodiments of the present disclosure, a key point alignment algorithm, for example the Procrustes analysis algorithm, may be utilized to perform the alignment operation on the image to be processed to obtain the target two-dimensional image.
According to an embodiment of the present disclosure, based on a correspondence between keypoints in a first keypoint group and a second keypoint group, performing an alignment operation on an image to be processed to obtain a target two-dimensional image may include the following operations: obtaining position offset information between key points with corresponding relation from the first key point group and the second key point group based on the corresponding relation between the key points in the first key point group and the second key point group; and scaling the image to be processed according to the position offset information to obtain a target two-dimensional image.
For example: the first set of keypoints may include three keypoints of the eye portion of the target object. The three keypoints may include two keypoints at the head and tail ends of the eye and a keypoint of the pupil position of the eye. The second set of keypoints may comprise a plurality of keypoints of the eye portion of the target object. The plurality of keypoints may include a plurality of keypoints on the upper eyelid, a plurality of keypoints on the lower eyelid, and keypoints of eye pupil position.
Since the eye shapes of different target objects differ to some extent, the positions of the key points at the head and tail ends of the eyes also differ. Therefore, among the plurality of key points on the upper eyelid in the second key point group, the key point closest to the head or tail end of the eye in the first key point group can be determined as the key point having a correspondence with the first key point group. The position offset information between this key point and the key point at the head or tail end of the eye in the first key point group is then used as the position offset information between the first key point group and the second key point group.
According to an embodiment of the present disclosure, the positional shift information between the first key point group and the second key point group may also be an average value of positional shifts between a plurality of key points having a correspondence relationship, or a weighted sum value.
According to the embodiment of the disclosure, based on the position offset information, the image to be processed may be scaled such that the key points in the image to be processed coincide with the key points in the predetermined image or are located at the same horizontal position.
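As a non-limiting illustration of the alignment step described above, the following Python sketch estimates a similarity transform from the detected key points to the corresponding key points of the predetermined image and warps the image accordingly. The use of OpenCV's estimateAffinePartial2D (a Procrustes-style least-squares fit), the function name, and the output size are assumptions of this sketch, not details taken from the disclosure.

```python
# Hypothetical sketch of the key point alignment step: estimate a similarity
# transform (rotation + uniform scale + translation) that maps the detected
# key points of the image to be processed onto the corresponding key points of
# the predetermined image, then warp the image accordingly.
import numpy as np
import cv2


def align_to_template(image, src_kpts, dst_kpts, out_size=(256, 256)):
    """Align `image` so that `src_kpts` coincide with `dst_kpts`.

    src_kpts, dst_kpts: (N, 2) arrays of corresponding key point coordinates.
    """
    src = np.asarray(src_kpts, dtype=np.float32)
    dst = np.asarray(dst_kpts, dtype=np.float32)
    # Least-squares similarity transform, in the spirit of a Procrustes analysis.
    matrix, _ = cv2.estimateAffinePartial2D(src, dst)
    # Warp (scale/translate/rotate) the image onto the predetermined template.
    return cv2.warpAffine(image, matrix, out_size)
```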
Fig. 3 schematically illustrates a schematic diagram of performing a key point alignment process on an image to be processed based on a predetermined image according to an embodiment of the present disclosure.
As shown in fig. 3, in embodiment 300, key point detection is performed on the image 311 to be processed to obtain a first key point group 313 consisting of three key points P1, P2, and P3. Key point detection is performed on the predetermined image 312 to obtain a second key point group 314 consisting of three key points m1, m2, and m3 corresponding to P1, P2, and P3. Based on the correspondence between the key points, i.e. P1 corresponds to m1, P2 corresponds to m2, and P3 corresponds to m3, a key point position offset 315 is obtained. The image 311 to be processed is scaled based on the key point position offset 315 to obtain a target two-dimensional image 316 that is aligned with the key points of the predetermined image 312.
According to the embodiments of the present disclosure, by performing the key point alignment operation on the image to be processed, the image features extracted from the target two-dimensional image are aligned with the image features of the sample two-dimensional images extracted during visual deformation network training, so that accurate three-dimensional reconstruction parameters can be obtained by identifying the image features.
According to an embodiment of the present disclosure, extracting image features of a target two-dimensional image may include the operations of: dividing the target two-dimensional image according to a preset pixel size to obtain a plurality of pixel blocks; and processing the plurality of pixel blocks based on a self-attention mechanism to obtain image features.
For example: the size of the target two-dimensional image may be 256×256 pixels, the predetermined pixel size may be 16×16 pixels, and pixel division of the target two-dimensional image according to the predetermined pixel size may result in 256 pixel blocks. Since the key points of the target two-dimensional image are aligned with the key points of the predetermined image in the foregoing description, when the target two-dimensional image is subjected to pixel segmentation, the target two-dimensional image is also subjected to semantic segmentation, and the obtained 256 pixel blocks are pixel blocks including the local image characteristics of the target part of the target object.
According to the embodiment of the disclosure, a plurality of pixel blocks can be input into a trained visual deformation network according to the input format of the visual deformation network, each pixel block corresponds to a local image feature, and an image feature vector formed by the local image features of the plurality of pixel blocks can be obtained.
According to the embodiment of the disclosure, in the visual deformation network, the local image characteristic of each pixel block can correspond to one weight parameter, and the weight parameter of the visual deformation network is continuously learned and updated in the training process, so that for the trained visual deformation network, the image characteristic matched with the reconstruction parameter can be extracted from the target two-dimensional image, thereby realizing the technical effect of improving the reconstruction precision.
According to the embodiment of the disclosure, the visual deformation network can extract the image characteristics capable of accurately predicting the three-dimensional reconstruction parameters based on the local image characteristics of each pixel block by carrying out pixel segmentation on the target two-dimensional image, so that the three-dimensional reconstruction accuracy is improved.
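A minimal sketch of the pixel-block segmentation and self-attention feature extraction described above is given below. The embedding dimension, network depth, and module names are illustrative assumptions rather than values from the disclosure.

```python
# Minimal sketch (assumed shapes and module names) of splitting the aligned
# 256x256 target two-dimensional image into 16x16 pixel blocks and encoding
# them with a Vision Transformer to obtain one feature per pixel block.
import torch
import torch.nn as nn


class PatchEncoder(nn.Module):
    def __init__(self, img_size=256, patch_size=16, dim=768, depth=6, heads=8):
        super().__init__()
        self.num_patches = (img_size // patch_size) ** 2          # 256 blocks
        self.to_tokens = nn.Conv2d(3, dim, kernel_size=patch_size,
                                   stride=patch_size)             # patch embedding
        self.pos_emb = nn.Parameter(torch.zeros(1, self.num_patches, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x):                           # x: (B, 3, 256, 256)
        tokens = self.to_tokens(x)                  # (B, dim, 16, 16)
        tokens = tokens.flatten(2).transpose(1, 2)  # (B, 256, dim)
        return self.encoder(tokens + self.pos_emb)  # one feature per pixel block
```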
According to an embodiment of the present disclosure, identifying image features to obtain target parameters for three-dimensional reconstruction of an image to be processed may include the following operations: carrying out expression parameter identification on the image characteristics to obtain target expression parameters; carrying out texture parameter identification on the image characteristics to obtain target texture parameters; carrying out illumination parameter identification on the image characteristics to obtain target illumination parameters; identifying acquisition equipment parameters of the image characteristics to obtain target acquisition equipment parameters; and carrying out identity parameter identification on the image characteristics to obtain target identity parameters.
According to embodiments of the present disclosure, for a 3DMM model, the expression parameters may characterize facial deformation of a target object in a two-dimensional image. The texture parameters may characterize the facial contours of the target object in the two-dimensional image. The illumination parameters may be used to compensate for the effect of the illumination intensity in the two-dimensional image on the texture parameters. The parameters of the image acquisition equipment can be used for compensating acquisition errors when different image acquisition equipment acquires images of the target object. The identity parameter may characterize the identity of the target object.
According to embodiments of the present disclosure, the target parameters may be understood as weighted values in the 3DMM model with respect to identity, expression, texture, illumination. Each dimension parameter of the 3DMM model controls local changes of the face.
According to embodiments of the present disclosure, one CNN (Convolutional Neural Network) may be built for each parameter, with recognition of the features being achieved through training.
It should be noted that, the local image features of the multiple pixel blocks may be first subjected to feature stitching to obtain an image feature vector, and then the image feature vector is input into five CNN networks in parallel to perform feature recognition to obtain the target parameter.
According to the embodiments of the present disclosure, the reconstruction parameters are obtained by identifying the image features as a whole, which improves their accuracy; compared with estimating reconstruction parameters only from the image features of key points as in the related art, the information in the image features remains complete and comprehensive.
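The parallel parameter-recognition stage could, for instance, be organized as in the sketch below. The disclosure mentions one CNN per parameter, whereas this sketch uses simple linear heads and assumed parameter dimensions for brevity; none of the names or sizes are taken from the patent.

```python
# Illustrative sketch of the parameter-recognition stage: the per-block image
# features are pooled and fed, in parallel, to one small head per parameter
# group (expression, texture, identity, camera, illumination).
import torch
import torch.nn as nn


class ParameterHeads(nn.Module):
    def __init__(self, dim=768, dims=None):
        super().__init__()
        # Output dimensions below are assumptions, not values from the patent.
        dims = dims or {"expression": 64, "texture": 80, "identity": 80,
                        "camera": 3, "illumination": 27}
        self.heads = nn.ModuleDict({name: nn.Linear(dim, out_dim)
                                    for name, out_dim in dims.items()})

    def forward(self, features):               # features: (B, 256, dim)
        pooled = features.mean(dim=1)          # aggregate the pixel-block features
        return {name: head(pooled) for name, head in self.heads.items()}
```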
Fig. 4 schematically illustrates a schematic diagram of an image processing method according to an embodiment of the present disclosure.
As shown in fig. 4, in embodiment 400, a target two-dimensional image 316 is pixel-segmented by a predetermined pixel size to obtain a segmented image 411. The segmented image 411 includes a plurality of pixel blocks 421, and the plurality of pixel blocks 421 are input into a visual deformation network (Vision Transformer) to output image features 423. The image features are respectively input into an expression parameter identification network 431, a texture parameter identification network 432, an identity parameter identification network 433, an image acquisition device parameter identification network 434, and an illumination parameter identification network 435, which output expression parameters, texture parameters, identity parameters, image acquisition device parameters, and illumination parameters. Based on these expression, texture, identity, image acquisition device, and illumination parameters, a 3D Morphable Model (3DMM, a 3D deformation statistical model) 441 is utilized for three-dimensional reconstruction, and a target three-dimensional image 442 is obtained.
According to an embodiment of the present disclosure, based on a target parameter, performing three-dimensional reconstruction on an image to be processed to obtain a target three-dimensional image may include the following operations: processing the image to be processed based on the target texture parameters to obtain a target texture image; processing the image to be processed based on the target acquisition equipment parameters, the target expression parameters and the target identity parameters to obtain target three-dimensional point cloud data; and rendering the target three-dimensional point cloud data based on the target illumination parameters and the target texture image to obtain a target three-dimensional image.
For example: the 3DMM network can be utilized to process the image to be processed based on the target texture parameters, and the texture map comprising texture information such as color, brightness and the like of the target object in the image to be processed can be obtained. Based on the target expression parameter, the target identity parameter and the target acquisition equipment parameter, three-dimensional point cloud data of a target object in the image to be processed can be obtained, and the three-dimensional point cloud data can be understood as a face 3D model of the target object consisting of the three-dimensional point cloud.
According to the embodiment of the disclosure, a differentiable renderer can be utilized to render from vertex data in the target three-dimensional point cloud data based on illumination parameters and the target texture image, so as to obtain a target three-dimensional image.
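For orientation only, a rough sketch of how the target parameters might be combined in a 3DMM-style reconstruction is shown below. The basis tensors and the differentiable renderer are placeholders, the shapes are schematic (unbatched), and nothing here is presented as the patent's actual implementation.

```python
# Hedged sketch of the reconstruction stage. A real implementation would use a
# 3DMM basis (mean shape plus identity/expression/texture bases) and a
# differentiable renderer; the basis tensors and renderer below are placeholders.
import torch


def reconstruct(params, basis, renderer):
    """params: dict of target parameters predicted from the image features."""
    # Shape: mean shape + identity and expression blendshapes -> 3D point cloud.
    vertices = (basis["mean_shape"]
                + basis["id_basis"] @ params["identity"]
                + basis["exp_basis"] @ params["expression"])
    # Texture map from the texture parameters.
    albedo = basis["mean_tex"] + basis["tex_basis"] @ params["texture"]
    # Differentiable rendering with camera and illumination compensation.
    return renderer(vertices, albedo,
                    camera=params["camera"], lighting=params["illumination"])
```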
It should be noted that, based on the target parameter, the technology of performing three-dimensional reconstruction by using the 3DMM network is a mature technology, which is not described herein.
According to the embodiment of the disclosure, the target parameters are obtained based on the image features extracted by the visual deformation network, so that the image information in the relatively complete target two-dimensional image is covered, and the target three-dimensional image can be relatively accurately reconstructed.
Fig. 5 schematically illustrates a flowchart of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 5, the training method 500 may include performing the following operations S510 to S570 on a sample two-dimensional image using an initial model.
In operation S510, a key point alignment operation is performed on the sample two-dimensional image based on the predetermined image, resulting in a target sample two-dimensional image.
In operation S520, a partial masking process is performed on the target sample two-dimensional image to obtain a masked image.
In operation S530, mask image features of the mask image and sample image features of the target sample two-dimensional image are extracted.
In operation S540, the mask image features are identified, resulting in target sample parameters for three-dimensionally reconstructing a two-dimensional image of the sample.
In operation S550, the target sample two-dimensional image is three-dimensionally reconstructed based on the target sample parameters, and a sample three-dimensional image corresponding to the sample two-dimensional image is obtained.
In operation S560, a loss value is obtained from the mask image feature, the sample two-dimensional image, and the sample three-dimensional image based on the target loss function.
In operation S570, model parameters of the initial model are adjusted based on the loss values, resulting in a trained deep learning model.
According to the embodiments of the present disclosure, performing the key point alignment operation on the sample two-dimensional image based on the predetermined image is the same as the process of performing the key point alignment operation on the image to be processed based on the predetermined image in the image processing method described above, and is not repeated here.
According to the embodiment of the disclosure, 50% of the area of the two-dimensional image of the target sample can be randomly and locally masked, so as to obtain a mask image. For example: the masked region is assigned a special label (token), which is a learnable parameter in the model training process.
According to embodiments of the present disclosure, the initial model may include a feature extraction module, a feature recognition module, and a feature reconstruction module. The feature extraction module may be a visual morphing network, the feature recognition module may include 5 CNN networks in parallel, and the feature reconstruction module may be a 3DMM network. The output end of the feature extraction module is connected with the input ends of 5 CNN networks in parallel so as to input the extracted mask image features into the 5 CNN networks in parallel and output 5 target sample parameters (expression parameters, texture parameters, identity parameters, image acquisition equipment parameters and illumination parameters). Inputting 5 target sample parameters into a 3DMM network, and carrying out three-dimensional reconstruction on a target sample two-dimensional image to obtain a sample three-dimensional image.
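One possible way to assemble such an initial model is sketched below, reusing the components sketched earlier. The module names and interfaces are assumptions of this illustration, not identifiers from the disclosure.

```python
# Rough sketch of assembling the initial model: a feature-extraction module
# (Vision Transformer), parallel parameter-recognition heads, and a 3DMM-based
# feature-reconstruction module (with a differentiable renderer inside).
import torch.nn as nn


class InitialModel(nn.Module):
    def __init__(self, encoder, heads, reconstructor):
        super().__init__()
        self.encoder = encoder              # e.g. the PatchEncoder sketched above
        self.heads = heads                  # e.g. the ParameterHeads sketched above
        self.reconstructor = reconstructor  # 3DMM + differentiable renderer

    def forward(self, masked_image, original_image):
        mask_feats = self.encoder(masked_image)      # mask image features
        sample_feats = self.encoder(original_image)  # sample image features
        params = self.heads(mask_feats)              # target sample parameters
        reconstruction = self.reconstructor(params)  # sample three-dimensional image
        return mask_feats, sample_feats, params, reconstruction
```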
According to an embodiment of the present disclosure, the objective loss function may include any one of the following: cross entropy loss function, L1 loss function.
According to embodiments of the present disclosure, a loss function based on mask image features and sample image features may be understood as a mask restoration loss function. In the training process, the model parameters of the visual deformation network are continuously adjusted, so that when the visual deformation network is utilized to extract the image features, the similarity between the local image features at the mask position and the sample image features at the same position corresponding to the mask position is larger, the similarity between the local image features at the mask position and the sample image features at other positions not corresponding to the mask position is smaller, and the visual deformation network can extract more complete image information.
According to embodiments of the present disclosure, a loss function based on a sample two-dimensional image and a sample three-dimensional image may be understood as a three-dimensional reconstruction loss function. The three-dimensional reconstruction loss function may employ an L1 loss function.
According to the embodiment of the disclosure, during training, the model parameters of the visual deformation network may be adjusted based on the loss value of the mask restoration loss function, and when the loss value of the mask restoration loss function reaches the convergence condition, the model parameters of the visual deformation network may be fixed. And then, based on the three-dimensional reconstruction loss function, adjusting model parameters of other modules in the initial model. The convergence condition may be set according to the requirements of the actual application scenario, for example: the loss value of the mask recovery loss function is less than a predetermined loss threshold, or may be up to a maximum number of iterative training times.
According to the embodiments of the present disclosure, by masking the target sample two-dimensional image input into the visual deformation network and contrastively learning the sample image features and the mask image features, the deep learning model can learn more comprehensive image features during training, which guarantees the information integrity of the image features. Identifying the mask image features then yields the three-dimensional reconstruction parameters, so that the mask image features extracted by the deep learning model match the three-dimensional reconstruction parameters, improving the accuracy of the three-dimensional reconstruction parameters and the robustness and comprehensiveness of the model.
In the related art, an unsupervised training method is generally used to train a ResNet (residual network) for three-dimensional image reconstruction. However, the three-dimensional reconstruction parameters involve a large number of dimensions, and small differences in the parameters of each dimension cause large differences in the accuracy of the three-dimensional reconstructed image. Therefore, a model obtained by training a ResNet network has lower reconstruction accuracy.
According to an embodiment of the present disclosure, performing local mask processing on the target sample two-dimensional image to obtain a mask image may include the following operations: dividing the target sample two-dimensional image according to a predetermined pixel size to obtain a plurality of sample pixel blocks; and randomly masking the plurality of sample pixel blocks according to a predetermined mask ratio to obtain a mask image.
According to the embodiments of the present disclosure, the operation of dividing the target sample two-dimensional image by the predetermined pixel size is the same as the operation of dividing the target two-dimensional image by the predetermined pixel size described above, and is not repeated here.
According to an embodiment of the present disclosure, the predetermined mask ratio may be set according to the needs of an actual application scenario, for example: the pixel size of the target two-dimensional image may be 256×256 pixels, and the predetermined pixel size may be 16×16 pixels, and dividing the target sample two-dimensional image according to the predetermined pixel size may result in 256 pixel blocks. The predetermined mask ratio may be 50%, and 50% of the 256 pixel blocks may be randomly selected to mask, and the resulting mask image may include 128 original pixel blocks and 128 masked pixel blocks.
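A minimal sketch of the random local masking with a learnable mask token is shown below; the function name, shapes, and the token-level masking strategy are illustrative assumptions.

```python
# Minimal sketch of randomly masking 50% of the sample pixel blocks with a
# learnable mask token before feeding them to the Vision Transformer.
import torch


def mask_patches(tokens, mask_token, mask_ratio=0.5):
    """tokens: (B, N, dim) patch embeddings; mask_token: (dim,) learnable."""
    B, N, _ = tokens.shape
    num_masked = int(N * mask_ratio)                    # e.g. 128 of 256 blocks
    scores = torch.rand(B, N, device=tokens.device)
    masked_idx = scores.argsort(dim=1)[:, :num_masked]  # random block positions
    mask = torch.zeros(B, N, dtype=torch.bool, device=tokens.device)
    mask.scatter_(1, masked_idx, True)
    # Replace masked positions with the learnable mask token.
    masked_tokens = torch.where(mask.unsqueeze(-1),
                                mask_token.expand_as(tokens), tokens)
    return masked_tokens, mask
```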
According to the embodiments of the present disclosure, a visual deformation network (Vision Transformer) is introduced; based on contrastive learning between the mask features and the sample image features at corresponding positions, mask features with high similarity to the sample image features are obtained, and feature recognition is performed on these mask features to obtain the target parameters. In this way, the visual deformation network can extract image features matched with the target parameters, and the reconstruction accuracy of the image is improved.
According to an embodiment of the present disclosure, performing the key point alignment operation on the sample two-dimensional image based on the predetermined image to obtain the target sample two-dimensional image may include the following operations: performing key point detection on the sample two-dimensional image to obtain a first sample key point group; performing key point detection on the predetermined image to obtain a second sample key point group; and performing the alignment operation on the sample two-dimensional image based on the correspondence between the key points in the first sample key point group and the second sample key point group to obtain the target sample two-dimensional image.
According to the embodiments of the present disclosure, performing the key point alignment operation on the sample two-dimensional image based on the predetermined image to obtain the target sample two-dimensional image is the same as performing the key point alignment operation on the image to be processed to obtain the target two-dimensional image in the image processing method described above. The sample two-dimensional image and the target sample two-dimensional image have the same scope of definition as the image to be processed and the target two-dimensional image, respectively, and are not described in detail here.
For example: obtaining sample position offset information between key points with corresponding relation from the first sample key point group and the second sample key point group based on the corresponding relation between the key points in the first sample key point group and the second sample key point group; and scaling the sample two-dimensional image according to the sample position offset information to obtain a target sample two-dimensional image.
For example: the first sample key point group can be key points of nose part and can include key points P of nose tip 2 Key points P on both sides of nose wing 1 And P 3 . The second sample keypoint group may comprise keypoints m at the tip of the nose at corresponding positions 1 Key points m on both sides of nose wing 1 And m 3
According to the embodiments of the present disclosure, different weights can be configured for key points at different positions. Since the nose shapes of different objects differ, the distance between the key points on the two sides of the nose wings varies more; therefore, when performing the key point alignment operation, a higher weight can be configured for the nose tip key point and lower weights for the key points on the two sides of the nose wings. The position offsets between the key points are weighted and summed based on these weights to obtain the final sample position offset information. The sample two-dimensional image is scaled based on the final sample position offset information so that the key points of the target sample two-dimensional image are aligned with the key points of the predetermined image.
According to the embodiments of the present disclosure, by aligning the sample two-dimensional image with the predetermined image, the position of each pixel block after segmentation of the sample two-dimensional image corresponds to the position of a visual deformation network parameter, so that the mask image and the sample image can be contrastively learned at corresponding positions under the condition of feature alignment.
According to an embodiment of the present disclosure, extracting mask image features of the mask image and sample image features of the target sample two-dimensional image may include the operations of: processing a plurality of sample pixel blocks before masking based on a self-attention mechanism to obtain sample image characteristics; and processing the masked plurality of sample pixel blocks based on a self-attention mechanism to obtain a mask image feature.
For example: the plurality of sample pixel blocks prior to masking may be input into a visual morphing network, outputting sample image features (f 1 ,f 2 ,…,f n ). The masked plurality of sample pixel blocks are input into a visual morphing network, and masked image features (F 1 ,F 2 ,…,F n ). The characteristics of each pixel block in the sample image characteristics correspond to the characteristics of each pixel block in the mask image characteristics, i.e. sample image characteristics F1 correspond to mask image characteristics F 1
According to an embodiment of the present disclosure, obtaining a loss value from a mask image feature, a sample two-dimensional image, and a sample three-dimensional image based on an objective loss function may include the operations of: based on the first loss function, obtaining a mask feature loss value according to mask image features and sample image features; and obtaining a reconstruction loss value according to the sample two-dimensional image and the sample three-dimensional image based on the second loss function.
For example: based on a similarity algorithm, a similarity value between the mask image feature and the sample image feature corresponding to each pixel block position can be obtained. For example, for the pixel block at position 1, the sample image feature f1 corresponds to the mask image feature F1. The first loss function may be constructed based on the similarity of the mask image features to the sample image features, for example in a form such as:

L1 = -(1/N) Σ_{i=1..N} S_cos(F_i, sg[f_i])

wherein sg[·] represents the stop-gradient (iteration termination) condition, S_cos represents the similarity, F_i represents the mask image feature at the i-th position on the mask image, and f_i represents the sample image feature at the i-th position on the target sample two-dimensional image.
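Under the assumption that the similarity is a cosine similarity and that sg[·] stops gradients to the sample image features, the first (mask feature) loss could be computed as in the sketch below; this is one reading of the description, not the patent's exact formula.

```python
# Possible form of the first (mask feature) loss: maximize the cosine
# similarity between mask image features and stop-gradient sample image
# features at the masked block positions.
import torch
import torch.nn.functional as F


def mask_feature_loss(mask_feats, sample_feats, mask):
    """mask_feats, sample_feats: (B, N, dim); mask: (B, N) bool, True = masked."""
    target = sample_feats.detach()                         # sg[.]: no gradient to targets
    sim = F.cosine_similarity(mask_feats, target, dim=-1)  # (B, N)
    # Average negative similarity over the masked block positions.
    return -(sim * mask).sum() / mask.sum().clamp(min=1)
```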
According to an embodiment of the present disclosure, obtaining a reconstruction loss value from a sample two-dimensional image and a sample three-dimensional image based on a second loss function may include the operations of: projecting the sample three-dimensional image to obtain a two-dimensional projection image corresponding to the sample three-dimensional image; and obtaining a reconstruction loss value according to the two-dimensional image and the two-dimensional projection image of the sample based on the second loss function.
According to an embodiment of the present disclosure, the second loss function may be an L1 loss function.
For example: the differential renderer can be utilized to project the sample three-dimensional image in a two-dimensional space, and a two-dimensional projection image corresponding to the sample three-dimensional image is obtained. And calculating a reconstruction loss value of the two-dimensional projection image and the sample two-dimensional image based on the L1 loss function.
According to the embodiment of the disclosure, based on the first loss function, the mask image features output by the visual deformation network can be constrained so that the similarity between the mask image features and the sample image features at corresponding positions is higher, and the visual deformation network can extract the image feature information from the whole. And based on the second loss function, the reconstruction data of the 3D deformation statistical network during three-dimensional reconstruction can be constrained, so that the difference between the three-dimensional reconstruction data and the texture map is smaller, and the accuracy of the deep learning model is improved.
When the 3D deformation statistical network (3DMM) is used, in order to further improve the precision of the three-dimensional reconstruction data, a key point loss can be calculated on the three-dimensional point cloud data before the three-dimensional point cloud data is combined with the texture map, so that the difference between the three-dimensional point cloud data and the texture map is reduced.
According to an embodiment of the present disclosure, the training method may further include the following operations: performing key point detection on the sample two-dimensional image to obtain a first key point feature; processing the sample two-dimensional image based on the target sample parameters to obtain three-dimensional point cloud data corresponding to the sample two-dimensional image; extracting second key point features from the three-dimensional point cloud data; and obtaining a key point feature loss value according to the first key point feature and the second key point feature based on the third loss function.
According to an embodiment of the present disclosure, the third loss function may be a similarity loss function based on the first keypoint feature and the second keypoint feature.
According to an embodiment of the present disclosure, the first key point feature may include the key points described previously for the key point alignment of the sample two-dimensional image, for example: the key point at the tip of the nose and the key points on both sides of the nose wings. In order to improve the accuracy of key point reconstruction in the three-dimensional point cloud data, the number of key points and the feature dimension of the first key point feature are larger than those used in the key point alignment operation. For example: the first key point feature may include features of a plurality of key points at a plurality of locations such as the eyes, nose, mouth, eyebrows, and ears.
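As an illustrative sketch of the third loss, assuming the second key point feature has been brought into the same coordinate space as the first (for example by projecting the selected point cloud vertices to two dimensions), and using a mean-squared distance as one possible similarity-based loss (the disclosure only states that a similarity loss is used):

import torch.nn.functional as F

def keypoint_feature_loss(first_kpts, second_kpts):
    # first_kpts: key points detected on the sample two-dimensional image,
    # second_kpts: key points extracted from the three-dimensional point cloud data,
    # both (batch, n_keypoints, 2) after projection to the image plane.
    return F.mse_loss(second_kpts, first_kpts)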
According to the embodiment of the disclosure, when the third loss function is calculated, the greater the number of key points in the key point features and the larger their feature dimensions, the higher the three-dimensional reconstruction accuracy of the model obtained through training.
According to the embodiment of the disclosure, before the three-dimensional point cloud data is combined with the texture map, key point loss calculation is performed on the three-dimensional point cloud data, so that the difference between the three-dimensional point cloud data and the texture map is reduced, and the reconstruction precision of the three-dimensional point cloud data is improved.
In the related art, when training a deep learning model applied to three-dimensional image reconstruction, the loss is generally calculated based on the output of the final model and the sample label, and the model parameters of the whole model are adjusted together; such a training mode has a long training period and low training efficiency.
Thus, in embodiments of the present disclosure, the deep learning model may be trained in a staged manner.
According to an embodiment of the present disclosure, adjusting the model parameters of the initial model based on the loss value to obtain the trained deep learning model may include the following operations: adjusting parameters of the feature extraction module based on the mask feature loss value to obtain a first intermediate model; performing, with the first intermediate model, the same operations on the sample two-dimensional image as the initial model performs, to obtain a key point feature loss value of the first intermediate model; adjusting parameters of the feature recognition module of the first intermediate model based on the key point feature loss value to obtain a second intermediate model; performing, with the second intermediate model, the same operations on the sample two-dimensional image as the initial model performs, to obtain a reconstruction loss value of the second intermediate model; and adjusting parameters of the feature reconstruction module of the second intermediate model based on the reconstruction loss value to obtain the trained deep learning model.
For example: in the first training stage, the parameters of the feature extraction module may be adjusted based only on the mask feature loss value; when the mask feature loss value falls within a predetermined threshold range, the first intermediate model is obtained.
Next, the sample two-dimensional image is input into the first intermediate model to obtain the mask image features, the sample image features, the three-dimensional point cloud data, the key point data of the sample two-dimensional image, and the output sample three-dimensional image. In this training stage, the parameters of the feature recognition module of the first intermediate model can be adjusted mainly based on the key point feature loss value, so as to improve the matching degree between the image features and the three-dimensional reconstruction parameters. When the key point feature loss value converges within a predetermined threshold range, the second intermediate model is obtained.
It should be noted that, in the process of adjusting the model parameters of the first intermediate model, the parameters of the feature extraction module may be fixed, or the parameters of the feature extraction module may be fine-tuned to accelerate the convergence speed of the key point loss value.
Then, the sample two-dimensional image is input into the second intermediate model. Through the previous training stages, the feature extraction module and the feature recognition module of the second intermediate model can output image features matched with the three-dimensional reconstruction parameters and can accurately recognize the three-dimensional reconstruction parameters. In this training stage, the feature reconstruction module may be adjusted based mainly on the reconstruction loss value. During the parameter adjustment, weakly supervised training can also be performed with the mask feature loss value and the key point feature loss value at the same time, so that the final reconstruction loss value converges to the predetermined threshold range and the trained deep learning model is obtained.
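A schematic sketch of this staged adjustment, in which each stage mainly optimizes one module against its corresponding loss until that loss converges (the optimizer, learning rate, and module names are assumptions, not the disclosed configuration):

import torch

def run_stage(module, data_loader, loss_fn, epochs=1, lr=1e-4):
    # Optimize only the given module's parameters against the stage's loss.
    optimizer = torch.optim.Adam(module.parameters(), lr=lr)
    for _ in range(epochs):
        for batch in data_loader:
            loss = loss_fn(batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: feature extraction module vs. mask feature loss      -> first intermediate model
# Stage 2: feature recognition module vs. key point feature loss -> second intermediate model
# Stage 3: feature reconstruction module vs. reconstruction loss -> trained deep learning model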
According to the embodiment of the disclosure, in this staged training mode, the parameters of a specific module are adjusted in a targeted manner so that the loss value corresponding to that module converges. When training proceeds to the next stage, the influence of errors in that module's output data on the input data of the next module can be reduced, thereby improving the convergence speed of the model.
Fig. 6 schematically illustrates a framework diagram of a training method of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 6, the target sample two-dimensional image 601 is subjected to pixel segmentation, resulting in an unmasked image 602, and the unmasked image 602 may include a plurality of unmasked pixel blocks 603. The target sample two-dimensional image 601 is pixel segmented and masked to obtain a mask image 603, and the mask image 603 may include a plurality of locally masked pixel blocks 607.
The plurality of unmasked pixel blocks 603 are input into a visual morphing network (ViT) 604a to obtain the sample image features 605. The plurality of locally masked pixel blocks 607 are input into a visual morphing network (ViT) 604b to obtain the mask image features 606. The parameters of the visual morphing network (ViT) 604b are adjusted based on the mask feature loss (Contrastive Loss) 614 so that the mask image features 606 become more similar to the sample image features 605 at the corresponding positions.
The mask image features 606 are recognized to obtain the image acquisition device parameters (R, t), the identity parameter (α_id), the expression parameter (α_exp), the texture parameter (α_tex), and the illumination parameter (γ) to be input into the 3DMM (3D deformation statistical network) 608. The image acquisition device parameters (R, t), the identity parameter (α_id), the expression parameter (α_exp), the texture parameter (α_tex), and the illumination parameter (γ) are input into the 3DMM (3D deformation statistical network) 608 to obtain the three-dimensional point cloud data 609 and the texture map 610. The reconstructed image 612 is then rendered with a differentiable renderer (Differentiable Renderer) 611 based on the illumination parameters to obtain a three-dimensional reconstructed image 613.
A photometric loss (Photometric Loss) 615 can be calculated as the reconstruction loss value based on the three-dimensional reconstructed image 613 and the initial sample two-dimensional image 601. A key point loss (Landmark Loss) 616 is calculated based on the point cloud features of the reconstructed image 612 and the key point features of the initial sample two-dimensional image. The parameters of the initial model are adjusted with the sum of the mask feature loss 614, the photometric loss 615, and the key point loss 616 as the target loss, to obtain the trained deep learning model.
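A minimal sketch of combining the three losses into the target loss, with the equal weighting being an assumption (the text only states that their sum is used):

def target_loss(mask_feature_loss_value, photometric_loss_value, landmark_loss_value,
                w_mask=1.0, w_photo=1.0, w_landmark=1.0):
    # Weighted sum of the mask feature loss 614, photometric loss 615 and key point loss 616.
    return (w_mask * mask_feature_loss_value
            + w_photo * photometric_loss_value
            + w_landmark * landmark_loss_value)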
Fig. 7 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in fig. 7, the image processing apparatus 700 may include a first alignment module 710, a first feature extraction module 720, a first feature recognition module 730, and a first feature reconstruction module 740.
The first alignment module 710 is configured to perform a key point alignment operation on an image to be processed based on a predetermined image, so as to obtain a target two-dimensional image.
A first feature extraction module 720, configured to extract image features of a target two-dimensional image, where the image features characterize the target features that match the reconstruction parameters used to reconstruct the image to be processed in three dimensions.
The first feature recognition module 730 is configured to recognize features of an image to obtain target parameters for three-dimensionally reconstructing an image to be processed.
The first feature reconstruction module 740 is configured to perform three-dimensional reconstruction on the target two-dimensional image based on the target parameter, so as to obtain a target three-dimensional image.
According to an embodiment of the present disclosure, the first alignment module includes: the device comprises a first detection sub-module, a second detection sub-module and a first alignment sub-module. The first detection sub-module is used for carrying out key point detection on the image to be processed to obtain a first key point group. And the second detection sub-module is used for carrying out key point detection on the preset image to obtain a second key point group. The first alignment sub-module is used for executing alignment operation on the image to be processed based on the corresponding relation between the key points in the first key point group and the second key point group, and obtaining the target two-dimensional image.
According to an embodiment of the present disclosure, the first alignment sub-module includes: a first position offset unit and a first scaling unit. And the first position offset unit is used for obtaining position offset information between the key points with the corresponding relation from the first key point group and the second key point group based on the corresponding relation between the key points in the first key point group and the second key point group. And the first scaling unit is used for scaling the image to be processed according to the position offset information to obtain a target two-dimensional image.
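As an illustrative sketch of how the position offset and scaling could be applied (the use of OpenCV's similarity-transform estimation is an assumption; the disclosure does not name a specific algorithm or library):

import numpy as np
import cv2

def align_to_template(image, keypoints, template_keypoints, out_size=(224, 224)):
    # Estimate a scale + rotation + translation that maps the detected key points
    # onto the predetermined image's key points, then warp (scale/offset) the image.
    matrix, _ = cv2.estimateAffinePartial2D(
        np.asarray(keypoints, dtype=np.float32),
        np.asarray(template_keypoints, dtype=np.float32),
    )
    return cv2.warpAffine(image, matrix, out_size)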
According to an embodiment of the present disclosure, the first feature extraction module includes: a first segmentation sub-module and a first attention sub-module. The first segmentation submodule is used for segmenting the target two-dimensional image according to the preset pixel size to obtain a plurality of pixel blocks. And the first attention sub-module is used for processing the pixel blocks based on a self-attention mechanism to obtain image characteristics.
According to an embodiment of the present disclosure, the first feature recognition module includes: the system comprises an expression parameter identification sub-module, a texture parameter identification sub-module, an illumination parameter identification sub-module, an acquisition equipment parameter identification sub-module and an identity parameter identification sub-module. And the expression parameter identification sub-module is used for carrying out expression parameter identification on the image characteristics to obtain target expression parameters. And the texture parameter identification sub-module is used for carrying out texture parameter identification on the image characteristics to obtain target texture parameters. And the illumination parameter identification sub-module is used for carrying out illumination parameter identification on the image characteristics to obtain target illumination parameters. And the acquisition equipment parameter identification sub-module is used for carrying out acquisition equipment parameter identification on the image characteristics to obtain target acquisition equipment parameters. And the identity parameter identification sub-module is used for carrying out identity parameter identification on the image characteristics to obtain target identity parameters.
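As an illustrative sketch of such a recognition module, one possible realization maps the pooled image features to each parameter group through separate heads; all dimensions and the pooling choice are assumptions:

import torch
import torch.nn as nn

class ParameterHeads(nn.Module):
    # Separate linear heads over the pooled image features, one per target parameter group.
    def __init__(self, dim=256, n_id=80, n_exp=64, n_tex=80):
        super().__init__()
        self.id_head = nn.Linear(dim, n_id)        # target identity parameters
        self.exp_head = nn.Linear(dim, n_exp)      # target expression parameters
        self.tex_head = nn.Linear(dim, n_tex)      # target texture parameters
        self.light_head = nn.Linear(dim, 27)       # target illumination parameters
        self.cam_head = nn.Linear(dim, 6)          # target acquisition device parameters (pose)

    def forward(self, image_features):
        # image_features: (batch, n_blocks, dim) -> average over pixel blocks.
        pooled = image_features.mean(dim=1)
        return {
            "identity": self.id_head(pooled),
            "expression": self.exp_head(pooled),
            "texture": self.tex_head(pooled),
            "illumination": self.light_head(pooled),
            "acquisition_device": self.cam_head(pooled),
        }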
According to an embodiment of the present disclosure, the first feature reconstruction module includes: the system comprises a texture image generation sub-module, a point cloud data generation sub-module and a rendering sub-module. And the texture image generation sub-module is used for processing the image to be processed based on the target texture parameters to obtain a target texture image. The point cloud data generation sub-module is used for processing the image to be processed based on the target acquisition equipment parameters, the target expression parameters and the target identity parameters to obtain target three-dimensional point cloud data. And the rendering sub-module is used for rendering the target three-dimensional point cloud data based on the target illumination parameters and the target texture image to obtain a target three-dimensional image.
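As an illustrative sketch of how the point cloud data generation could work in a 3DMM-style model, where the vertices are a linear combination of a mean shape with identity and expression bases and are then posed with the acquisition device parameters (the bases, their shapes, and the linear formulation are assumptions drawn from common 3D morphable models, not from the disclosure itself):

import numpy as np

def reconstruct_point_cloud(mean_shape, id_basis, exp_basis, alpha_id, alpha_exp, R, t):
    # mean_shape: (n_vertices * 3,), id_basis: (n_vertices * 3, n_id),
    # exp_basis: (n_vertices * 3, n_exp); R: (3, 3) rotation, t: (3,) translation.
    shape = mean_shape + id_basis @ alpha_id + exp_basis @ alpha_exp
    vertices = shape.reshape(-1, 3)
    return vertices @ R.T + t   # target three-dimensional point cloud data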
Fig. 8 schematically illustrates a block diagram of a training apparatus of a deep learning model according to an embodiment of the present disclosure.
As shown in fig. 8, the training device 800 may include: a second alignment module 810, a masking module 820, a second feature extraction module 830, a second feature identification module 840, a second feature reconstruction module 850, a loss calculation module 860, and an adjustment module 870.
The second alignment module 810 is configured to perform a key point alignment operation on the sample two-dimensional image based on the predetermined image, to obtain a target sample two-dimensional image.
And the masking module 820 is used for performing local masking processing on the target sample two-dimensional image to obtain a masking image.
The second feature extraction module 830 is configured to extract mask image features of the mask image and sample image features of the target sample two-dimensional image.
The second feature recognition module 840 is configured to recognize the mask image feature, and obtain a target sample parameter for three-dimensionally reconstructing a two-dimensional image of the sample.
The second feature reconstruction module 850 is configured to perform three-dimensional reconstruction on the two-dimensional image of the target sample based on the target sample parameter, so as to obtain a three-dimensional image of the sample corresponding to the two-dimensional image of the sample.
The loss calculation module 860 is configured to obtain a loss value based on the target loss function according to the mask image feature, the sample two-dimensional image, and the sample three-dimensional image.
An adjustment module 870 is configured to adjust model parameters of the initial model based on the loss value to obtain a trained deep learning model.
According to an embodiment of the present disclosure, the loss calculation module includes: a mask feature loss calculation sub-module and a reconstruction loss calculation sub-module. And the mask feature loss calculation sub-module is used for obtaining a mask feature loss value according to the mask image feature and the sample image feature based on the first loss function. And the reconstruction loss calculation sub-module is used for obtaining a reconstruction loss value according to the two-dimensional sample image and the three-dimensional sample image based on the second loss function.
According to an embodiment of the present disclosure, a reconstruction loss calculation sub-module includes: a projection unit and a reconstruction loss calculation unit. And the projection unit is used for projecting the sample three-dimensional image to obtain a two-dimensional projection image corresponding to the sample three-dimensional image. And the reconstruction loss calculation unit is used for obtaining a reconstruction loss value according to the sample two-dimensional image and the two-dimensional projection image based on the second loss function.
According to an embodiment of the present disclosure, the loss calculation module further includes: the system comprises a key point detection sub-module, a three-dimensional point cloud data generation sub-module, a key point feature extraction sub-module and a key point feature loss calculation module. And the key point detection sub-module is used for carrying out key point detection on the sample two-dimensional image to obtain a first key point characteristic. And the three-dimensional point cloud data generation sub-module is used for processing the two-dimensional sample image based on the target sample parameters to obtain three-dimensional point cloud data corresponding to the two-dimensional sample image. And the key point feature extraction sub-module is used for extracting a second key point feature from the three-dimensional point cloud data. And the key point feature loss calculation module is used for obtaining a key point feature loss value according to the first key point feature and the second key point feature based on the third loss function.
According to an embodiment of the present disclosure, the adjustment module includes: the system comprises a first parameter adjustment sub-module, a first training sub-module, a second parameter adjustment sub-module, a second training sub-module and a third parameter adjustment sub-module. And the first parameter adjustment sub-module is used for adjusting parameters of the feature extraction module based on the mask feature loss value to obtain a first intermediate model. And the first training submodule is used for executing the same operation on the sample two-dimensional image by using the first intermediate model as that of the initial model, so as to obtain the key point characteristic loss value of the first intermediate model. And the second parameter adjustment sub-module is used for adjusting parameters of the feature identification module of the first intermediate model based on the key point feature loss value to obtain a second intermediate model. And the second training submodule is used for executing the same operation on the sample two-dimensional image as that of the initial model on the sample two-dimensional image by using the second intermediate model to obtain a reconstruction loss value of the second intermediate model. And the third parameter adjustment sub-module is used for adjusting parameters of the characteristic reconstruction module of the second intermediate model based on the reconstruction loss value to obtain a trained deep learning model.
According to an embodiment of the present disclosure, a mask module includes: a second segmentation sub-module and a masking sub-module. And the second segmentation submodule is used for segmenting the target two-dimensional image according to the preset pixel size to obtain a plurality of sample pixel blocks. And the masking submodule is used for masking the plurality of sample pixel blocks randomly according to a preset masking proportion to obtain a masking image.
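As an illustrative sketch of the random masking at a predetermined mask ratio (the ratio value and the zero-filling of masked blocks are assumptions):

import torch

def randomly_mask_blocks(blocks, mask_ratio=0.5):
    # blocks: (batch, n_blocks, block_dim) sample pixel blocks.
    n_blocks = blocks.shape[1]
    n_masked = int(n_blocks * mask_ratio)
    masked = blocks.clone()
    for b in range(blocks.shape[0]):
        idx = torch.randperm(n_blocks)[:n_masked]   # randomly chosen block positions
        masked[b, idx] = 0.0                        # locally masked pixel blocks
    return masked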
According to an embodiment of the present disclosure, the second alignment module includes: the third detection sub-module, the fourth detection sub-module and the second alignment sub-module. And the third detection submodule is used for carrying out key point detection on the two-dimensional image of the sample to obtain a first sample key point group. And the fourth detection sub-module is used for carrying out key point detection on the sample preset image to obtain a second sample key point group. And the second alignment sub-module is used for executing alignment operation on the sample two-dimensional image based on the corresponding relation between the key points in the first sample key point group and the second sample key point group to obtain the target sample two-dimensional image.
According to an embodiment of the present disclosure, the second alignment sub-module includes: a second position offset unit and a second scaling unit. And a second position offset unit configured to obtain sample position offset information between keypoints having a corresponding relationship from the first sample keypoint group and the second sample keypoint group based on the correspondence relationship between the keypoints in the first sample keypoint group and the second sample keypoint group. And the second scaling unit is used for scaling the sample two-dimensional image according to the sample position offset information to obtain a target sample two-dimensional image.
According to an embodiment of the present disclosure, the second feature extraction module includes: a second attention sub-module and a third attention sub-module. And the second attention submodule is used for processing a plurality of sample pixel blocks before masking based on a self-attention mechanism to obtain sample image characteristics. And the third attention submodule is used for processing the masked sample pixel blocks based on the self-attention mechanism to obtain mask image features.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
According to an embodiment of the present disclosure, an electronic device includes: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor, the instructions being executable by the at least one processor to enable the at least one processor to perform the method as described above.
According to an embodiment of the present disclosure, a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the method as described above.
According to an embodiment of the present disclosure, a computer program product comprising a computer program which, when executed by a processor, implements a method as described above.
Fig. 9 shows a schematic block diagram of an example electronic device 900 that may be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 9, the apparatus 900 includes a computing unit 901 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 902 or a computer program loaded from a storage unit 908 into a Random Access Memory (RAM) 903. In the RAM 903, various programs and data required for the operation of the device 900 can also be stored. The computing unit 901, the ROM 902, and the RAM 903 are connected to each other by a bus 904. An input/output (I/O) interface 905 is also connected to the bus 904.
Various components in device 900 are connected to I/O interface 905, including: an input unit 906 such as a keyboard, a mouse, or the like; an output unit 907 such as various types of displays, speakers, and the like; a storage unit 908 such as a magnetic disk, an optical disk, or the like; and a communication unit 909 such as a network card, modem, wireless communication transceiver, or the like. The communication unit 909 allows the device 900 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunications networks.
The computing unit 901 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 901 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 901 performs the respective methods and processes described above, for example, an image processing method or a training method of a deep learning model. For example, in some embodiments, the image processing method or training method of the deep learning model may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 908. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 900 via the ROM 902 and/or the communication unit 909. When the computer program is loaded into the RAM 903 and executed by the computing unit 901, one or more steps of the image processing method or the training method of the deep learning model described above may be performed. Alternatively, in other embodiments, the computing unit 901 may be configured to perform the image processing method or the training method of the deep learning model in any other suitable way (e.g. by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems On Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs, the one or more computer programs may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general-purpose programmable processor, that may receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), wide Area Networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel or sequentially or in a different order, provided that the desired results of the technical solutions of the present disclosure are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.

Claims (33)

1. An image processing method, comprising:
performing key point alignment operation on an image to be processed based on a preset image to obtain a target two-dimensional image;
extracting image features of the target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters for three-dimensionally reconstructing the image to be processed;
identifying the image characteristics to obtain target parameters for three-dimensionally reconstructing the image to be processed; and
And carrying out three-dimensional reconstruction on the target two-dimensional image based on the target parameters to obtain a target three-dimensional image.
2. The method of claim 1, wherein the performing a keypoint alignment operation on the image to be processed based on the predetermined image results in the target two-dimensional image, comprising:
performing key point detection on the image to be processed to obtain a first key point group;
performing key point detection on the preset image to obtain a second key point group; and
and executing alignment operation on the image to be processed based on the corresponding relation between the key points in the first key point group and the second key point group to obtain a target two-dimensional image.
3. The method according to claim 2, wherein the performing an alignment operation on the image to be processed based on the correspondence between the keypoints in the first keypoint group and the second keypoint group to obtain a target two-dimensional image includes:
obtaining position offset information between key points with corresponding relation from the first key point group and the second key point group based on the corresponding relation between the key points in the first key point group and the second key point group; and
And scaling the image to be processed according to the position offset information to obtain the target two-dimensional image.
4. The method of claim 1, wherein the extracting image features of the target two-dimensional image comprises:
dividing the target two-dimensional image according to a preset pixel size to obtain a plurality of pixel blocks; and
and processing the pixel blocks based on a self-attention mechanism to obtain the image characteristics.
5. The method of claim 1, wherein the identifying the image features to obtain target parameters for three-dimensional reconstruction of the image to be processed comprises:
carrying out expression parameter identification on the image characteristics to obtain target expression parameters;
carrying out texture parameter identification on the image characteristics to obtain target texture parameters;
carrying out illumination parameter identification on the image characteristics to obtain target illumination parameters;
identifying the acquisition equipment parameters of the image features to obtain target acquisition equipment parameters; and
and carrying out identity parameter identification on the image characteristics to obtain target identity parameters.
6. The method of claim 1, wherein the target parameters include a target expression parameter, a target texture parameter, a target illumination parameter, a target acquisition device parameter, and a target identity parameter; the three-dimensional reconstruction is carried out on the image to be processed based on the target parameters to obtain a target three-dimensional image, which comprises the following steps:
Processing the image to be processed based on the target texture parameters to obtain a target texture image;
processing the image to be processed based on the target acquisition equipment parameters, the target expression parameters and the target identity parameters to obtain target three-dimensional point cloud data; and
and rendering the target three-dimensional point cloud data based on the target illumination parameters and the target texture image to obtain the target three-dimensional image.
7. A training method of a deep learning model, comprising:
performing key point alignment operation on the sample two-dimensional image based on the preset image to obtain a target sample two-dimensional image;
carrying out local mask processing on the target sample two-dimensional image to obtain a mask image;
extracting mask image features of the mask image and sample image features of the target sample two-dimensional image;
identifying the mask image features to obtain target sample parameters for three-dimensionally reconstructing the sample two-dimensional image;
based on the target sample parameters, carrying out three-dimensional reconstruction on the target sample two-dimensional image to obtain a sample three-dimensional image corresponding to the sample two-dimensional image;
obtaining a loss value according to the mask image characteristics, the sample two-dimensional image and the sample three-dimensional image based on a target loss function; and
And adjusting model parameters of the initial model based on the loss value to obtain a trained deep learning model.
8. The method of claim 7, wherein the deriving a loss value based on the mask image feature, the sample two-dimensional image, and the sample three-dimensional image based on an objective loss function comprises:
based on a first loss function, obtaining a mask feature loss value according to the mask image feature and the sample image feature; and
and obtaining a reconstruction loss value according to the sample two-dimensional image and the sample three-dimensional image based on a second loss function.
9. The method of claim 8, wherein the obtaining a reconstructed loss value from the sample two-dimensional image and the sample three-dimensional image based on the second loss function comprises:
projecting the sample three-dimensional image to obtain a two-dimensional projection image corresponding to the sample three-dimensional image; and
and obtaining the reconstruction loss value according to the sample two-dimensional image and the two-dimensional projection image based on the second loss function.
10. The method of claim 8 or 9, further comprising:
Performing key point detection on the sample two-dimensional image to obtain a first key point feature;
processing the sample two-dimensional image based on the target sample parameters to obtain three-dimensional point cloud data corresponding to the sample two-dimensional image;
extracting a second key point feature from the three-dimensional point cloud data; and
and obtaining a key point feature loss value according to the first key point feature and the second key point feature based on a third loss function.
11. The method of claim 7, wherein the initial model comprises a feature extraction module, a feature identification module, and a feature reconstruction module; the loss values comprise mask feature loss values, reconstruction loss values and key point feature loss values;
the step of adjusting model parameters of the initial model based on the loss value to obtain a trained deep learning model comprises the following steps:
based on the mask feature loss value, adjusting parameters of the feature extraction module to obtain a first intermediate model;
performing the same operation on the sample two-dimensional image by using a first intermediate model as that performed on the sample two-dimensional image by using the initial model, so as to obtain a key point characteristic loss value of the first intermediate model;
Based on the key point feature loss value, adjusting parameters of a feature recognition module of the first intermediate model to obtain a second intermediate model;
performing the same operation on the sample two-dimensional image by using the second intermediate model as that performed on the sample two-dimensional image by using the initial model, so as to obtain a reconstruction loss value of the second intermediate model; and
and adjusting parameters of a characteristic reconstruction module of the second intermediate model based on the reconstruction loss value to obtain the trained deep learning model.
12. The method of claim 7, wherein the performing local masking processing on the two-dimensional image of the target sample to obtain a masked image comprises:
dividing the target two-dimensional image according to a preset pixel size to obtain a plurality of sample pixel blocks; and
and masking the plurality of sample pixel blocks randomly according to a preset masking proportion to obtain the masking image.
13. The method of claim 7, wherein performing a keypoint alignment operation on the sample two-dimensional image based on the predetermined image results in the target sample two-dimensional image, comprising:
performing key point detection on the sample two-dimensional image to obtain a first sample key point group;
Performing key point detection on the sample preset image to obtain a second sample key point group; and
and performing alignment operation on the sample two-dimensional image based on the correspondence between the key points in the first sample key point group and the second sample key point group to obtain a target sample two-dimensional image.
14. The method of claim 13, wherein the performing an alignment operation on the sample two-dimensional image based on correspondence between keypoints in the first sample keypoint group and the second sample keypoint group to obtain a target sample two-dimensional image comprises:
obtaining sample position offset information between key points with corresponding relation from the first sample key point group and the second sample key point group based on the corresponding relation between the key points in the first sample key point group and the second sample key point group; and
and scaling the sample two-dimensional image according to the sample position offset information to obtain the target sample two-dimensional image.
15. The method of claim 7, wherein the extracting mask image features of the mask image and sample image features of the target sample two-dimensional image comprises:
Processing a plurality of sample pixel blocks before masking based on a self-attention mechanism to obtain the sample image characteristics; and
and processing the masked sample pixel blocks based on a self-attention mechanism to obtain the mask image characteristics.
16. An image processing apparatus comprising:
the first alignment module is used for executing key point alignment operation on the image to be processed based on the preset image to obtain a target two-dimensional image;
the first feature extraction module is used for extracting image features of the target two-dimensional image, wherein the image features represent target features matched with reconstruction parameters for three-dimensionally reconstructing the image to be processed;
the first feature recognition module is used for recognizing the image features to obtain target parameters for three-dimensionally reconstructing the image to be processed; and
and the first characteristic reconstruction module is used for carrying out three-dimensional reconstruction on the target two-dimensional image based on the target parameters to obtain a target three-dimensional image.
17. The apparatus of claim 16, wherein the first alignment module comprises:
the first detection submodule is used for carrying out key point detection on the image to be processed to obtain a first key point group;
The second detection sub-module is used for carrying out key point detection on the preset image to obtain a second key point group; and
and the first alignment sub-module is used for executing alignment operation on the image to be processed based on the corresponding relation between the key points in the first key point group and the second key point group to obtain a target two-dimensional image.
18. The apparatus of claim 17, wherein the first alignment sub-module comprises:
a first position offset unit, configured to obtain position offset information between key points having a corresponding relationship from a first key point group and a second key point group based on a corresponding relationship between key points in the first key point group and the second key point group; and
and the first scaling unit is used for scaling the image to be processed according to the position offset information to obtain the target two-dimensional image.
19. The apparatus of claim 16, wherein the first feature extraction module comprises:
the first segmentation submodule is used for segmenting the target two-dimensional image according to a preset pixel size to obtain a plurality of pixel blocks; and
and the first attention sub-module is used for processing the pixel blocks based on a self-attention mechanism to obtain the image characteristics.
20. The apparatus of claim 16, wherein the first feature identification module comprises:
the expression parameter identification sub-module is used for carrying out expression parameter identification on the image characteristics to obtain target expression parameters;
the texture parameter identification sub-module is used for carrying out texture parameter identification on the image characteristics to obtain target texture parameters;
the illumination parameter identification sub-module is used for carrying out illumination parameter identification on the image characteristics to obtain target illumination parameters;
the acquisition equipment parameter identification sub-module is used for carrying out acquisition equipment parameter identification on the image characteristics to obtain target acquisition equipment parameters; and
and the identity parameter identification sub-module is used for carrying out identity parameter identification on the image characteristics to obtain target identity parameters.
21. The apparatus of claim 16, wherein the target parameters comprise a target expression parameter, a target texture parameter, a target illumination parameter, a target acquisition device parameter, and a target identity parameter; the first feature reconstruction module includes:
the texture image generation sub-module is used for processing the image to be processed based on the target texture parameters to obtain a target texture image;
The point cloud data generation sub-module is used for processing the image to be processed based on the target acquisition equipment parameters, the target expression parameters and the target identity parameters to obtain target three-dimensional point cloud data; and
and the rendering sub-module is used for rendering the target three-dimensional point cloud data based on the target illumination parameters and the target texture image to obtain the target three-dimensional image.
22. A training device for a deep learning model, comprising:
the second alignment module is used for executing key point alignment operation on the sample two-dimensional image based on the preset image to obtain a target sample two-dimensional image;
the mask module is used for carrying out local mask processing on the target sample two-dimensional image to obtain a mask image;
the second feature extraction module is used for extracting mask image features of the mask image and sample image features of the target sample two-dimensional image;
the second feature recognition module is used for recognizing the mask image features to obtain target sample parameters for three-dimensionally reconstructing the sample two-dimensional image;
the second characteristic reconstruction module is used for carrying out three-dimensional reconstruction on the two-dimensional image of the target sample based on the target sample parameters to obtain a sample three-dimensional image corresponding to the two-dimensional image of the sample;
The loss calculation module is used for obtaining a loss value according to the mask image characteristics, the sample two-dimensional image and the sample three-dimensional image based on a target loss function; and
and the adjusting module is used for adjusting the model parameters of the initial model based on the loss value to obtain a trained deep learning model.
23. The apparatus of claim 22, wherein the loss calculation module comprises:
a mask feature loss calculation sub-module, configured to obtain a mask feature loss value according to the mask image feature and the sample image feature based on a first loss function; and
and the reconstruction loss calculation sub-module is used for obtaining a reconstruction loss value according to the sample two-dimensional image and the sample three-dimensional image based on a second loss function.
24. The apparatus of claim 23, wherein the reconstruction loss calculation sub-module comprises:
the projection unit is used for projecting the sample three-dimensional image to obtain a two-dimensional projection image corresponding to the sample three-dimensional image; and
and the reconstruction loss calculation unit is used for obtaining the reconstruction loss value according to the two-dimensional image of the sample and the two-dimensional projection image based on the second loss function.
25. The apparatus of claim 23 or 24, wherein the loss calculation module further comprises:
the key point detection sub-module is used for carrying out key point detection on the sample two-dimensional image to obtain a first key point characteristic;
the three-dimensional point cloud data generation sub-module is used for processing the sample two-dimensional image based on the target sample parameters to obtain three-dimensional point cloud data corresponding to the sample two-dimensional image;
the key point feature extraction submodule is used for extracting a second key point feature from the three-dimensional point cloud data; and
and the key point feature loss calculation module is used for obtaining a key point feature loss value according to the first key point feature and the second key point feature based on a third loss function.
26. The apparatus of claim 25, wherein the initial model comprises a feature extraction module, a feature identification module, and a feature reconstruction module; the loss values comprise mask feature loss values, reconstruction loss values and key point feature loss values; the adjustment module includes:
the first parameter adjustment submodule is used for adjusting parameters of the feature extraction module based on the mask feature loss value to obtain a first intermediate model;
The first training submodule is used for executing the same operation on the sample two-dimensional image as that of the initial model on the sample two-dimensional image by using a first intermediate model to obtain a key point characteristic loss value of the first intermediate model;
the second parameter adjustment sub-module is used for adjusting parameters of the feature identification module of the first intermediate model based on the key point feature loss value to obtain a second intermediate model;
the second training submodule is used for executing the same operation on the sample two-dimensional image as that of the initial model on the sample two-dimensional image by utilizing the second intermediate model to obtain a reconstruction loss value of the second intermediate model; and
and the third parameter adjustment sub-module is used for adjusting parameters of the characteristic reconstruction module of the second intermediate model based on the reconstruction loss value to obtain the trained deep learning model.
27. The apparatus of claim 22, wherein the masking module comprises:
the second segmentation submodule is used for segmenting the target two-dimensional image according to a preset pixel size to obtain a plurality of sample pixel blocks; and
and the masking submodule is used for masking the plurality of sample pixel blocks randomly according to a preset masking proportion to obtain the masking image.
28. The apparatus of claim 22, wherein the second alignment module comprises:
the third detection submodule is used for carrying out key point detection on the sample two-dimensional image to obtain a first sample key point group;
the fourth detection submodule is used for carrying out key point detection on the sample preset image to obtain a second sample key point group; and
and the second alignment sub-module is used for executing alignment operation on the sample two-dimensional image based on the corresponding relation between the key points in the first sample key point group and the second sample key point group to obtain a target sample two-dimensional image.
29. The apparatus of claim 28, wherein the second alignment sub-module comprises:
a second position offset unit configured to obtain sample position offset information between keypoints having a corresponding relationship from a first sample keypoint group and a second sample keypoint group based on a corresponding relationship between keypoints in the first sample keypoint group and the second sample keypoint group; and
and the second scaling unit is used for scaling the sample two-dimensional image according to the sample position offset information to obtain the target sample two-dimensional image.
30. The apparatus of claim 22, wherein the second feature extraction module comprises:
a second attention sub-module, configured to process, based on a self-attention mechanism, a plurality of sample pixel blocks before masking, to obtain the sample image feature; and
and the third attention submodule is used for processing the masked sample pixel blocks based on a self-attention mechanism to obtain the mask image characteristics.
31. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-15.
32. A non-transitory computer readable storage medium storing computer instructions for causing the computer to perform the method of any one of claims 1-15.
33. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any of claims 1-15.