CN109754416B - Image processing apparatus and method - Google Patents

Publication number: CN109754416B (grant); other version: CN109754416A
Application number: CN201711070256.XA
Authority: CN (China)
Legal status: Active
Prior art keywords: depth, classifier, converter, image, term
Inventors: 田虎, 李斐
Original and current assignee: Fujitsu Ltd
Priority and filing date: 2017-11-03
Publication dates: CN109754416A on 2019-05-14; CN109754416B (grant) on 2023-08-04
Other languages: Chinese (zh)
Landscapes: Image Analysis
Abstract

The present disclosure relates to an image processing apparatus and method. The image processing apparatus includes a converter that converts an input image into a depth image to obtain a converted depth for each pixel of the input image, and a classifier that classifies between the converted depth and the true depth from a depth dataset, wherein the classifier and the converter are trained until the classifier cannot distinguish the converted depth from the true depth. Using the image processing apparatus and method according to the present disclosure, depth can be learned from a single image through adversarial training, which learns not only the depth of the single image through the converter but also the high-order consistency of depth through the classifier. Through such adversarial training, the converter can output a depth map whose distribution is similar to that of the true depth map.

Description

Image processing apparatus and method
Technical Field
The present disclosure relates to the field of image processing, and in particular to single-image depth estimation based on adversarial learning.
Background
This section provides background information related to the present disclosure, which is not necessarily prior art.
With the rapid development of modern cameras, a large number of high-resolution images can now be acquired. Recovering the three-dimensional (3D) structure of a scene or object from these images is of great significance for many computer applications such as entertainment, augmented reality, heritage site preservation, robotics, and the like. Depth estimation from multiple viewpoints is a key step in image-based 3D modeling and has been widely studied.
In the prior art, methods for estimating depth from a single image mainly use monocular depth cues, such as focus and occlusion. However, these methods share a limitation: the algorithm fails when such depth cues are absent from the scene. Recently, many researchers have learned depth from a single image using deep-learning methods such as convolutional neural networks (CNNs) and linear regression models. However, these methods typically ignore the relationship between adjacent depths on the depth map, which is important for generating high-quality depth maps. While some graph-model-based approaches can enhance the local smoothness of the depth map to some extent, they have difficulty modeling the high-order consistency of depth.
In the present disclosure, a method using adversarial learning is proposed that is capable of learning depth from a single image. Unlike prior-art methods, the method according to the present disclosure learns not only the depth of a single image through a converter but also the high-order consistency of depth through a classifier. Through such adversarial learning, the converter can output a depth map whose distribution closely matches that of the true depth map.
Disclosure of Invention
This section provides a general summary of the disclosure, and is not a comprehensive disclosure of its full scope or all of its features.
An object of the present disclosure is to provide an image processing apparatus and an image processing method that can learn depth from a single image through adversarial training, which learns not only the depth of the single image through a converter but also the high-order consistency of depth through a classifier. Through such adversarial training, the converter can output a depth map whose distribution is similar to that of the true depth map.
According to an aspect of the present disclosure, there is provided an image processing apparatus including: a converter that converts an input image into a depth image to obtain a converted depth for each pixel of the input image; and a classifier that classifies between the converted depth and the true depth from a depth dataset, wherein the classifier and the converter are trained until the classifier cannot distinguish the converted depth from the true depth.
According to another aspect of the present disclosure, there is provided an image processing method including: converting, by a converter, an input image into a depth image to obtain a converted depth for each pixel of the input image; and classifying, by a classifier, between the converted depth and a true depth from a depth dataset, wherein the classifier and the converter are trained until the classifier cannot distinguish the converted depth from the true depth.
According to another aspect of the present disclosure, there is provided a program product comprising machine readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform an image processing method according to the present disclosure.
According to another aspect of the present disclosure, there is provided a machine-readable storage medium having embodied thereon a program product according to the present disclosure.
Using the image processing apparatus and method according to the present disclosure, depth can be learned from a single image through adversarial training, which learns not only the depth of the single image through the converter but also the high-order consistency of depth through the classifier. Through such adversarial training, the converter can output a depth map whose distribution is similar to that of the true depth map.
Further areas of applicability will become apparent from the description provided herein. The description and specific examples in this summary are intended for purposes of illustration only and are not intended to limit the scope of the present disclosure.
Drawings
The drawings described herein are for illustration purposes only of selected embodiments and not all possible implementations, and are not intended to limit the scope of the present disclosure. In the drawings:
fig. 1 is a block diagram of an image processing apparatus according to an embodiment of the present disclosure;
fig. 2 is a diagram showing a structural example of an image processing apparatus according to an embodiment of the present disclosure;
FIG. 3 is a flow chart of an image processing method according to an embodiment of the present disclosure; and
fig. 4 is a block diagram of an exemplary structure of a general-purpose personal computer in which an image processing apparatus and method according to an embodiment of the present disclosure may be implemented.
While the disclosure is susceptible to various modifications and alternative forms, specific embodiments thereof have been shown by way of example in the drawings and are herein described in detail. It should be understood, however, that the description herein of specific embodiments is not intended to limit the disclosure to the particular forms disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the disclosure. It is noted that corresponding reference numerals indicate corresponding parts throughout the several views of the drawings.
Detailed Description
Examples of the present disclosure will now be described more fully with reference to the accompanying drawings. The following description is merely exemplary in nature and is not intended to limit the present disclosure, application, or uses.
Example embodiments are provided so that this disclosure will be thorough and will fully convey its scope to those skilled in the art. Numerous specific details, such as examples of specific components, devices, and methods, are set forth to provide a thorough understanding of embodiments of the present disclosure. It will be apparent to those skilled in the art that the example embodiments may be embodied in many different forms without these specific details, and none of these details should be construed to limit the scope of the disclosure. In certain example embodiments, well-known processes, structures, and techniques are not described in detail.
In order to better understand the technical solutions of the present disclosure, the following describes in more detail an image processing apparatus and method of the present disclosure.
An object of the present disclosure is to provide an image processing apparatus and an image processing method that can learn depth from a single image through adversarial training, which learns not only the depth of the single image through a converter but also the high-order consistency of depth through a classifier. Through such adversarial training, the converter can output a depth map whose distribution is similar to that of the true depth map.
Fig. 1 shows a block diagram of an image processing apparatus 100 according to an embodiment of the present disclosure. As shown in fig. 1, an image processing apparatus 100 according to an embodiment of the present disclosure may include a converter 101 and a classifier 102.
The converter 101 may convert the input image into a depth image to obtain a converted depth for each pixel of the input image, and the classifier 102 may then classify between the converted depth and the true depth from the depth dataset. The classifier 102 and the converter 101 may be trained until the classifier 102 is unable to distinguish the converted depth from the true depth.
Specifically, as shown in fig. 2, the converter 101 may convert an input image, such as a single color image or a single grayscale image, into a depth image close to the real depth, and confuse the classifier 102 so that the classifier 102 cannot tell whether such a depth image originates from the converter or from a real depth image in the database. The classifier 102 then classifies between the depth from the converter 101 and the true depth from the depth database. The competition between the two forms an adversarial game.
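By way of illustration only, the sketch below pairs a small convolutional converter with a binary classifier in PyTorch. The class names, layer sizes, and architecture are assumptions made for illustration, not the architecture specified by the patent; the disclosure only requires that the converter map an image to a per-pixel depth map and that the classifier output a probability between 0 and 1.

```python
# A minimal sketch, not the patent's specified architecture: a converter T
# mapping an RGB image to a per-pixel depth map, and a classifier C mapping
# a depth map to a probability in (0, 1). Layer sizes are illustrative.
import torch
import torch.nn as nn

class SimpleConverter(nn.Module):
    """T: (B, 3, H, W) image -> (B, 1, H, W) depth image."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 1, 3, padding=1),   # one converted depth per pixel
        )

    def forward(self, image):
        return self.net(image)

class SimpleClassifier(nn.Module):
    """C: (B, 1, H, W) depth image -> (B, 1) probability of being real."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),   # ~1: real depth, ~0: converted
        )

    def forward(self, depth):
        return self.net(depth)
```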
The classifier 102 and the converter 101 may then be trained until it is difficult for the classifier 102 to distinguish a converted depth image from a real depth image. According to one embodiment of the present disclosure, the classifier 102 and the converter 101 may be trained together. According to another embodiment of the present disclosure, the converter 101 may be trained until the difference between the converted depth for each pixel and its true depth is less than a predetermined threshold. Here, it should be apparent to those skilled in the art that the predetermined threshold may be set according to practical experience or actual needs.
According to one embodiment of the present disclosure, the converter 101 may convert based on the input image itself or on features extracted from the input image. Here, it should be apparent to those skilled in the art that the extracted features may be any image features that can represent the original image, such as the gray level, color, or gradient of the image.
According to one embodiment of the present disclosure, the classifier 102 may classify an input depth image or features extracted from the depth image. Here, it should be clear to those skilled in the art that the extracted features may be, for example, brightness features, texture features, and the like of the grayscale image.
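As a purely illustrative example of such hand-crafted features, the sketch below stacks the gray level with horizontal and vertical gradients. The function name and the particular feature choice are assumptions; the disclosure permits any feature that represents the original image.

```python
# Illustrative hand-crafted features: gray level plus horizontal and vertical
# gradients. This particular choice is an assumption for illustration only.
import torch

def simple_features(image: torch.Tensor) -> torch.Tensor:
    """image: (B, 3, H, W) RGB in [0, 1] -> (B, 3, H, W) feature stack."""
    gray = image.mean(dim=1, keepdim=True)                     # gray level
    gx = torch.zeros_like(gray)
    gy = torch.zeros_like(gray)
    gx[:, :, :, 1:] = gray[:, :, :, 1:] - gray[:, :, :, :-1]   # horizontal gradient
    gy[:, :, 1:, :] = gray[:, :, 1:, :] - gray[:, :, :-1, :]   # vertical gradient
    return torch.cat([gray, gx, gy], dim=1)
```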
According to one embodiment of the present disclosure, the classifier 102 may output a probability between 0 and 1. The closer the probability is to 1, the more likely the input depth image is derived from a real depth image; conversely, the closer the probability is to 0, the more likely the input depth image originates from a depth image output by the converter. Here, it should be apparent to those skilled in the art that a threshold may be set according to practical experience or actual needs: when the probability is greater than the threshold, the depth image may be represented by label 1; conversely, when the probability is less than the threshold, it may be represented by label 0, as shown in fig. 2.
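A minimal sketch of this labeling rule follows; the threshold value 0.5 is an assumption, since the disclosure leaves the threshold to practical experience or actual needs.

```python
# Turning the classifier's probability into the label of fig. 2. The threshold
# 0.5 is an assumption; the disclosure leaves its value open.
def to_label(prob: float, threshold: float = 0.5) -> int:
    return 1 if prob > threshold else 0   # 1: judged real, 0: judged converted
```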
According to one embodiment of the present disclosure, the classifier 102 may be any of various classifiers in machine learning, such as the classical support vector machine, logistic regression, or the popular convolutional neural network (CNN).
According to one embodiment of the present disclosure, training the classifier 102 may comprise minimizing a loss function represented by the following expression (1):
$$\min_{\theta_C}\ \sum_{n}\Big[\,\mathrm{Dis}_C\big(C(\phi(T(\psi(I_n)))),\,0\big)+\mathrm{Dis}_C\big(C(\phi(D_n)),\,1\big)\Big]\qquad(1)$$

wherein T represents the converter, C represents the classifier, and θ_T and θ_C represent the parameters that each needs to learn, i.e., train. I_n represents an input image, D_n represents its corresponding real depth image, and ψ(I_n) and φ(D_n) represent features extracted from the input image I_n and from the true depth image D_n, respectively.
Thereby, T(ψ(I_n)) represents the depth image output by the converter T from the features ψ(I_n) extracted from the input image I_n, φ(T(ψ(I_n))) represents the features extracted from that depth image, and C(φ(·)) represents the probability value output by the classifier C.
Dis_C(x, label) represents the distance between the output probability value x and the actual label. Thereby, Dis_C(C(φ(T(ψ(I_n)))), 0) represents the distance between the probability value obtained by passing the depth image obtained by the converter T through the classifier C and its real label 0, while Dis_C(C(φ(D_n)), 1) represents the distance between the probability value obtained by passing the real depth image through the classifier C and its real label 1. Here, it should be clear to those skilled in the art that the distance has various expressions. For example, it may be expressed as the squared Euclidean distance Dis_C(x, label) = ||x - label||^2, or as the binary cross-entropy distance Dis_C(x, label) = -(label · ln x + (1 - label) · ln(1 - x)).
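The two distance choices above can be written down directly. The following sketch is illustrative; the function names are assumptions, and `prob` stands for the classifier output C(φ(·)).

```python
# The two Dis_C variants named above. `prob` is the classifier output and
# `label` is 0.0 or 1.0; the function names are assumptions.
import torch

def dis_c_euclidean(prob: torch.Tensor, label: float) -> torch.Tensor:
    return (prob - label) ** 2                     # ||x - label||^2

def dis_c_cross_entropy(prob: torch.Tensor, label: float,
                        eps: float = 1e-7) -> torch.Tensor:
    prob = prob.clamp(eps, 1 - eps)                # numerical safety only
    return -(label * torch.log(prob) + (1 - label) * torch.log(1 - prob))
```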
According to the present embodiment, training the classifier 102, i.e., minimizing the loss function represented by expression (1), gives the classifier 102 the ability to distinguish the type of the input depth image, that is, to correctly determine whether the input depth image is a converted depth image or a real depth image.
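A sketch of this classifier training step, reusing the illustrative models and distance functions from the earlier sketches (label 1 for real depth, label 0 for converted depth, as in fig. 2):

```python
# One classifier update of expression (1) over a batch: converted depths are
# pushed toward label 0 and real depths toward label 1, with theta_T held fixed.
def classifier_loss(C, T, images, real_depths, dis=dis_c_cross_entropy):
    fake_depths = T(images).detach()               # detach: theta_T stays fixed
    loss_fake = dis(C(fake_depths), 0.0).mean()    # first term, label 0
    loss_real = dis(C(real_depths), 1.0).mean()    # second term, label 1
    return loss_fake + loss_real
```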
According to one embodiment of the present disclosure, the converter 101 may take an input image or an extracted image feature of the input image as an input, and output a converted depth value for each pixel of the input image. Here, it should be apparent to those skilled in the art that various converters that convert an input image into a depth image, such as a linear regressor, a Convolutional Neural Network (CNN), and the like, may be used.
The converter 101 serves two purposes. First, each depth value in the depth image generated from the input image via the converter 101 should be as close as possible to the depth value in the real depth image. Second, it should confuse the classifier 102 so that the classifier 102 cannot tell whether an input depth image originates from the converter 101 or from a real depth image in the database.
According to one embodiment of the present disclosure, the loss function for training the converter 101 has two terms, and training the converter 101 may comprise minimizing the loss function represented by the following expression (2):

$$\min_{\theta_T}\ \sum_{n}\Big[\,\mathrm{Dis}_T\big(T(\psi(I_n)),\,D_n\big)+\mathrm{Dis}_C\big(C(\phi(T(\psi(I_n)))),\,1\big)\Big]\qquad(2)$$

wherein, as in expression (1), T represents the converter, C represents the classifier, θ_T and θ_C represent the parameters that each needs to learn, i.e., train, I_n represents an input image, D_n represents its corresponding real depth image, and ψ(I_n) and φ(·) represent features extracted from the input image I_n and from a depth image, respectively.
Dis_T(x, y) represents the distance between vector x and vector y; as before, the distance has various representations, for example the L2 distance or the L1 distance. Minimizing the first term in expression (2) drives the depth value of each pixel of the depth image T(ψ(I_n)) output by the converter 101 toward the depth value of the corresponding pixel of the real depth image D_n. Minimizing the second term in expression (2) drives the classifier 102 to judge the depth image output by the converter 101 as originating from a true depth image (label 1). Thus, the second term in expression (2) acts in opposition to the first term in expression (1), thereby forming an adversarial game.
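A corresponding sketch of the converter loss of expression (2), again reusing the illustrative pieces above; Dis_T is taken here as the squared L2 distance, one of the admissible choices.

```python
# One converter update of expression (2) over a batch. Dis_T is taken here as
# the squared L2 distance; the L1 distance would be an equally valid choice.
def converter_loss(T, C, images, real_depths, dis=dis_c_cross_entropy):
    fake_depths = T(images)
    fit = ((fake_depths - real_depths) ** 2).mean()  # first term: per-pixel fit
    fool = dis(C(fake_depths), 1.0).mean()           # second term: push C toward label 1
    return fit + fool                                # only theta_T is stepped on this loss
```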
According to one embodiment of the present disclosure, expression (1) and expression (2) may be minimized alternately, and such training may be referred to as adversarial training. The second term in expression (2) can be regarded as a high-order constraint term that reflects structural information of a depth image, such as smoothness within objects on the depth image and variation at object edges, and adversarial training enables the depth image output by the converter 101 according to the present disclosure to have these characteristics of a real depth image.
According to one embodiment of the present disclosure, an alternating training approach may be employed such that the parameters of the classifier 102 and the parameters of the converter 101 are updated in turn. For example, the parameters θ_T of the converter T may be held fixed while the parameters θ_C of the classifier C are trained m times; then the parameters θ_C of the classifier C may be held fixed while the parameters θ_T of the converter T are trained k times. The classifier C and the converter T are trained alternately until the classifier C can no longer distinguish whether an input depth image originates from a real depth image or from a depth image output by the converter (i.e., its classification performance approaches 50%).
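Put together, the alternation might look like the following sketch, where m, k, the learning rate, and the choice of Adam as optimizer are assumptions the disclosure leaves open.

```python
# Alternating adversarial training: theta_T fixed for m classifier steps, then
# theta_C fixed for k converter steps ("fixed" here means only the other
# network's optimizer is stepped). m, k, lr, and Adam are assumptions.
import torch

def train(T, C, loader, m=1, k=1, epochs=100, lr=1e-4):
    opt_C = torch.optim.Adam(C.parameters(), lr=lr)
    opt_T = torch.optim.Adam(T.parameters(), lr=lr)
    for _ in range(epochs):
        for images, real_depths in loader:
            for _ in range(m):                     # classifier phase
                opt_C.zero_grad()
                classifier_loss(C, T, images, real_depths).backward()
                opt_C.step()
            for _ in range(k):                     # converter phase
                opt_T.zero_grad()
                converter_loss(T, C, images, real_depths).backward()
                opt_T.step()
    # In practice one would stop once C's accuracy on held-out depth maps
    # approaches 50%, i.e., it can no longer tell converted from real depth.
```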
With the image processing apparatus according to the present disclosure, depth can be learned from a single image through adversarial training, which learns not only the depth of the single image through the converter but also the high-order consistency of depth through the classifier. Through such adversarial training, the converter can output a depth map whose distribution is similar to that of the true depth map.
An image processing method according to an embodiment of the present disclosure will be described below with reference to fig. 3. As shown in fig. 3, the image processing method according to the embodiment of the present disclosure starts at step S310.
In step S310, an input image is converted into a depth image by a converter to obtain a converted depth for each pixel of the input image.
Next, in step S320, classification is performed by a classifier between the converted depth and the true depth from the depth dataset.
Next, in step S330, it is determined whether the classifier can distinguish the depth of the conversion from the real depth.
In step S330, when it is determined that the classifier can still distinguish the converted depth from the true depth (YES in step S330), the process returns to steps S310 and S320, and the classifier and the converter are trained alternately again as described above, until the classifier can no longer distinguish the converted depth from the true depth (NO in step S330), at which point training ends.
According to an image processing method of an embodiment of the present disclosure, the input image may be a single color image or a single gray scale image.
The image processing method according to one embodiment of the present disclosure may further include training the converter until a difference between the converted depth for each pixel and the true depth thereof is less than a predetermined threshold.
According to an image processing method of an embodiment of the present disclosure, the converter may convert based on the extracted features of the input image.
According to an image processing method of an embodiment of the present disclosure, the classifier and the converter may be trained together.
According to an image processing method of an embodiment of the present disclosure, when the classifier and the converter are trained together, parameters of the classifier and parameters of the converter may be updated alternately.
According to the image processing method of one embodiment of the present disclosure, the parameters of the converter may be fixed when the classifier is trained, and the parameters of the classifier may be fixed when the converter is trained.
According to an image processing method of an embodiment of the present disclosure, training the classifier may further include minimizing a training loss function.
According to an image processing method of an embodiment of the present disclosure, the training loss function may be based on a Euclidean distance function or a cross-entropy loss function.
According to an image processing method of an embodiment of the present disclosure, training the converter may further include causing the classifier to treat the converted depth as the true depth.
According to an embodiment of the present disclosure, the classifier may be one of a support vector machine, a logistic regression, and a deep network classifier.
According to an image processing method of an embodiment of the present disclosure, the converter may be one of a linear regression model and a convolutional neural network model.
Various specific implementations of the above steps of the image processing method according to the embodiments of the present disclosure have been described in detail above and will not be repeated here.
It is apparent that the respective operation procedures of the image processing method according to the present disclosure may be implemented in the form of computer-executable programs stored in various machine-readable storage media.
Moreover, the object of the present disclosure can also be achieved by directly or indirectly supplying a storage medium storing the above-described executable program code to a system or apparatus, whose computer or central processing unit (CPU) reads out and executes the program code. In this case, the embodiments of the present disclosure are not limited to the program as long as the system or apparatus has a function of executing the program; the program may be in any form, for example, a target program, a program executed by an interpreter, a script program provided to an operating system, or the like.
Such machine-readable storage media include, but are not limited to: various memories and storage units; semiconductor devices; disk units such as optical, magnetic, and magneto-optical disks; and other media suitable for storing information.
In addition, the technical solution of the present disclosure can also be implemented by connecting a computer to a corresponding website on the internet, and downloading and installing computer program code according to the present disclosure into the computer and then executing the program.
Fig. 4 is a block diagram of an exemplary structure of a general-purpose personal computer 1300 in which an image processing apparatus and method according to an embodiment of the present disclosure may be implemented.
As shown in fig. 4, the CPU 1301 executes various processes according to a program stored in a Read Only Memory (ROM) 1302 or a program loaded from a storage section 1308 to a Random Access Memory (RAM) 1303. In the RAM 1303, data necessary when the CPU 1301 executes various processes and the like is also stored as needed. The CPU 1301, ROM 1302, and RAM 1303 are connected to each other via a bus 1304. An input/output interface 1305 is also connected to the bus 1304.
The following components are connected to the input/output interface 1305: an input portion 1306 (including a keyboard, a mouse, and the like), an output portion 1307 (including a display such as a Cathode Ray Tube (CRT), a Liquid Crystal Display (LCD), and the like, and a speaker, and the like), a storage portion 1308 (including a hard disk, and the like), and a communication portion 1309 (including a network interface card such as a LAN card, a modem, and the like). The communication section 1309 performs a communication process via a network such as the internet. The drive 1310 may also be connected to the input/output interface 1305 as desired. The removable medium 1311 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, or the like is installed as needed on the drive 1310, so that a computer program read out therefrom is installed into the storage section 1308 as needed.
In the case of implementing the above-described series of processes by software, a program constituting the software is installed from a network such as the internet or a storage medium such as the removable medium 1311.
It will be appreciated by those skilled in the art that such a storage medium is not limited to the removable medium 1311 shown in fig. 4, which stores the program and is distributed separately from the apparatus to provide the program to the user. Examples of the removable medium 1311 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium may be the ROM 1302, a hard disk contained in the storage section 1308, or the like, in which the program is stored and which is distributed to users together with the device containing it.
In the systems and methods of the present disclosure, it is apparent that components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be considered equivalent to the present disclosure. The steps of the series of processes described above may naturally be executed chronologically in the order described, but need not be; some steps may be performed in parallel or independently of one another.
Although the embodiments of the present disclosure have been described in detail above with reference to the accompanying drawings, it should be understood that the above-described embodiments are merely illustrative of the present disclosure and not limiting thereof. Various modifications and alterations to the above described embodiments may be made by those skilled in the art without departing from the spirit and scope of the disclosure. The scope of the disclosure is, therefore, indicated only by the appended claims and their equivalents.
With respect to implementations including the above examples, the following supplementary notes are also disclosed:
supplementary note 1. An image processing apparatus includes:
a converter that converts an input image into a depth image to obtain a converted depth for each pixel of the input image; and
a classifier that classifies between the converted depth and the true depth from the depth dataset,
wherein the classifier and the converter are trained until the classifier is unable to distinguish the converted depth from the true depth.
Supplementary note 2. The apparatus according to supplementary note 1, wherein the input image is a single color or grayscale image.
Supplementary note 3. The apparatus according to supplementary note 1, wherein the converter is trained until the difference between the converted depth for each pixel and its true depth is less than a predetermined threshold.
Supplementary note 4. The apparatus according to supplementary note 3, wherein the converter converts based on extracted features of the input image.
Supplementary note 5. The apparatus according to supplementary note 3, wherein the classifier and the converter are trained together.
Supplementary note 6. The apparatus according to supplementary note 5, wherein the parameters of the classifier and the parameters of the converter are updated alternately when training the classifier and the converter together.
Supplementary note 7. The apparatus according to supplementary note 6, wherein the parameters of the converter are fixed when training the classifier, and the parameters of the classifier are fixed when training the converter.
Supplementary note 8. The apparatus according to supplementary note 1, wherein training the classifier includes minimizing a training loss function.
Supplementary note 9. The apparatus according to supplementary note 8, wherein the training loss function is based on a Euclidean distance function or a cross-entropy loss function.
Supplementary note 10. The apparatus according to supplementary note 3, wherein training the converter includes causing the classifier to treat the converted depth as the true depth.
Supplementary note 11. The apparatus according to supplementary note 1, wherein the classifier is one of a support vector machine, a logistic regression, and a deep network classifier.
Supplementary note 12. The apparatus according to supplementary note 3, wherein the converter is one of a linear regression model and a convolutional neural network model.
Supplementary note 13. An image processing method includes:
converting, by a converter, an input image into a depth image to obtain a converted depth for each pixel of the input image; and
classifying by a classifier between the converted depth and the true depth from the depth dataset,
wherein the classifier and the converter are trained until the classifier is unable to distinguish the converted depth from the true depth.
Supplementary note 14. The method according to supplementary note 13, wherein the input image is a single color or grayscale image.
Supplementary note 15. The method according to supplementary note 13, further comprising training the converter until the difference between the converted depth for each pixel and its true depth is less than a predetermined threshold.
Supplementary note 16. The method according to supplementary note 15, wherein the converter converts based on extracted features of the input image.
Supplementary note 17. The method according to supplementary note 15, wherein the classifier and the converter are trained together.
Supplementary note 18. The method according to supplementary note 17, wherein the parameters of the classifier and the parameters of the converter are updated alternately when training the classifier and the converter together.
Supplementary note 19. The method according to supplementary note 18, wherein the parameters of the converter are fixed when training the classifier, and the parameters of the classifier are fixed when training the converter.
Supplementary note 20. A program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, is capable of causing the computer to perform the method according to any one of supplementary notes 13-19.

Claims (10)

1. An image processing apparatus comprising:
a converter that converts an input image into a depth image to obtain a converted depth for each pixel of the input image; and
a classifier that classifies between the converted depth and the true depth from the depth dataset,
wherein the classifier and the converter are trained until the classifier is unable to distinguish the converted depth from the true depth,
wherein the loss function of the classifier comprises: a first term representing a distance between a probability value, obtained by passing the depth image obtained by the converter through the classifier, and a first real label; and a second term representing a distance between a probability value, obtained by passing the real depth image through the classifier, and a second real label, and the loss function of the classifier is a sum of the first term and the second term over all input images, and
wherein the loss function of the converter comprises: a third term representing a distance between the depth image obtained by the converter and the real depth image; and a fourth term representing a distance between a probability value, obtained by passing the depth image obtained by the converter through the classifier, and the second real label, and the loss function of the converter is a sum of the third term and the fourth term over all input images.
2. The apparatus of claim 1, wherein the input image is a single color or grayscale image.
3. The apparatus of claim 1, wherein the converter is trained until the difference between the converted depth for each pixel and its true depth is less than a predetermined threshold.
4. The apparatus of claim 3, wherein the converter converts based on extracted features of the input image.
5. The apparatus of claim 3, wherein the classifier and the converter are trained together.
6. The apparatus of claim 5, wherein the parameters of the classifier and the parameters of the converter are updated alternately as the classifier and the converter are trained together.
7. The apparatus of claim 6, wherein the parameters of the converter are fixed while the classifier is trained, and the parameters of the classifier are fixed while the converter is trained.
8. The apparatus of claim 1, wherein training the classifier comprises minimizing a training loss function.
9. An image processing method, comprising:
converting, by a converter, an input image into a depth image to obtain a converted depth for each pixel of the input image; and
classifying by a classifier between the converted depth and the true depth from the depth dataset,
wherein the classifier and the converter are trained until the classifier is unable to distinguish the converted depth from the true depth,
wherein the loss function of the classifier comprises: a first term representing a distance between a probability value, obtained by passing the depth image obtained by the converter through the classifier, and a first real label; and a second term representing a distance between a probability value, obtained by passing the real depth image through the classifier, and a second real label, and the loss function of the classifier is a sum of the first term and the second term over all input images, and
wherein the loss function of the converter comprises: a third term representing a distance between the depth image obtained by the converter and the real depth image; and a fourth term representing a distance between a probability value, obtained by passing the depth image obtained by the converter through the classifier, and the second real label, and the loss function of the converter is a sum of the third term and the fourth term over all input images.
10. A machine-readable storage medium having a program product embodied thereon, the program product comprising machine-readable instruction code stored therein, wherein the instruction code, when read and executed by a computer, causes the computer to perform the method of claim 9.
CN201711070256.XA 2017-11-03 2017-11-03 Image processing apparatus and method Active CN109754416B (en)

Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
CN201711070256.XA | 2017-11-03 | 2017-11-03 | Image processing apparatus and method


Publications (2)

Publication Number | Publication Date
CN109754416A (en) | 2019-05-14
CN109754416B (en) | 2023-08-04 (grant)

Family

ID: 66398679

Family Applications (1)

Application Number | Title | Priority Date | Filing Date | Status
CN201711070256.XA | Image processing apparatus and method | 2017-11-03 | 2017-11-03 | Active

Country Status (1)

Country | Link
CN | CN109754416B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
CN112013921B* | 2019-05-30 | 2023-06-23 | 杭州海康威视数字技术股份有限公司 | Method, device and system for acquiring water level information based on water level gauge measurement image
CN110223230A | 2019-05-30 | 2019-09-10 | 华南理工大学 | A multi-front-end depth image super-resolution system and data processing method therefor

Patent Citations (4)

* Cited by examiner, † Cited by third party

Publication number | Priority date | Publication date | Assignee | Title
US9111350B1* | 2012-02-10 | 2015-08-18 | Google Inc. | Conversion of monoscopic visual content to stereoscopic 3D
CN106951919A* | 2017-03-02 | 2017-07-14 | 浙江工业大学 | A flow monitoring implementation method based on adversarial generative networks
CN107180392A* | 2017-05-18 | 2017-09-19 | 北京科技大学 | A digital simulation method for electric power enterprise tariff recovery
CN107239766A* | 2017-06-08 | 2017-10-10 | 深圳市唯特视科技有限公司 | A saliency face alignment method using adversarial networks and a three-dimensional configuration model

Family Cites Families (1)

Publication number | Priority date | Publication date | Assignee | Title
US9471988B2* | 2011-11-02 | 2016-10-18 | Google Inc. | Depth-map generation for an input image using an example approximate depth-map associated with an example similar image

Also Published As

Publication number | Publication date
CN109754416A | 2019-05-14

Similar Documents

Publication Publication Date Title
Denton et al. Semi-supervised learning with context-conditional generative adversarial networks
US11886990B2 (en) Classification device, classification method, and computer program product
JP2008234479A (en) Image quality improvement device, method, and program
CN109754416B (en) Image processing apparatus and method
CN116982089A (en) Method and system for image semantic enhancement
CN116385827A (en) Parameterized face reconstruction model training method and key point tag data generation method
CN111507406A (en) Method and equipment for optimizing neural network text recognition model
CN112669215A (en) Training text image generation model, text image generation method and device
CN110111365B (en) Training method and device based on deep learning and target tracking method and device
KR20230073751A (en) System and method for generating images of the same style based on layout
JP2010009517A (en) Learning equipment, learning method and program for pattern detection device
US20220261641A1 (en) Conversion device, conversion method, program, and information recording medium
US20190156182A1 (en) Data inference apparatus, data inference method and non-transitory computer readable medium
CN115688234A (en) Building layout generation method, device and medium based on conditional convolution
US20220101145A1 (en) Training energy-based variational autoencoders
JPWO2019123642A1 (en) Image recognition systems, methods and programs, and parameter learning systems, methods and programs
CN114972016A (en) Image processing method, image processing apparatus, computer device, storage medium, and program product
JP2014149788A (en) Object area boundary estimation device, object area boundary estimation method, and object area boundary estimation program
Rohith et al. Image Generation Based on Text Using BERT And GAN Model
CN116601946A (en) Encoding video frames using different compression ratios for text blocks and non-text blocks
Yu et al. Recurrent deconvolutional generative adversarial networks with application to text guided video generation
CN117132777B (en) Image segmentation method, device, electronic equipment and storage medium
CN113392840B (en) Real-time semantic segmentation method based on multi-scale segmentation fusion
US20210383226A1 (en) Cross-transformer neural network system for few-shot similarity determination and classification
KR20230168258A (en) Image processing methods and devices, computer devices, storage media, and program products

Legal Events

Code | Title
PB01 | Publication
SE01 | Entry into force of request for substantive examination
GR01 | Patent grant