WO2022143366A1 - Image processing method and apparatus, electronic device, medium, and computer program product - Google Patents


Info

Publication number
WO2022143366A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
map
segmentation
predicted
target
Prior art date
Application number
PCT/CN2021/140683
Other languages
French (fr)
Chinese (zh)
Inventor
周芳汝
杨玫
安山
Original Assignee
北京沃东天骏信息技术有限公司
北京京东世纪贸易有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京沃东天骏信息技术有限公司 and 北京京东世纪贸易有限公司
Publication of WO2022143366A1 publication Critical patent/WO2022143366A1/en


Classifications

    • G: PHYSICS; G06: COMPUTING; CALCULATING OR COUNTING
    • G06T 7/00: Image analysis; G06T 7/10: Segmentation; Edge detection; G06T 7/11: Region-based segmentation
    • G06N 3/00: Computing arrangements based on biological models; G06N 3/02: Neural networks; G06N 3/04: Architecture, e.g. interconnection topology; G06N 3/045: Combinations of networks
    • G06N 3/08: Neural networks; Learning methods
    • G06T 7/70: Determining position or orientation of objects or cameras; G06T 7/73: using feature-based methods
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement; G06T 2207/10: Image acquisition modality; G06T 2207/10028: Range image; Depth image; 3D point clouds
    • G06T 2207/20: Special algorithmic details; G06T 2207/20081: Training; Learning
    • G06T 2207/20084: Special algorithmic details; Artificial neural networks [ANN]
    • G06T 2207/30: Subject of image; Context of image processing; G06T 2207/30244: Camera pose

Definitions

  • the embodiments of the present disclosure relate to the field of computer technology, and more particularly, to an image processing method, apparatus, electronic device, medium, and computer program product.
  • in the field of computer vision, depth estimation is part of 3D reconstruction and requires estimating depth information from 2D images.
  • for some specific tasks, such as a monocular robot avoiding or searching for a target object (for example, a person), segmenting the target object from the two-dimensional image and estimating its depth are extremely important.
  • embodiments of the present disclosure provide an image processing method, apparatus, electronic device, medium, and computer program product.
  • An aspect of the embodiments of the present disclosure provides an image processing method, including: acquiring a target image, wherein the target image includes a target object and a non-target object; performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
  • Another aspect of the embodiments of the present disclosure provides an image processing apparatus, including: an acquisition module for acquiring a target image, wherein the target image includes a target object and a non-target object; a first processing module for performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; a determining module for determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and a second processing module for processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
  • Another aspect of the embodiments of the present disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the above method.
  • Another aspect of the embodiments of the present disclosure provides a computer-readable storage medium having executable instructions stored thereon, the instructions, when executed by a processor, cause the processor to implement the above method.
  • Another aspect of the embodiments of the present disclosure provides a computer program product, where the computer program product includes a computer program, and the computer program is used to implement the above method when executed by a processor.
  • according to the embodiments of the present disclosure, the target image includes a target object and a non-target object; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively.
  • the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object, and the predicted depth map is processed according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object.
  • since the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position, the technical problem in the related art that depth estimation for the target object in the target image is difficult to realize is at least partially overcome; the depth of the target object in the target image is determined more accurately, and the method generalizes well.
  • FIG. 1 schematically shows an exemplary system architecture to which the image processing method and apparatus according to the embodiments of the present disclosure can be applied;
  • FIG. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure
  • FIG. 3 schematically shows a structural diagram of an image processing model according to an embodiment of the present disclosure
  • FIG. 4 schematically shows a flowchart of another image processing method according to an embodiment of the present disclosure
  • FIG. 5 schematically shows a schematic diagram of a target image according to an embodiment of the present disclosure
  • FIG. 6 schematically shows a predicted depth map of a target image according to an embodiment of the present disclosure
  • FIG. 7 schematically shows a predicted segmentation map of a target image according to an embodiment of the present disclosure
  • FIG. 8 schematically shows a predicted depth map of a target object according to an embodiment of the present disclosure
  • FIG. 9 schematically shows a predicted depth map of another target object according to an embodiment of the present disclosure.
  • FIG. 10 schematically shows a schematic diagram of still another target object according to an embodiment of the present disclosure.
  • FIG. 11 schematically shows a flowchart of still another image processing method according to an embodiment of the present disclosure
  • FIG. 12 schematically shows a flowchart of still another image processing method according to an embodiment of the present disclosure
  • FIG. 13 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
  • FIG. 14 schematically shows a block diagram of an electronic device suitable for an image processing method according to an embodiment of the present disclosure.
  • a phrase such as "at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, or A, B, and C together.
  • Embodiments of the present disclosure provide an image processing method, an image processing apparatus, and an electronic device applying the method.
  • the method includes acquiring a target image, wherein the target image includes a target object and a non-target object, and performing image segmentation processing and depth estimation processing on the target image to obtain the predicted segmentation map and the predicted depth map of the target image, respectively.
  • the location of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target image.
  • the predicted depth map is processed according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object.
  • FIG. 1 schematically shows an exemplary system architecture 100 to which an image processing method or apparatus may be applied according to an embodiment of the present disclosure.
  • FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure can be applied, so as to help those skilled in the art understand the technical content of the present disclosure, but it does not mean that the embodiments of the present disclosure cannot be applied to other devices, systems, environments, or scenarios.
  • the system architecture 100 may include terminal devices 101 , 102 , and 103 , a network 104 and a server 105 .
  • the network 104 is a medium used to provide a communication link between the terminal devices 101 , 102 , 103 and the server 105 .
  • the network 104 may include various connection types, such as wired and/or wireless communication links, and the like.
  • the user can use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
  • Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as image processing applications, model building applications, search applications, instant messaging tools, email clients, and/or social platform software (examples only).
  • the terminal devices 101, 102, and 103 may be various electronic devices having a display screen and supporting image processing and web browsing, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers.
  • the server 105 may be a server that provides various services, such as a background management server (merely an example) that provides support for websites browsed and images processed by users with the terminal devices 101, 102, and 103.
  • the background management server can process, analyze and save the received images, camera information and other data, and feed back the processing results (such as web pages, information, or data obtained or generated according to user requests) to the terminal device.
  • the image processing method provided by the embodiment of the present disclosure may generally be executed by the server 105 .
  • the image processing apparatus provided by the embodiments of the present disclosure may generally be provided in the server 105 .
  • the image processing method provided by the embodiment of the present disclosure may also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the image processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or a server cluster that is different from the server 105 and can communicate with the terminal devices 101 , 102 , 103 and/or the server 105 .
  • the image processing method provided by the embodiments of the present disclosure may also be executed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal devices 101, 102, and 103.
  • the image processing apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101 , 102 or 103 , or in other terminal devices different from the terminal device 101 , 102 or 103 .
  • the target image may be originally stored in any one of the terminal devices 101, 102, or 103 (e.g., the terminal device 101, but not limited thereto), or stored on an external storage device and imported into the terminal device 101. Then, the terminal device 101 may locally execute the image processing method provided by the embodiments of the present disclosure, or send the target image to another terminal device, server, or server cluster, which then executes the image processing method provided by the embodiments of the present disclosure.
  • terminal devices, networks and servers in FIG. 1 are merely illustrative. There can be any number of terminal devices, networks and servers according to implementation needs.
  • FIG. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
  • the method includes operations S210-S240.
  • a target image is acquired, wherein the target image includes a target object and a non-target object.
  • the target image may be a monocular image
  • the target image may include a target object and a non-target object, wherein the target object may be a person in the target image, and the non-target object may be a background object in the target image, such as a desk, a tree, or a car. However, this is not limiting: any object can be designated as the target object according to actual needs, and objects of categories other than that of the designated target object are non-target objects.
  • the number of target objects can include one or more.
  • image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively.
  • performing depth estimation processing on the target image may be performing depth estimation on each pixel point on the target image according to the depth relationship reflected by the pixel value relationship.
  • the image segmentation may be a method of semantic segmentation, but is not limited thereto, and may also be a method of instance segmentation.
  • semantic segmentation classifies each pixel in the image and obtains a semantic segmentation mask corresponding to the image size, that is, the predicted semantic segmentation map of the image; however, semantic segmentation does not distinguish different objects of the same category, that is, it does not distinguish instances. Instance segmentation not only classifies pixels by category but also distinguishes different objects of the same category, that is, it distinguishes instances.
  • through instance segmentation, an instance segmentation mask corresponding to the image size can be obtained, that is, the predicted instance segmentation map of the image.
  • the predicted semantic segmentation map of the image and the predicted instance segmentation map of the image may be collectively referred to as the predicted segmentation map of the image.
  • different categories may be represented by different colors, and correspondingly, different colors in the predicted segmentation map of the image represent different categories.
  • semantic segmentation or instance segmentation can be used to perform image segmentation processing on the target image to obtain a predicted segmentation map of the target image.
  • the size of the predicted segmentation map of the target image may be determined according to the actual situation, which is not specifically limited here.
  • the predicted segmentation map size of the target image is the same as the size of the target image.
  • the predicted segmentation map size of the target image is one-half the size of the target image.
  • using the semantic segmentation method to perform image segmentation processing can achieve real-time speed, so as to meet the processing requirements of real-time tasks.
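  • As an illustration only (not the network of the present disclosure), the following Python sketch shows how a predicted semantic segmentation map can be obtained from a target image with an off-the-shelf model; it assumes torchvision >= 0.13 and a hypothetical input file name:

```python
# A minimal sketch of "image segmentation processing": per-pixel class
# prediction with a pretrained DeepLabV3 model. This stands in for the
# disclosure's own segmentation network, which is described later (FIG. 3).
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()  # 21 Pascal VOC classes

preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406],
                         std=[0.229, 0.224, 0.225]),
])

image = Image.open("target_image.jpg").convert("RGB")  # hypothetical file
x = preprocess(image).unsqueeze(0)                     # [1, 3, H, W]

with torch.no_grad():
    logits = model(x)["out"]                           # [1, 21, H, W]

# The predicted segmentation map: one class index per pixel, with the same
# height and width as the target image (cf. the size discussion above).
predicted_segmentation_map = logits.argmax(dim=1).squeeze(0)  # [H, W]
```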
  • the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object.
  • the position of the target object in the predicted segmentation map of the target image can be determined from the predicted segmentation map of the target image.
  • the predicted depth map and the predicted segmentation map of the target image correspond position by position. Therefore, the position of the target object in the predicted depth map of the target image can be obtained based on the position of the target object in the predicted segmentation map of the target image.
  • the predicted depth map is processed according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object.
  • in the related art, depth estimation can be performed on a monocular image by combining gradient and texture features; that is, the gradient information and texture information of the target image are used as depth cues to assist a deep convolutional network in learning the depth information of the target image, obtaining the predicted depth map of the target image.
  • the target image can also be fed into a depth estimation network based on camera pose estimation to obtain a predicted depth map of the target image.
  • the above methods can only perform depth estimation on the entire target image, and it is difficult to extract the depth of the target object in the target image.
  • a sample-based learning method can also be used to realize depth estimation.
  • the sample-based learning method is to construct a data set, convert the problem of depth estimation of the target object into a retrieval problem, and use the method of feature matching to retrieve the data set to obtain the depth estimation result of the target object in the target image.
  • the sample-based learning method can estimate the depth of the target object in the target image that has a matching relationship with the image in the dataset. But if an image matching the target image cannot be retrieved in the dataset, depth estimation for the target object in the target image cannot be achieved.
  • the above method can estimate the depth of the target object in the target image, but cannot obtain the position of the target object in the target image. This method has poor generalization and low estimation accuracy.
  • in the embodiments of the present disclosure, by contrast, the position of the target object in the predicted depth map of the target image can be obtained according to the predicted segmentation map of the target image, and the predicted depth map of the target object can be obtained by processing the predicted depth map at the position of the target object, so the depth of the target object in the target image can be determined more accurately.
  • since the depth estimation is not implemented by means of retrieval and matching, it is less affected by the samples; therefore, the solution of the embodiments of the present disclosure has strong generalization.
  • the methods provided in the above operations S210 to S240 may be used to obtain the predicted depth map of the target object.
  • when there are multiple target objects, their predicted depth maps may be presented on one predicted depth map, or the predicted depth map of each target object may be presented separately.
  • according to the embodiments of the present disclosure, the target image includes a target object and a non-target object; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain the predicted depth map of the target object.
  • since the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position, the technical problem in the related art that depth estimation for the target object in the target image is difficult to realize is at least partially overcome; the depth of the target object in the target image is determined more accurately, and the method generalizes well.
  • The method shown in FIG. 2 will be further described below with reference to FIGS. 3 to 10 in conjunction with specific embodiments.
  • FIG. 3 schematically shows a structural diagram of an image processing model according to an embodiment of the present disclosure.
  • the image processing model includes a feature extraction network, an image segmentation network and a depth estimation network.
  • the present disclosure constructs an image processing model, which is obtained by training an encoding-decoding network with training samples; that is, the target image to be predicted is input into the image processing model, the predicted segmentation map and the predicted depth map of the target image are output respectively, and the predicted depth map of the target object in the target image is then obtained.
  • the image processing model constructed in the present disclosure makes up for the technical deficiency in the related art that the predicted segmentation map and the predicted depth map of the target image cannot be output simultaneously.
  • FIG. 4 schematically shows a flowchart of another image processing method according to an embodiment of the present disclosure.
  • using the image processing model to process the target image and respectively obtaining the predicted segmentation map and the predicted depth map of the target image may include the following operations S410 to S460 .
  • a feature extraction network is used to process the target image to obtain a first intermediate feature map.
  • the image segmentation network is used to process the first intermediate feature map to obtain a second intermediate feature map.
  • the depth estimation network is used to process the first intermediate feature map to obtain a third intermediate feature map.
  • a fourth intermediate feature map is generated according to the second intermediate feature map and the third intermediate feature map.
  • a depth estimation network is used to process the fourth intermediate feature map to obtain a predicted depth map of the target image.
  • the image segmentation network is used to process the second intermediate feature map to obtain a predicted segmentation map of the target image.
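  • The dual-branch structure of operations S410 to S460 can be sketched as follows; this is a minimal PyTorch illustration, with a small convolutional stack standing in for the MobileNet+ASPP encoder and all channel counts chosen as assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ImageProcessingModel(nn.Module):
    """Sketch of FIG. 3: a shared encoder feeding a segmentation branch and
    a depth branch, with the segmentation feature map fused into the depth
    branch. Layer sizes are illustrative, not the disclosure's actual network."""

    def __init__(self, num_classes: int = 21):
        super().__init__()
        # Feature extraction network (encoder): target image -> f4
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )
        self.seg_branch = nn.Conv2d(128, 64, 3, padding=1)    # f4 -> f5
        self.seg_head = nn.Conv2d(64, num_classes, 1)
        self.depth_branch = nn.Conv2d(128, 64, 3, padding=1)  # f4 -> f6
        self.fuse = nn.Conv2d(64 + 64, 64, 3, padding=1)      # (f5, f6) -> f7
        self.depth_head = nn.Conv2d(64, 1, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        f4 = self.encoder(x)                # S410: first intermediate feature map
        f5 = F.relu(self.seg_branch(f4))    # S420: second intermediate feature map
        f6 = F.relu(self.depth_branch(f4))  # S430: third intermediate feature map
        f7 = F.relu(self.fuse(torch.cat([f5, f6], dim=1)))  # S440: fourth map
        depth = self.depth_head(f7)         # S450: predicted depth map
        seg = self.seg_head(f5)             # S460: predicted segmentation map
        # Convolution + upsampling back to the input resolution
        depth = F.interpolate(depth, (h, w), mode="bilinear", align_corners=False)
        seg = F.interpolate(seg, (h, w), mode="bilinear", align_corners=False)
        return seg, depth
```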
  • for example, MobileNet combined with an ASPP (Atrous Spatial Pyramid Pooling) module may be used as the feature extraction network, that is, the encoding network.
  • the height and width of the target image can be recorded as H and W, respectively.
  • the output of the intermediate layer is a feature map f1.
  • the feature map f2 can be input into the ASPP module, and the obtained output is fused with the feature map f2 to output the feature map f3.
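  • An ASPP module, as commonly defined, applies parallel atrous convolutions at several dilation rates together with image-level pooling and fuses the results; the sketch below uses the conventional rates (6, 12, 18), which the text does not specify and which are therefore assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    """Atrous Spatial Pyramid Pooling: parallel atrous convolutions with
    different dilation rates plus global pooling, concatenated and fused
    by a 1x1 convolution. Rates and channel counts are assumed values."""

    def __init__(self, in_ch: int, out_ch: int = 256):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, out_ch, 1),                         # 1x1 branch
            nn.Conv2d(in_ch, out_ch, 3, padding=6, dilation=6),
            nn.Conv2d(in_ch, out_ch, 3, padding=12, dilation=12),
            nn.Conv2d(in_ch, out_ch, 3, padding=18, dilation=18),
        ])
        self.pool = nn.AdaptiveAvgPool2d(1)     # image-level context
        self.pool_conv = nn.Conv2d(in_ch, out_ch, 1)
        self.project = nn.Conv2d(out_ch * 5, out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        feats = [F.relu(b(x)) for b in self.branches]
        pooled = F.relu(self.pool_conv(self.pool(x)))
        feats.append(F.interpolate(pooled, (h, w), mode="bilinear",
                                   align_corners=False))
        return F.relu(self.project(torch.cat(feats, dim=1)))
```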
  • both the image segmentation network and the depth estimation network take the same feature map f4 (i.e., the first intermediate feature map) as input, and respectively output the second intermediate feature map (i.e., feature map f5) and the third intermediate feature map (i.e., feature map f6). It should be noted that, since the first intermediate feature map processed by the feature extraction network is used as the input of both the depth estimation network and the image segmentation network, the feature extraction network comprehensively considers the detailed information and the abstract information of the target image.
  • the second intermediate feature map (i.e., feature map f5) output by the image segmentation network can be input into the depth estimation network and combined with the third intermediate feature map (i.e., feature map f6) to obtain the fourth intermediate feature map (i.e., feature map f7).
  • the fourth intermediate feature map is processed by convolution and upsampling to obtain the predicted depth map of the target image.
  • the second intermediate feature map is processed by convolution and upsampling to obtain the predicted segmentation map of the target image.
  • in this way, the prediction result of the depth estimation network is corrected, so that the depth estimation result is more accurate.
  • processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object may include the following operations.
  • the pixel values at positions in the predicted depth map of the target image other than the position of the target object are set to a preset pixel value to obtain the predicted depth map of the target object.
  • the preset pixel value may be set according to the actual situation, which is not specifically limited here, for example, the preset pixel value may be 0.
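  • A minimal NumPy sketch of this masking step, assuming class-indexed maps of equal size and 0 as the preset pixel value:

```python
import numpy as np

def extract_target_depth(predicted_depth: np.ndarray,
                         predicted_segmentation: np.ndarray,
                         target_class: int,
                         preset_value: float = 0.0) -> np.ndarray:
    """Keep depth values only where the segmentation map predicts the
    target class; every other position is set to the preset pixel value."""
    target_depth = predicted_depth.copy()
    target_depth[predicted_segmentation != target_class] = preset_value
    return target_depth

# Hypothetical usage, with class index 15 standing in for "person":
# person_depth = extract_target_depth(depth_map, seg_map, target_class=15)
```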
  • FIG. 5 schematically shows a schematic diagram of a target image according to an embodiment of the present disclosure.
  • FIG. 6 schematically illustrates a predicted depth map of a target image according to an embodiment of the present disclosure.
  • FIG. 7 schematically shows a predicted segmentation map of a target image according to an embodiment of the present disclosure.
  • FIG. 8 schematically shows a predicted depth map of a target object according to an embodiment of the present disclosure.
  • the target object in Figure 8 is a person.
  • FIG. 9 schematically shows a predicted depth map of another target object according to an embodiment of the present disclosure.
  • the target object in Figure 9 is a refrigerator.
  • FIG. 10 schematically shows a schematic diagram of still another target object according to an embodiment of the present disclosure.
  • FIG. 10 includes multiple target objects.
  • the predicted depth map of the target object is obtained by applying the predicted segmentation map of the target image to the predicted depth map of the target image.
  • black represents the non-target area.
  • an image processing model is used to process the target image to obtain the predicted segmentation map and the predicted depth map of the target image, respectively, wherein the image processing model is trained using training samples, and the training samples include sample images and the depth labels and segmentation labels of the sample images.
  • the image processing model is obtained by training using training samples, and may include the following operations.
  • training samples are obtained, and a fully convolutional neural network model is trained with the training samples to obtain the image processing model.
  • the fully convolutional neural network model includes an initial feature extraction network, an initial image segmentation network, and an initial depth estimation network.
  • using training samples to train a fully convolutional neural network model to obtain an image processing model may include the following operations.
  • the sample image is processed using the initial feature extraction network to obtain a fifth intermediate feature map.
  • the fifth intermediate feature map is processed using the initial image segmentation network to obtain a sixth intermediate feature map.
  • the fifth intermediate feature map is processed using the initial depth estimation network to obtain the seventh intermediate feature map.
  • an eighth intermediate feature map is generated according to the sixth intermediate feature map and the seventh intermediate feature map.
  • the eighth intermediate feature map is processed using the initial depth estimation network to obtain the predicted depth map of the sample image.
  • the sixth intermediate feature map is processed using the initial image segmentation network to obtain a predicted segmentation map of the sample image. The depth label, the predicted depth map, the segmentation label, and the predicted segmentation map of the sample image are input into the loss function of the fully convolutional neural network model, and the loss result is output. The network parameters of the fully convolutional neural network model are adjusted according to the loss result until the loss function converges, and the trained fully convolutional neural network model is used as the image processing model.
  • the fifth intermediate feature map can be understood as the feature map f4 in FIG. 3.
  • the sixth intermediate feature map can be understood as the feature map f5 in FIG. 3, and the seventh intermediate feature map can be understood as the feature map f6 in FIG. 3.
  • the eighth intermediate feature map can be understood as the feature map f7 in FIG. 3.
  • after training, the initial feature extraction network, the initial image segmentation network, and the initial depth estimation network are respectively referred to as the feature extraction network, the image segmentation network, and the depth estimation network.
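  • A minimal sketch of one training step follows, reusing the ImageProcessingModel sketch given earlier; the loss functions (L1 for depth, cross-entropy for segmentation) and their equal weighting are assumptions, since the text only states which quantities enter the loss function:

```python
import torch
import torch.nn as nn

model = ImageProcessingModel(num_classes=21)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
depth_criterion = nn.L1Loss()                          # assumed depth loss
seg_criterion = nn.CrossEntropyLoss(ignore_index=-1)   # -1 = unmapped category

def train_step(sample_image, depth_label, segmentation_label):
    """One optimization step: forward pass, joint loss, parameter update."""
    pred_seg, pred_depth = model(sample_image)          # [N,C,H,W], [N,1,H,W]
    loss = (depth_criterion(pred_depth.squeeze(1), depth_label)
            + seg_criterion(pred_seg, segmentation_label))
    optimizer.zero_grad()
    loss.backward()       # adjust network parameters according to the loss
    optimizer.step()
    return loss.item()    # training continues until the loss converges
```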
  • FIG. 11 schematically shows a flowchart of still another image processing method according to an embodiment of the present disclosure.
  • performing image segmentation processing on the sample image to obtain the segmentation label of the sample image may include the following operations S1110 to S1130.
  • an instance segmentation process is performed on the sample image to obtain an instance segmentation label of the sample image.
  • the semantic segmentation label of the sample image is obtained according to the instance segmentation label of the sample image.
  • the semantic segmentation label of the sample image is used as the segmentation label of the sample image.
  • in instance segmentation, an image may include multiple instances belonging to the same category, and these instances need to be distinguished.
  • for example, the target image may include multiple objects belonging to the category of people, that is, it includes multiple people.
  • through instance segmentation, these multiple people are distinguished, and each person gets a corresponding instance segmentation label.
  • semantic segmentation is to classify each pixel in the image, but does not distinguish instances.
  • the target image may include multiple people belonging to the category of people, that is, including multiple people.
  • in semantic segmentation, there is no need to distinguish these multiple people, and all of them get the same semantic segmentation label.
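  • Deriving a semantic segmentation label from instance segmentation output can be sketched as follows, assuming per-instance binary masks each tagged with a category id:

```python
import numpy as np

def instances_to_semantic(masks, category_ids, height, width):
    """Merge per-instance binary masks into one semantic label map: all
    instances of the same category receive the same label, so instance
    identity is deliberately discarded (operation S1120)."""
    semantic = np.full((height, width), -1, dtype=np.int64)  # -1 = unlabeled
    for mask, category in zip(masks, category_ids):
        semantic[mask.astype(bool)] = category
    return semantic

# Hypothetical usage: two "person" instances sharing the same category id
# both map to the same label in the resulting semantic segmentation label.
```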
  • for example, a Mask_RCNN (Region-based Convolutional Neural Network) can be used to perform instance segmentation on sample image databases with depth labels, such as CAD_60, CAD_120, and EPFL, to obtain the instance segmentation labels.
  • semantic segmentation labels can also be used alone. Since the embodiment of the present disclosure adopts the sample image database for depth estimation, the depth label of the sample image can be obtained.
  • Mask_RCNN can detect and segment objects of 82 categories (including the background category). In practical applications, there may be fewer than 82 categories in the sample image database used for depth estimation; directly using all 82 categories as segmentation labels would expand the segmentation range and increase the error probability of the segmentation processing.
  • of these categories, 59 appear in the sample image database used; these are retained in order to construct dense segmentation labels.
  • as shown in Table 1, the categories of Mask_RCNN can be mapped, and the categories that are not involved are marked as -1, so as to reduce the error probability and improve the segmentation effect and accuracy of the image segmentation processing.
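  • The remapping can be sketched as follows; the actual Table 1 is not reproduced in this text, so the two-entry dictionary below is a placeholder assumption:

```python
import numpy as np

# Placeholder for Table 1 (82 Mask_RCNN categories -> the 59 categories
# present in the depth-labeled database); the entries are illustrative only.
CATEGORY_MAP = {1: 0, 63: 1}  # hypothetical: e.g. person -> 0, chair -> 1

def remap_labels(label_map: np.ndarray) -> np.ndarray:
    """Map Mask_RCNN category ids onto the reduced label set; categories
    that are not involved in the mapping are marked as -1."""
    remapped = np.full_like(label_map, -1)
    for src, dst in CATEGORY_MAP.items():
        remapped[label_map == src] = dst
    return remapped
```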
  • the result of instance segmentation labels obtained by using the Mask_RCNN network has high accuracy, which is beneficial to the construction of an image processing model.
  • semantic segmentation can also be directly performed on a sample image database with depth labels to obtain the semantic segmentation labels of the sample images.
  • image segmentation is performed on the sample images to obtain the segmentation labels of the sample images, so that the sample images have both depth labels and segmentation labels.
  • the segmentation label of the known sample image can also be used to perform depth estimation on the sample image to obtain the depth label of the sample image, so that the sample image has both the depth label and the segmentation label.
  • the accuracy of the predicted depth map of the target object in the target image obtained by the first method is higher.
  • FIG. 12 schematically shows a flowchart of yet another image processing method according to an embodiment of the present disclosure.
  • referring to FIG. 12, the operations of the image processing method may include the following.
  • the input image is normalized to obtain an RGB image, which is used as the input of the image processing model to obtain the predicted depth map and predicted segmentation map of the target image, respectively.
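  • The inference flow of FIG. 12 can be sketched as below; scaling to [0, 1] is an assumed normalization scheme, since the exact constants are not given in this text:

```python
import numpy as np
import torch

def preprocess(image_bgr: np.ndarray) -> torch.Tensor:
    """Normalize the input image into the RGB tensor fed to the model."""
    image_rgb = image_bgr[..., ::-1].astype(np.float32) / 255.0  # BGR -> RGB
    return torch.from_numpy(image_rgb).permute(2, 0, 1).unsqueeze(0)

# End-to-end flow, reusing the earlier sketches (names are assumptions):
# x = preprocess(frame)                      # frame: H x W x 3 uint8 array
# seg_logits, depth = model(x)               # ImageProcessingModel instance
# seg_map = seg_logits.argmax(dim=1)         # predicted segmentation map
# target_depth = extract_target_depth(depth.squeeze().numpy(),
#                                     seg_map.squeeze().numpy(),
#                                     target_class=15)
```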
  • the image processing method can be applied to an application scenario in which a monocular robot finds or avoids a specific object, wherein the specific object is the target object.
  • FIG. 13 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
  • the image processing apparatus 1300 includes an acquisition module 1310 , a first processing module 1320 , a determination module 1330 and a second processing module 1340 .
  • the acquiring module 1310 , the first processing module 1320 , the determining module 1330 and the second processing module 1340 are connected in communication.
  • the acquiring module 1310 is configured to acquire a target image, wherein the target image includes a target object and a non-target object.
  • the first processing module 1320 is configured to perform image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively.
  • the determining module 1330 is configured to determine the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object.
  • the second processing module 1340 is configured to process the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object.
  • according to the embodiments of the present disclosure, the target image includes a target object and a non-target object; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain the predicted depth map of the target object.
  • since the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position, the technical problem in the related art that depth estimation for the target object in the target image is difficult to realize is at least partially overcome; the depth of the target object in the target image is determined more accurately, and the method generalizes well.
  • the first processing module 1320 includes a first processing unit.
  • the first processing unit is used to process the target image with an image processing model to obtain the predicted segmentation map and the predicted depth map, respectively, wherein the image processing model is trained using training samples, and the training samples include sample images and the depth labels and segmentation labels of the sample images.
  • the image processing model includes a feature extraction network, an image segmentation network, and a depth estimation network.
  • the first processing unit includes a first processing subunit, a second processing subunit, a third processing subunit, a fourth processing subunit, a fifth processing subunit, and a sixth processing subunit.
  • the first processing subunit is used to process the target image by using the feature extraction network to obtain the first intermediate feature map.
  • the second processing subunit is used to process the first intermediate feature map by using the image segmentation network to obtain the second intermediate feature map.
  • the third processing subunit is used to process the first intermediate feature map by using the depth estimation network to obtain a third intermediate feature map.
  • the fourth processing subunit is configured to generate a fourth intermediate feature map according to the second intermediate feature map and the third intermediate feature map.
  • the fifth processing subunit is used for processing the fourth intermediate feature map by using the depth estimation network to obtain the predicted depth map of the target image.
  • the sixth processing subunit is used to process the second intermediate feature map by using the image segmentation network to obtain the predicted segmentation map of the target image.
  • the image processing model is obtained by training using training samples, and may include the following operations.
  • training samples are obtained, and a fully convolutional neural network model is trained with the training samples to obtain the image processing model.
  • the fully convolutional neural network model includes an initial feature extraction network, an initial image segmentation network, and an initial depth estimation network.
  • Using the training samples to train a fully convolutional neural network model to obtain an image processing model may include the following operations.
  • the sample image is processed using the initial feature extraction network to obtain a fifth intermediate feature map.
  • the fifth intermediate feature map is processed using the initial image segmentation network to obtain a sixth intermediate feature map.
  • the fifth intermediate feature map is processed using the initial depth estimation network to obtain a seventh intermediate feature map.
  • an eighth intermediate feature map is generated according to the sixth intermediate feature map and the seventh intermediate feature map.
  • the eighth intermediate feature map is processed using the initial depth estimation network to obtain the predicted depth map of the sample image.
  • the sixth intermediate feature map is processed using the initial image segmentation network to obtain a predicted segmentation map of the sample image. The depth label, the predicted depth map, the segmentation label, and the predicted segmentation map of the sample image are input into the loss function of the fully convolutional neural network model to obtain the loss result. The network parameters of the fully convolutional neural network model are adjusted according to the loss result until the loss function converges, and the trained fully convolutional neural network model is used as the image processing model.
  • performing image segmentation processing on a sample image to obtain a segmentation label of the sample image may include the following operations.
  • Instance segmentation is performed on the sample image to obtain the instance segmentation label of the sample image.
  • the semantic segmentation label of the sample image is obtained according to the instance segmentation label of the sample image.
  • the semantic segmentation label of the sample image is used as the segmentation label of the sample image.
  • the second processing module 1340 includes a second processing unit.
  • the second processing unit is configured to set the pixel values at positions in the predicted depth map of the target image other than the position of the target object to a preset pixel value to obtain the predicted depth map of the target object.
  • image segmentation processing includes semantic segmentation processing or instance segmentation processing.
  • any of the modules or units according to the embodiments of the present disclosure, or at least part of the functions of any of them, may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be divided into multiple modules for implementation. Any one or more of the modules and units according to the embodiments of the present disclosure may be at least partially implemented as hardware circuits, such as Field Programmable Gate Arrays (FPGA), Programmable Logic Arrays (PLA), systems-on-chip, systems-on-substrate, systems-in-package, Application Specific Integrated Circuits (ASIC), or any other reasonable means of hardware or firmware that can integrate or package a circuit, or may be implemented in any one of the three implementation manners of software, hardware, and firmware, or in an appropriate combination of any of them. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least in part as computer program modules, which, when executed, may perform corresponding functions.
  • any one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be combined in one module/unit for implementation, or any one of the modules/units may be split into multiple modules/units. Alternatively, at least part of the functionality of one or more of these modules/units may be combined with at least part of the functionality of other modules/units and implemented in one module/unit.
  • At least one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), Programmable logic array (PLA), system-on-chip, system-on-substrate, system-on-package, application-specific integrated circuit (ASIC), or hardware or firmware that can be implemented by any other reasonable means of integrating or packaging circuits, Or it can be implemented in any one of the three implementation manners of software, hardware and firmware, or in an appropriate combination of any of them.
  • at least one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be implemented at least partially as a computer program module, which, when executed, may perform corresponding functions .
  • image processing apparatus part in the embodiment of the present disclosure corresponds to the image processing method part in the embodiment of the present disclosure, and the description of the image processing apparatus part refers to the image processing method part, which is not repeated here.
  • FIG. 14 schematically shows a block diagram of an electronic device suitable for implementing the method described above, according to an embodiment of the present disclosure.
  • the electronic device shown in FIG. 14 is only an example, and should not impose any limitation on the function and scope of use of the embodiments of the present disclosure.
  • an electronic device 1400 includes a processor 1401, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1402 or a program loaded from a storage section 1408 into a random access memory (RAM) 1403.
  • the processor 1401 may include, for example, a general-purpose microprocessor (eg, a CPU), an instruction set processor and/or a related chipset, and/or a special-purpose microprocessor (eg, an application-specific integrated circuit (ASIC)), among others.
  • the processor 1401 may also include on-board memory for caching purposes.
  • the processor 1401 may include a single processing unit or multiple processing units for performing different actions of the method flow according to the embodiments of the present disclosure.
  • the processor 1401, the ROM 1402, and the RAM 1403 are connected to each other through a bus 1404.
  • the processor 1401 performs various operations of the method flow according to an embodiment of the present disclosure by executing programs in the ROM 1402 and/or the RAM 1403. Note that the program may also be stored in one or more memories other than ROM 1402 and RAM 1403.
  • the processor 1401 may also perform various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
  • the electronic device 1400 may also include an input/output (I/O) interface 1405 that is also connected to the bus 1404 .
  • the electronic device 1400 may also include one or more of the following components connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage section 1408 including a hard disk and the like; and a communication section 1409 including a network interface card such as a LAN card, a modem, and the like. The communication section 1409 performs communication processing via a network such as the Internet.
  • a drive 1410 is also connected to the I/O interface 1405 as needed.
  • a removable medium 1411 such as a magnetic disk, an optical disk, a magneto-optical disk, a semiconductor memory, etc., is mounted on the drive 1410 as needed so that a computer program read therefrom is installed into the storage section 1408 as needed.
  • the method flow according to an embodiment of the present disclosure may be implemented as a computer software program.
  • embodiments of the present disclosure include a computer program product comprising a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart.
  • the computer program may be downloaded and installed from the network via the communication portion 1409, and/or installed from the removable medium 1411.
  • the above-described functions defined in the system of the embodiment of the present disclosure are performed.
  • the above-described systems, apparatuses, apparatuses, modules, units, etc. can be implemented by computer program modules.
  • the present disclosure also provides a computer-readable storage medium.
  • the computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments, or it may exist alone without being assembled into the device/apparatus/system.
  • the above-mentioned computer-readable storage medium carries one or more programs, and when the one or more programs are executed, the method according to the embodiments of the present disclosure is implemented.
  • the computer-readable storage medium may be a non-volatile computer-readable storage medium. Examples may include, but are not limited to, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
  • a computer-readable storage medium may include one or more memories other than the ROM 1402 and/or RAM 1403 described above.
  • according to the embodiments of the present disclosure, the target image includes a target object and a non-target object; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain the predicted depth map of the target object.
  • since the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position, the technical problem in the related art that depth estimation for the target object in the target image is difficult to realize is at least partially overcome; the depth of the target object in the target image is determined more accurately, and the method generalizes well.
  • the embodiments of the present disclosure also include a computer program product, which includes a computer program, the computer program includes program codes for executing the methods provided by the embodiments of the present disclosure, and when the computer program product runs on an electronic device, the program The code is used to enable the electronic device to implement the image processing method provided by the embodiments of the present disclosure.
  • the computer program may rely on a tangible storage medium such as an optical storage device, a magnetic storage device, or the like.
  • the computer program may also be transmitted, distributed in the form of a signal over a network medium, and downloaded and installed through the communication portion 1409, and/or installed from a removable medium 1411.
  • the program code embodied by the computer program may be transmitted using any suitable network medium, including but not limited to: wireless, wired, etc., or any suitable combination of the foregoing.
  • the program code for implementing the computer program provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; specifically, high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages, may be used to implement these computing programs. Programming languages include, but are not limited to, Java, C++, Python, "C", or similar programming languages.
  • the program code may execute entirely on the user computing device, partly on the user device, partly on a remote computing device, or entirely on the remote computing device or server.
  • the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (for example, via the Internet using an Internet service provider).
  • each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions.
  • the functions noted in the blocks may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved.
  • each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a special-purpose hardware-based system that performs the specified functions or operations, or by a combination of special-purpose hardware and computer instructions.
  • Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure may be combined in various ways, even if such combinations are not expressly recited in the present disclosure.
  • In particular, the features recited in the various embodiments and/or claims of the present disclosure may be combined without departing from the spirit and teachings of the present disclosure, and all such combinations fall within the scope of the present disclosure.

Abstract

Embodiments of the present disclosure provide an image processing method and apparatus, an electronic device, a medium, and a computer program product. The method comprises: acquiring a target image, the target image comprising a target object and a non-target object; performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; determining the position of the target object in the predicted depth map of the target image according to a predicted segmentation map of the target object; and processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.

Description

Image processing method, apparatus, electronic device, medium and computer program product
This application claims priority to Chinese Patent Application No. 202110002321.5, filed on January 4, 2021, the contents of which are incorporated herein by reference.
Technical Field
The embodiments of the present disclosure relate to the field of computer technology, and more particularly, to an image processing method, apparatus, electronic device, medium, and computer program product.
Background
In the field of computer vision, depth estimation is part of three-dimensional reconstruction and requires estimating depth information from a two-dimensional image. For certain specific tasks, such as a monocular robot avoiding or finding a target object (for example, a person), segmenting the target object from the two-dimensional image and estimating the depth of the target object are extremely important.
In the process of realizing the concept of the present disclosure, the inventors found that the related art has at least the following problem: it is difficult to achieve depth estimation for a target object in a target image using the related art.
Summary of the Invention
In view of this, embodiments of the present disclosure provide an image processing method, apparatus, electronic device, medium, and computer program product.
An aspect of the embodiments of the present disclosure provides an image processing method, including: acquiring a target image, where the target image includes a target object and a non-target object; performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
Another aspect of the embodiments of the present disclosure provides an image processing apparatus, including: an acquisition module for acquiring a target image, where the target image includes a target object and a non-target object; a first processing module for performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; a determination module for determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and a second processing module for processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
Another aspect of the embodiments of the present disclosure provides an electronic device, including: one or more processors; and a memory for storing one or more programs, where the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the method described above.
Another aspect of the embodiments of the present disclosure provides a computer-readable storage medium having executable instructions stored thereon which, when executed by a processor, cause the processor to implement the method described above.
Another aspect of the embodiments of the present disclosure provides a computer program product, the computer program product including a computer program which, when executed by a processor, implements the method described above.
According to the embodiments of the present disclosure, a target image including a target object and a non-target object is acquired; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain a predicted depth map of the target object. Because image segmentation and depth estimation are combined, the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position. The technical problem in the related art that depth estimation for a target object in a target image is difficult to achieve is therefore at least partially overcome, the depth of the target object in the target image can be determined relatively accurately, and the method generalizes well.
Brief Description of the Drawings
The above and other objects, features, and advantages of the present disclosure will become more apparent from the following description of embodiments of the present disclosure with reference to the accompanying drawings, in which:
FIG. 1 schematically shows an exemplary system architecture to which the image processing method and apparatus according to the embodiments of the present disclosure can be applied;
FIG. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure;
FIG. 3 schematically shows a structural diagram of an image processing model according to an embodiment of the present disclosure;
FIG. 4 schematically shows a flowchart of another image processing method according to an embodiment of the present disclosure;
FIG. 5 schematically shows a schematic diagram of a target image according to an embodiment of the present disclosure;
FIG. 6 schematically shows a predicted depth map of a target image according to an embodiment of the present disclosure;
FIG. 7 schematically shows a predicted segmentation map of a target image according to an embodiment of the present disclosure;
FIG. 8 schematically shows a predicted depth map of a target object according to an embodiment of the present disclosure;
FIG. 9 schematically shows a predicted depth map of another target object according to an embodiment of the present disclosure;
FIG. 10 schematically shows a schematic diagram of still another target object according to an embodiment of the present disclosure;
FIG. 11 schematically shows a flowchart of still another image processing method according to an embodiment of the present disclosure;
FIG. 12 schematically shows a flowchart of yet another image processing method according to an embodiment of the present disclosure;
FIG. 13 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure; and
FIG. 14 schematically shows a block diagram of an electronic device suitable for the image processing method according to an embodiment of the present disclosure.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In the following detailed description, numerous specific details are set forth for ease of explanation in order to provide a thorough understanding of the embodiments of the present disclosure. It will be apparent, however, that one or more embodiments may be practiced without these specific details. In addition, in the following description, descriptions of well-known structures and techniques are omitted to avoid unnecessarily obscuring the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "comprising", "including", and the like used herein indicate the presence of the stated features, steps, operations, and/or components, but do not preclude the presence or addition of one or more other features, steps, operations, or components.
All terms (including technical and scientific terms) used herein have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be construed to have meanings consistent with the context of this specification, and should not be construed in an idealized or overly rigid manner.
Where an expression similar to "at least one of A, B, and C, etc." is used, it should generally be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C, etc.). Where an expression similar to "at least one of A, B, or C, etc." is used, it should likewise be interpreted according to the meaning commonly understood by those skilled in the art (for example, "a system having at least one of A, B, or C" shall include, but not be limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C, etc.).
Embodiments of the present disclosure provide an image processing method, an image processing apparatus, and an electronic device applying the method. The method includes acquiring a target image, where the target image includes a target object and a non-target object; performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
FIG. 1 schematically shows an exemplary system architecture 100 to which the image processing method or apparatus according to the embodiments of the present disclosure can be applied. It should be noted that FIG. 1 is only an example of a system architecture to which the embodiments of the present disclosure can be applied, intended to help those skilled in the art understand the technical content of the present disclosure; it does not mean that the embodiments of the present disclosure cannot be used in other devices, systems, environments, or scenarios.
As shown in FIG. 1, the system architecture 100 according to this embodiment may include terminal devices 101, 102, and 103, a network 104, and a server 105. The network 104 is a medium used to provide communication links between the terminal devices 101, 102, 103 and the server 105. The network 104 may include various connection types, such as wired and/or wireless communication links.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like. Various communication client applications may be installed on the terminal devices 101, 102, and 103, such as image processing applications, model building applications, search applications, instant messaging tools, email clients, and/or social platform software (by way of example only).
The terminal devices 101, 102, and 103 may be various electronic devices having a display screen and supporting image processing and web browsing, including but not limited to smart phones, tablet computers, laptop computers, and desktop computers.
The server 105 may be a server that provides various services, for example, a background management server (by way of example only) that provides support for websites browsed and images processed by users using the terminal devices 101, 102, and 103. The background management server may process, analyze, and store received data such as images and camera information, and feed processing results (such as web pages, information, or data obtained or generated according to user requests) back to the terminal devices.
It should be noted that the image processing method provided by the embodiments of the present disclosure may generally be executed by the server 105. Correspondingly, the image processing apparatus provided by the embodiments of the present disclosure may generally be provided in the server 105. The image processing method provided by the embodiments of the present disclosure may also be executed by a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Correspondingly, the image processing apparatus provided by the embodiments of the present disclosure may also be provided in a server or server cluster that is different from the server 105 and can communicate with the terminal devices 101, 102, 103 and/or the server 105. Alternatively, the image processing method provided by the embodiments of the present disclosure may also be executed by the terminal device 101, 102, or 103, or by another terminal device different from the terminal devices 101, 102, and 103. Correspondingly, the image processing apparatus provided by the embodiments of the present disclosure may also be provided in the terminal device 101, 102, or 103, or in another terminal device different from the terminal devices 101, 102, and 103.
For example, the target image may be originally stored in any one of the terminal devices 101, 102, or 103 (for example, the terminal device 101, but not limited thereto), or stored on an external storage device and imported into the terminal device 101. The terminal device 101 may then execute the image processing method provided by the embodiments of the present disclosure locally, or send the target image to another terminal device, server, or server cluster, which then executes the image processing method provided by the embodiments of the present disclosure on the received target image.
It should be understood that the numbers of terminal devices, networks, and servers in FIG. 1 are merely illustrative. There may be any number of terminal devices, networks, and servers according to implementation needs.
FIG. 2 schematically shows a flowchart of an image processing method according to an embodiment of the present disclosure.
As shown in FIG. 2, the method includes operations S210 to S240.
In operation S210, a target image is acquired, where the target image includes a target object and a non-target object.
According to an embodiment of the present disclosure, the target image may be a monocular image and may include a target object and a non-target object. The target object may be a person in the target image, and the non-target objects may be background objects in the target image, such as tables, trees, and cars, but this is not limiting: any category may be designated as the target object according to actual needs, and anything of a category different from the designated target object is a non-target object. There may be one or more target objects.
In operation S220, image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively.
According to an embodiment of the present disclosure, performing depth estimation processing on the target image may mean estimating the depth of each pixel of the target image according to the depth relationships reflected by the relationships between pixel values.
According to an embodiment of the present disclosure, the image segmentation may be semantic segmentation, but is not limited thereto and may also be instance segmentation. Semantic segmentation assigns a category to each pixel of an image, yielding a semantic segmentation mask whose size corresponds to the image size, i.e., the predicted semantic segmentation map of the image; however, semantic segmentation does not distinguish different objects of the same category, i.e., it does not distinguish instances. Instance segmentation not only classifies pixels but also distinguishes different objects of the same category, i.e., it distinguishes instances; performing instance segmentation on an image yields an instance segmentation mask whose size corresponds to the image size, i.e., the predicted instance segmentation map of the image. The predicted semantic segmentation map and the predicted instance segmentation map of an image may be collectively referred to as the predicted segmentation map of the image. In the embodiments of the present disclosure, different categories may be represented by different colors; correspondingly, different colors in the predicted segmentation map of an image represent different categories.
According to an embodiment of the present disclosure, semantic segmentation or instance segmentation may be used to perform image segmentation processing on the target image to obtain the predicted segmentation map of the target image. The size of the predicted segmentation map of the target image may be determined according to the actual situation and is not specifically limited here. For example, the size of the predicted segmentation map may be the same as that of the target image, or it may be one half of the size of the target image.
According to an embodiment of the present disclosure, using a semantic segmentation method for the image segmentation processing can achieve real-time speed, thereby meeting the processing requirements of real-time tasks.
In operation S230, the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object.
According to an embodiment of the present disclosure, the position of the target object in the predicted segmentation map of the target image may be determined from the predicted segmentation map of the target object. Since the position of the target object in the predicted segmentation map of the target image corresponds to the position of the target object in the predicted depth map of the target image, the position of the target object in the predicted depth map can be obtained based on its position in the predicted segmentation map.
In operation S240, the predicted depth map is processed according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
According to an embodiment of the present disclosure, in related art considered while implementing the concept of the present disclosure, depth estimation may be performed on a monocular image by combining gradient and texture features. That is, the gradient information and texture information of the target image may be used as depth cues to assist a deep convolutional network in learning the depth information of the target image, yielding the predicted depth map of the target image. The target image may also be input into a depth estimation network based on camera pose estimation to obtain the predicted depth map of the target image.
However, the above methods can only perform depth estimation on the entire target image; it is difficult to extract the depth of the target object in the target image.
In related art considered while implementing the present disclosure, a sample-based learning method may also be used for depth estimation. A sample-based learning method constructs a data set, converts the problem of estimating the depth of the target object into a retrieval problem, and retrieves from the data set by feature matching to obtain a depth estimation result for the target object in the target image. A sample-based learning method can estimate the depth of a target object in a target image that matches an image in the data set. However, if no image matching the target image can be retrieved from the data set, depth estimation for the target object in the target image cannot be achieved. In addition, the above method can only estimate the depth of the target object in the target image; it cannot obtain the position of the target object in the target image. The method generalizes poorly and its estimation accuracy is not high.
According to the embodiments of the present disclosure, because image segmentation and depth estimation are combined, the position of the target object in the predicted depth map of the target image can be obtained from the predicted segmentation map of the target image, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position. The depth of the target object in the target image can therefore be determined relatively accurately. In addition, since the depth estimation is not implemented by retrieval and matching, it is less affected by the samples; the solution of the embodiments of the present disclosure therefore generalizes well.
It should be noted that if there are at least two target objects, the method provided by operations S210 to S240 above may be applied to each target object to obtain its predicted depth map. For presentation, the predicted depth maps of all target objects may be presented in a single predicted depth map, or the predicted depth map of each target object may be presented separately.
According to the technical solutions of the embodiments of the present disclosure, a target image including a target object and a non-target object is acquired; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain a predicted depth map of the target object. Because image segmentation and depth estimation are combined, the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position. The technical problem in the related art that depth estimation for a target object in a target image is difficult to achieve is therefore at least partially overcome, the depth of the target object in the target image can be determined relatively accurately, and the method generalizes well.
The method shown in FIG. 2 is further described below with reference to FIGS. 3 to 10 in conjunction with specific embodiments.
FIG. 3 schematically shows a structural diagram of an image processing model according to an embodiment of the present disclosure.
As shown in FIG. 3, the image processing model includes a feature extraction network, an image segmentation network, and a depth estimation network.
According to an embodiment of the present disclosure, the present disclosure constructs an image processing model obtained by training an encoder-decoder network with training samples. The target image to be predicted is input into the image processing model, which outputs the predicted segmentation map and the predicted depth map of the target image, from which the predicted depth map of the target object in the target image is then obtained. The image processing model constructed in the present disclosure makes up for the deficiency in the related art that a predicted segmentation map and a predicted depth map of the target image cannot be output at the same time.
FIG. 4 schematically shows a flowchart of another image processing method according to an embodiment of the present disclosure.
As shown in FIG. 4, processing the target image with the image processing model to obtain the predicted segmentation map and the predicted depth map of the target image may include the following operations S410 to S460.
In operation S410, the target image is processed by the feature extraction network to obtain a first intermediate feature map.
In operation S420, the first intermediate feature map is processed by the image segmentation network to obtain a second intermediate feature map.
In operation S430, the first intermediate feature map is processed by the depth estimation network to obtain a third intermediate feature map.
In operation S440, a fourth intermediate feature map is generated according to the second intermediate feature map and the third intermediate feature map.
In operation S450, the fourth intermediate feature map is processed by the depth estimation network to obtain the predicted depth map of the target image.
In operation S460, the second intermediate feature map is processed by the image segmentation network to obtain the predicted segmentation map of the target image.
As shown in FIG. 3 and FIG. 4, in order to achieve real-time depth estimation of the target object, a MobileNet+ASPP (Atrous Spatial Pyramid Pooling) module is used as the feature extraction network, i.e., the encoding network, and depthwise separable convolutions are used as the decoding networks, i.e., the image segmentation network and the depth estimation network.
According to an embodiment of the present disclosure, the height and width of the target image are denoted H and W, respectively. The target image is input to the MobileNet module, which outputs a feature map f2; during this process, the output of an intermediate layer is taken as a feature map f1. The feature map f2 is input into the ASPP module, whose output is fused with f2 to produce a feature map f3. The feature map f3 is then upsampled and fused with the feature map f1, and the resulting feature map f4 is taken as the output of the feature extraction network; f4 is the first intermediate feature map. (The sizes of f1 and f2 and the upsampling target size, each a fixed fraction of H×W, are given only as formula images in the original publication.)
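As a rough, non-authoritative illustration, the following PyTorch sketch arranges the fusion steps just described; the backbone and ASPP stand-ins, channel counts, and strides are assumptions, since the exact feature-map sizes appear only as formula images in the original publication.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TinyASPP(nn.Module):
        # Stand-in for the ASPP module: parallel dilated 3x3 convolutions
        # whose outputs are concatenated and projected back to c_out channels.
        def __init__(self, c_in, c_out=256, rates=(1, 6, 12)):
            super().__init__()
            self.branches = nn.ModuleList(
                [nn.Conv2d(c_in, c_out, 3, padding=r, dilation=r) for r in rates])
            self.project = nn.Conv2d(c_out * len(rates), c_out, 1)

        def forward(self, x):
            return self.project(torch.cat([b(x) for b in self.branches], dim=1))

    class FeatureExtractor(nn.Module):
        # Encoder sketch: a MobileNet-style backbone yields an intermediate
        # map f1 and a deeper map f2; the ASPP output is fused with f2 to
        # give f3, which is upsampled to f1's size and fused with f1 into f4.
        def __init__(self):
            super().__init__()
            self.stage1 = nn.Sequential(  # stand-in for early MobileNet layers
                nn.Conv2d(3, 24, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(24, 24, 3, stride=2, padding=1), nn.ReLU())
            self.stage2 = nn.Sequential(  # stand-in for later MobileNet layers
                nn.Conv2d(24, 96, 3, stride=2, padding=1), nn.ReLU(),
                nn.Conv2d(96, 96, 3, stride=2, padding=1), nn.ReLU())
            self.aspp = TinyASPP(96)
            self.fuse23 = nn.Conv2d(256 + 96, 256, 1)              # fuse ASPP output with f2
            self.fuse13 = nn.Conv2d(256 + 24, 256, 3, padding=1)   # fuse f1 with upsampled f3

        def forward(self, x):
            f1 = self.stage1(x)                           # intermediate-layer feature map f1
            f2 = self.stage2(f1)                          # backbone output feature map f2
            f3 = self.fuse23(torch.cat([self.aspp(f2), f2], dim=1))
            f3 = F.interpolate(f3, size=f1.shape[2:], mode="bilinear", align_corners=False)
            f4 = self.fuse13(torch.cat([f1, f3], dim=1))  # first intermediate feature map
            return f4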
According to an embodiment of the present disclosure, both the image segmentation network and the depth estimation network take the same feature map f4 (i.e., the first intermediate feature map) as input and output the second intermediate feature map (feature map f5) and the third intermediate feature map (feature map f6), respectively. It should be noted that, because the first intermediate feature map produced by the feature extraction network serves as the input of both the depth estimation network and the image segmentation network, the feature extraction network comprehensively considers both the detailed information and the abstract information of the target image.
According to an embodiment of the present disclosure, in the predicted depth map the depth values of a single target object are relatively close to one another, while the gradient of the depth values at the boundary of the target object may be large. Therefore, to obtain a more accurate predicted depth map, the second intermediate feature map (feature map f5) output by the image segmentation network can be input into the depth estimation network and combined with the third intermediate feature map (feature map f6) to obtain the fourth intermediate feature map (feature map f7). The fourth intermediate feature map is passed through convolution and upsampling to obtain the predicted depth map, and the second intermediate feature map is passed through convolution and upsampling to obtain the predicted segmentation map. (The output sizes, fixed fractions of H×W, are given only as formula images in the original publication.)
According to an embodiment of the present disclosure, inputting the second intermediate feature map from the image segmentation network into the depth estimation network corrects the prediction of the depth estimation network, making the depth estimation result more accurate.
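A minimal sketch of the two decoding branches under the same assumptions follows (the channel counts, number of classes, and upsampling factor are illustrative, not taken from the publication):
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def separable(c_in, c_out):
        # Depthwise separable convolution, as used in the decoding networks.
        return nn.Sequential(
            nn.Conv2d(c_in, c_in, 3, padding=1, groups=c_in),  # depthwise
            nn.Conv2d(c_in, c_out, 1), nn.ReLU())              # pointwise

    class DualHeadDecoder(nn.Module):
        # Decoder sketch: the segmentation branch maps f4 to f5 and the depth
        # branch maps f4 to f6; f5 is fed into the depth branch and combined
        # with f6 to give f7, from which the depth map is predicted.
        def __init__(self, c4=256, c_mid=128, num_classes=59):
            super().__init__()
            self.seg_branch = separable(c4, c_mid)
            self.depth_branch = separable(c4, c_mid)
            self.fuse = nn.Conv2d(2 * c_mid, c_mid, 1)
            self.seg_head = nn.Conv2d(c_mid, num_classes, 1)
            self.depth_head = nn.Conv2d(c_mid, 1, 1)

        def forward(self, f4):
            f5 = self.seg_branch(f4)                    # second intermediate feature map
            f6 = self.depth_branch(f4)                  # third intermediate feature map
            f7 = self.fuse(torch.cat([f5, f6], dim=1))  # fourth intermediate feature map
            seg = F.interpolate(self.seg_head(f5), scale_factor=2,
                                mode="bilinear", align_corners=False)
            depth = F.interpolate(self.depth_head(f7), scale_factor=2,
                                  mode="bilinear", align_corners=False)
            return seg, depth
Feeding f5 into the depth branch is what lets segmentation structure sharpen the depth prediction at object boundaries, matching the motivation described above.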
According to an embodiment of the present disclosure, processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object may include the following operation.
The pixel values at positions in the predicted depth map of the target image other than the position of the target object are set to a preset pixel value, yielding the predicted depth map of the target object.
According to an embodiment of the present disclosure, the pixel values at positions in the predicted depth map of the target image other than the position of the target object may be set to a preset pixel value to obtain the predicted depth map of the target object. The preset pixel value may be set according to the actual situation and is not specifically limited here; for example, it may be 0.
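For instance, this masking step could be carried out as in the following sketch (NumPy, with a preset value of 0; the class-id convention is an assumption):
    import numpy as np

    def extract_object_depth(pred_depth, pred_seg, target_class, preset=0.0):
        # Keep depth values only at positions the predicted segmentation map
        # assigns to the target class; every other position is set to the
        # preset pixel value (0 here, as in the example above).
        mask = (pred_seg == target_class)
        return np.where(mask, pred_depth, preset)
If, say, persons were labeled with class id 1 (a hypothetical value), then extract_object_depth(pred_depth, pred_seg, 1) would yield the predicted depth map of the person.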
By way of example, FIG. 5 schematically shows a target image according to an embodiment of the present disclosure. FIG. 6 schematically shows the predicted depth map of the target image. FIG. 7 schematically shows the predicted segmentation map of the target image. FIG. 8 schematically shows a predicted depth map of a target object according to an embodiment of the present disclosure; the target object in FIG. 8 is a person. FIG. 9 schematically shows a predicted depth map of another target object; the target object in FIG. 9 is a refrigerator. FIG. 10 schematically shows still another example; FIG. 10 contains multiple target objects.
As shown in FIGS. 5 to 10, the predicted depth map of the target object is obtained by applying the predicted segmentation map of the target image to the predicted depth map of the target image. In FIGS. 7 to 10, black represents non-target-object regions.
According to an embodiment of the present disclosure, the target image is processed by the image processing model to obtain the predicted segmentation map and the predicted depth map of the target image, respectively, where the image processing model is trained with training samples, each training sample including a sample image and the depth label and segmentation label of the sample image.
According to an embodiment of the present disclosure, training the image processing model with the training samples may include the following operations.
Training samples are acquired, and a fully convolutional neural network model is trained with the training samples to obtain the image processing model.
According to an embodiment of the present disclosure, the fully convolutional neural network model includes an initial feature extraction network, an initial image segmentation network, and an initial depth estimation network.
According to an embodiment of the present disclosure, training the fully convolutional neural network model with the training samples to obtain the image processing model may include the following operations.
The sample image is processed by the initial feature extraction network to obtain a fifth intermediate feature map. The fifth intermediate feature map is processed by the initial image segmentation network to obtain a sixth intermediate feature map. The fifth intermediate feature map is processed by the initial depth estimation network to obtain a seventh intermediate feature map. An eighth intermediate feature map is generated according to the sixth intermediate feature map and the seventh intermediate feature map. The eighth intermediate feature map is processed by the initial depth estimation network to obtain a predicted depth map of the sample image. The sixth intermediate feature map is processed by the initial image segmentation network to obtain a predicted segmentation map of the sample image. The depth label, predicted depth map, segmentation label, and predicted segmentation map of the sample image are input into the loss function of the fully convolutional neural network model, which outputs a loss result. The network parameters of the fully convolutional neural network model are adjusted according to the loss result until the loss function converges. The trained fully convolutional neural network model is used as the image processing model.
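The publication does not specify the form of the loss function, so the following training-step sketch assumes a simple sum of an L1 depth term and a cross-entropy segmentation term over the two outputs:
    import torch
    import torch.nn.functional as F

    def train_step(model, optimizer, image, depth_label, seg_label):
        # One joint optimization step over both tasks. `model` is assumed
        # to return (predicted segmentation logits, predicted depth map).
        pred_seg, pred_depth = model(image)
        loss = (F.l1_loss(pred_depth.squeeze(1), depth_label)
                + F.cross_entropy(pred_seg, seg_label, ignore_index=-1))
        optimizer.zero_grad()
        loss.backward()     # adjust network parameters according to the loss
        optimizer.step()
        return loss.item()  # training repeats until the loss converges
The ignore_index=-1 setting is a natural fit for the label convention described later, where uninvolved categories are marked as -1, though the patent does not state how such pixels are handled.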
According to an embodiment of the present disclosure, this can be understood with reference to FIG. 3: in the training process of the image processing model, the fifth intermediate feature map can be understood as the feature map f4 in FIG. 3, the sixth intermediate feature map as the feature map f5, the seventh intermediate feature map as the feature map f6, and the eighth intermediate feature map as the feature map f7. It should be noted that after training is completed, the initial feature extraction network, the initial image segmentation network, and the initial depth estimation network are referred to as the feature extraction network, the image segmentation network, and the depth estimation network, respectively.
FIG. 11 schematically shows a flowchart of still another image processing method according to an embodiment of the present disclosure.
As shown in FIG. 11, performing image segmentation processing on the sample image to obtain the segmentation label of the sample image may include the following operations S1110 to S1130.
In operation S1110, instance segmentation processing is performed on the sample image to obtain an instance segmentation label of the sample image.
In operation S1120, a semantic segmentation label of the sample image is obtained according to the instance segmentation label of the sample image.
In operation S1130, the semantic segmentation label of the sample image is used as the segmentation label of the sample image.
According to an embodiment of the present disclosure, in instance segmentation an image may include multiple instances belonging to the same category, which need to be distinguished. For example, a target image may include multiple objects belonging to the category "person", i.e., multiple people; in instance segmentation, these people must be distinguished, and each person receives a corresponding instance segmentation label.
According to an embodiment of the present disclosure, semantic segmentation classifies each pixel of an image by category but does not distinguish instances. For example, if a target image includes multiple people, semantic segmentation does not distinguish between them; all of them receive the same semantic segmentation label.
According to an embodiment of the present disclosure, in order to produce more accurate semantic segmentation labels, a Mask_RCNN (Mask Region-based Convolutional Neural Network) network may be used to output instance segmentation labels on three depth estimation sample image databases, CAD_60, CAD_120, and EPFL; these instance segmentation labels are then converted into semantic segmentation labels, which serve as the semantic segmentation labels on the three depth estimation sample image databases. This is not limiting, however: semantic segmentation labels may also be used on their own. Because the embodiments of the present disclosure use depth estimation sample image databases, the depth labels of the sample images are available.
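The instance-to-semantic conversion can be pictured as in the following sketch, where the mask and class-id formats are assumptions about the detector's output rather than a documented interface:
    import numpy as np

    def instances_to_semantic(instance_masks, instance_classes, shape, background=-1):
        # Collapse per-instance boolean masks into one semantic label map:
        # every instance of a class receives the same label, so instances
        # are no longer distinguished. Unlabeled pixels keep the background value.
        semantic = np.full(shape, background, dtype=np.int64)
        for mask, cls in zip(instance_masks, instance_classes):
            semantic[mask] = cls
        return semantic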
According to other embodiments of the present disclosure, Mask_RCNN can detect and segment objects of 82 categories (including a background category). In practical applications, fewer than 82 categories may appear in a depth estimation sample image database; directly using all 82 categories as segmentation labels during segmentation processing would widen the segmentation scope and increase the error probability of the segmentation processing.
According to an embodiment of the present disclosure, 59 of these categories appear in the sample image databases used. In order to construct dense segmentation labels, the Mask_RCNN categories can therefore be remapped as shown in Table 1, with uninvolved categories marked as -1, thereby reducing the error probability and improving the segmentation effect and accuracy on the basis of the image segmentation processing; a sketch of such a remapping follows Table 1.
Table 1 (the category mapping table is reproduced only as an image in the original publication)
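Since the actual mapping survives only as an image, the sketch below uses purely illustrative id pairs; it demonstrates only the remapping mechanism, with uninvolved categories collapsing to -1:
    import numpy as np

    # Hypothetical fragment of the Table 1 mapping: detector class ids on the
    # left, remapped label ids on the right. The concrete pairs here are
    # placeholders, not the real table.
    CLASS_REMAP = {0: 0, 1: 1, 62: 2}

    def remap_labels(label_map, remap=CLASS_REMAP, default=-1):
        # Any class id absent from the mapping falls back to -1 (uninvolved).
        out = np.full_like(label_map, default)
        for src, dst in remap.items():
            out[label_map == src] = dst
        return out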
According to an embodiment of the present disclosure, the instance segmentation labels obtained with the Mask_RCNN network are highly accurate, which is beneficial to the construction of the image processing model.
According to an embodiment of the present disclosure, semantic segmentation may also be performed directly on a sample image database that has depth labels to obtain the semantic segmentation labels of the sample images. Moreover, in addition to the approach of performing image segmentation processing on sample images from a database with depth labels to obtain segmentation labels, so that the sample images have both depth labels and segmentation labels, it is also possible to start from sample images with known segmentation labels and perform depth estimation on them to obtain depth labels, so that the sample images likewise have both depth labels and segmentation labels. However, the predicted depth map of the target object in the target image obtained with the first approach is more accurate.
The technical solutions of the present disclosure are further described below with reference to specific embodiments; the operations of the image processing method may specifically be as follows.
FIG. 12 schematically shows a flowchart of yet another image processing method according to an embodiment of the present disclosure.
According to an embodiment of the present disclosure, as shown in FIG. 12, the operations of the image processing method may include the following.
A data set is created, a fully convolutional neural network model is built, and an image processing model is obtained by training.
The input image is normalized to obtain an RGB image, which serves as the input of the image processing model; the predicted depth map and the predicted segmentation map of the target image are obtained, respectively.
It is determined whether a given predicted segmentation map within the predicted segmentation map of the target image belongs to the target object; if so, the predicted segmentation map of the target object is applied to the predicted depth map of the target image to obtain the predicted depth map of the target object.
In addition, according to an embodiment of the present disclosure, it can also be determined whether the predicted segmentation maps of all target objects have been traversed. If not, the predicted segmentation map of the target object of the next category is traversed; if so, the predicted depth maps of all target objects are output.
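Putting the flow of FIG. 12 together, a per-class extraction loop might look like the following sketch, run after the input image has been normalized and passed through the model (the class ids and preset value are assumptions):
    import numpy as np

    def depth_maps_per_target(pred_seg, pred_depth, target_classes, preset=0.0):
        # Traverse all designated target-object classes; for each class that
        # appears in the predicted segmentation map, apply it to the predicted
        # depth map and collect the resulting per-object depth map.
        results = {}
        for cls in target_classes:
            mask = (pred_seg == cls)
            if mask.any():
                results[cls] = np.where(mask, pred_depth, preset)
        return results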
According to the embodiments of the present disclosure, the image processing method can be applied to scenarios in which a monocular robot finds or avoids a specific object, where the specific object is the target object.
FIG. 13 schematically shows a block diagram of an image processing apparatus according to an embodiment of the present disclosure.
As shown in FIG. 13, the image processing apparatus 1300 includes an acquisition module 1310, a first processing module 1320, a determination module 1330, and a second processing module 1340.
The acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 are communicatively connected.
The acquisition module 1310 is configured to acquire a target image, where the target image includes a target object and a non-target object.
The first processing module 1320 is configured to perform image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively.
The determination module 1330 is configured to determine the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object.
The second processing module 1340 is configured to process the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
According to the technical solutions of the embodiments of the present disclosure, a target image including a target object and a non-target object is acquired; image segmentation processing and depth estimation processing are performed on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively; the position of the target object in the predicted depth map of the target image is determined according to the predicted segmentation map of the target object; and the predicted depth map is processed according to that position to obtain a predicted depth map of the target object. Because image segmentation and depth estimation are combined, the position of the target object in the predicted depth map can be obtained from the predicted segmentation map, and the predicted depth map of the target object can be obtained by processing the predicted depth map according to that position. The technical problem in the related art that depth estimation for a target object in a target image is difficult to achieve is therefore at least partially overcome, the depth of the target object in the target image can be determined relatively accurately, and the method generalizes well.
According to an embodiment of the present disclosure, the first processing module 1320 includes a first processing unit.
The first processing unit is configured to process the target image with the image processing model to obtain the segmentation map and the depth map, respectively, where the image processing model is trained with training samples, each training sample including a sample image and the depth label and segmentation label of the sample image.
According to an embodiment of the present disclosure, the image processing model includes a feature extraction network, an image segmentation network, and a depth estimation network.
According to an embodiment of the present disclosure, the first processing unit includes a first processing subunit, a second processing subunit, a third processing subunit, a fourth processing subunit, a fifth processing subunit, and a sixth processing subunit.
The first processing subunit is configured to process the target image with the feature extraction network to obtain a first intermediate feature map.
The second processing subunit is configured to process the first intermediate feature map with the image segmentation network to obtain a second intermediate feature map.
The third processing subunit is configured to process the first intermediate feature map with the depth estimation network to obtain a third intermediate feature map.
The fourth processing subunit is configured to generate a fourth intermediate feature map according to the second intermediate feature map and the third intermediate feature map.
The fifth processing subunit is configured to process the fourth intermediate feature map with the depth estimation network to obtain the predicted depth map of the target image.
The sixth processing subunit is configured to process the second intermediate feature map with the image segmentation network to obtain the predicted segmentation map of the target image.
According to an embodiment of the present disclosure, training the image processing model by using the training samples may include the following operations.
The training samples are acquired. A fully convolutional neural network model is trained by using the training samples to obtain the image processing model.
According to an embodiment of the present disclosure, the fully convolutional neural network model includes an initial feature extraction network, an initial image segmentation network, and an initial depth estimation network.
Training the fully convolutional neural network model by using the training samples to obtain the image processing model may include the following operations.
The sample image is processed by using the initial feature extraction network to obtain a fifth intermediate feature map. The fifth intermediate feature map is processed by using the initial image segmentation network to obtain a sixth intermediate feature map. The fifth intermediate feature map is processed by using the initial depth estimation network to obtain a seventh intermediate feature map. An eighth intermediate feature map is generated according to the sixth intermediate feature map and the seventh intermediate feature map. The eighth intermediate feature map is processed by using the initial depth estimation network to obtain a predicted depth map of the sample image. The sixth intermediate feature map is processed by using the initial image segmentation network to obtain a predicted segmentation map of the sample image. The depth label, the predicted depth map, the segmentation label, and the predicted segmentation map of the sample image are input into the loss function of the fully convolutional neural network model to obtain a loss result. The network parameters of the fully convolutional neural network model are adjusted according to the loss result until the loss function converges. The trained fully convolutional neural network model is used as the image processing model.
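A minimal training-loop sketch consistent with the above procedure is given below, reusing the illustrative ImageProcessingModel from the earlier sketch. The joint loss (cross-entropy on the segmentation branch plus L1 on the depth branch, equally weighted), the Adam optimizer, and the convergence test are all assumptions; the original text only states that the parameters are adjusted according to the loss result until the loss function converges.

    import torch
    import torch.nn.functional as F

    def train_model(model, loader, epochs: int = 50, lr: float = 1e-4, tol: float = 1e-4):
        # `model` is the two-branch network sketched earlier; each batch from
        # `loader` is assumed to yield (sample image, depth label, segmentation label).
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        prev_loss = float("inf")
        for _ in range(epochs):
            total = 0.0
            for image, depth_label, seg_label in loader:
                pred_seg, pred_depth = model(image)
                # Joint loss over segmentation and depth labels (equal weighting assumed).
                loss = F.cross_entropy(pred_seg, seg_label) + F.l1_loss(pred_depth, depth_label)
                optimizer.zero_grad()
                loss.backward()   # adjust network parameters according to the loss result
                optimizer.step()
                total += loss.item()
            if abs(prev_loss - total) < tol:  # crude convergence test (assumption)
                break
            prev_loss = total
        return model  # the trained model serves as the image processing model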
According to an embodiment of the present disclosure, performing image segmentation processing on the sample image to obtain the segmentation label of the sample image may include the following operations.
Instance segmentation processing is performed on the sample image to obtain an instance segmentation label of the sample image. A semantic segmentation label of the sample image is obtained according to the instance segmentation label of the sample image. The semantic segmentation label of the sample image is used as the segmentation label of the sample image.
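One straightforward way to derive a semantic segmentation label from an instance segmentation label is to map each instance identifier to the semantic class it belongs to. The sketch below assumes integer instance IDs and a hypothetical id_to_class mapping; neither is specified in the original text.

    import numpy as np

    def instance_to_semantic(instance_label: np.ndarray, id_to_class: dict) -> np.ndarray:
        # `instance_label` holds one integer instance ID per pixel (0 = background);
        # `id_to_class` maps each instance ID to a semantic class ID (hypothetical).
        semantic = np.zeros_like(instance_label)
        for instance_id, class_id in id_to_class.items():
            semantic[instance_label == instance_id] = class_id
        return semantic

    # Usage: two object instances (IDs 1 and 2) both collapse to semantic class 1.
    label = np.array([[0, 1, 1], [2, 2, 0]])
    print(instance_to_semantic(label, {1: 1, 2: 1}))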
According to an embodiment of the present disclosure, the second processing module 1340 includes a second processing unit.
The second processing unit is configured to set pixel values at positions in the predicted depth map of the target image other than the position of the target object to a preset pixel value, so as to obtain the predicted depth map of the target object.
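In the simplest reading, this operation is a mask applied to the predicted depth map using the target-object region identified from the predicted segmentation map. The sketch below works under that assumption and uses 0 as the preset pixel value, a choice the text leaves open.

    import numpy as np

    def extract_object_depth(pred_depth: np.ndarray, pred_seg: np.ndarray,
                             target_class: int, preset_value: float = 0.0) -> np.ndarray:
        # Keep depth values only at the position of the target object; every
        # other position is set to the preset pixel value (0.0 is an assumption).
        object_depth = np.full_like(pred_depth, preset_value)
        mask = pred_seg == target_class  # position of the target object
        object_depth[mask] = pred_depth[mask]
        return object_depth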
According to an embodiment of the present disclosure, the image segmentation processing includes semantic segmentation processing or instance segmentation processing.
根据本公开的实施例的模块、单元中的任意多个、或其中任意多个的至少部分功能可以在一个模块中实现。根据本公开实施例的模块、单元中的任意一个或多个可以被拆分成多个模块来实现。根据本公开实施例的模块、单元中的任意一个或多个可以至少被部分地实现为硬件电路,例如现场可编程门阵列(Field Programmable Gate Array,FPGA)、可编程逻辑阵列(Programmable Logic Arrays,PLA)、片上系统、基板上的系统、封装上的系统、专用集成电路(Application Specific Integrated Circuit,ASIC),或可以通过对电路进行集成或封装的任何其他的合理方式的硬件或固件来实现,或以软件、硬件以及固件三种实现方式中任意一种或以其中任意几种的适当组合来实现。或者,根据本公开实施例的模块、单元中的一个或多个可以至少被部分地实现为计算机程序模块,当该计算机程序模块被运行时,可以执行相应的功能。Any of the modules, units, or at least part of the functions of any of the modules according to the embodiments of the present disclosure may be implemented in one module. Any one or more of the modules and units according to the embodiments of the present disclosure may be divided into multiple modules for implementation. Any one or more of the modules and units according to the embodiments of the present disclosure may be at least partially implemented as hardware circuits, such as Field Programmable Gate Arrays (FPGA), Programmable Logic Arrays (Programmable Logic Arrays, PLA), system-on-chip, system-on-substrate, system-on-package, Application Specific Integrated Circuit (ASIC), or any other reasonable means of hardware or firmware that can integrate or package a circuit, Or it can be implemented in any one of the three implementation manners of software, hardware and firmware, or in an appropriate combination of any of them. Alternatively, one or more of the modules and units according to the embodiments of the present disclosure may be implemented at least in part as computer program modules, which, when executed, may perform corresponding functions.
例如,获取模块1310、第一处理模块1320、确定模块1330和第二处理模块1340中的任意多个可以合并在一个模块/单元中实现,或者其中的任意一个模块/单元可以被拆分成多个模块/单元。或者,这些模块/单元中的一个或多个模块/单元的至少部分功能可以与其他模块/单元的至少部分功能相结合,并在一个模块/单元中实现。根据本公开的实施例,获取模块1310、第一处理模块1320、确定模块1330和第二处理模块1340中的至少一个可以至少被部分地实现为硬件电路,例如现场可编程门阵列(FPGA)、可编程逻辑阵列(PLA)、片上系统、基板上的系统、封装上的系统、专用集成电路(ASIC),或可以通过对电路进行集成或封装的任何其他的合理方式等硬件或固件来实现,或以软件、硬件以及固件三种实现方式中任意一种或以其中任意几种的适当组合来实现。或者,获取模块1310、第一处理模块1320、确定模块1330和第二处理模块1340中的至少一个可以至少被部分地实现为计算机程序模块,当该计算机程序模块被运行时,可以执行相应的功能。For example, any one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be combined in one module/unit for implementation, or any one of the modules/units may be split into multiple modules/units. Alternatively, at least part of the functionality of one or more of these modules/units may be combined with at least part of the functionality of other modules/units and implemented in one module/unit. According to an embodiment of the present disclosure, at least one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be implemented at least partially as a hardware circuit, such as a Field Programmable Gate Array (FPGA), Programmable logic array (PLA), system-on-chip, system-on-substrate, system-on-package, application-specific integrated circuit (ASIC), or hardware or firmware that can be implemented by any other reasonable means of integrating or packaging circuits, Or it can be implemented in any one of the three implementation manners of software, hardware and firmware, or in an appropriate combination of any of them. Alternatively, at least one of the acquisition module 1310, the first processing module 1320, the determination module 1330, and the second processing module 1340 may be implemented at least partially as a computer program module, which, when executed, may perform corresponding functions .
It should be noted that the image processing apparatus part in the embodiments of the present disclosure corresponds to the image processing method part in the embodiments of the present disclosure; for details of the image processing apparatus part, reference may be made to the image processing method part, which is not repeated here.
FIG. 14 schematically shows a block diagram of an electronic device suitable for implementing the method described above according to an embodiment of the present disclosure. The electronic device shown in FIG. 14 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present disclosure.
As shown in FIG. 14, an electronic device 1400 according to an embodiment of the present disclosure includes a processor 1401, which can execute various appropriate actions and processes according to a program stored in a read-only memory (ROM) 1402 or a program loaded from a storage section 1408 into a random access memory (RAM) 1403. The processor 1401 may include, for example, a general-purpose microprocessor (such as a CPU), an instruction set processor and/or a related chipset, and/or a special-purpose microprocessor (such as an application-specific integrated circuit (ASIC)). The processor 1401 may also include on-board memory for caching purposes. The processor 1401 may include a single processing unit, or multiple processing units, for performing the different actions of the method flow according to the embodiments of the present disclosure.
Various programs and data necessary for the operation of the electronic device 1400 are stored in the RAM 1403. The processor 1401, the ROM 1402, and the RAM 1403 are connected to one another through a bus 1404. The processor 1401 performs the various operations of the method flow according to the embodiments of the present disclosure by executing programs in the ROM 1402 and/or the RAM 1403. Note that the programs may also be stored in one or more memories other than the ROM 1402 and the RAM 1403. The processor 1401 may also perform the various operations of the method flow according to the embodiments of the present disclosure by executing programs stored in the one or more memories.
According to an embodiment of the present disclosure, the electronic device 1400 may further include an input/output (I/O) interface 1405, which is also connected to the bus 1404. The electronic device 1400 may further include one or more of the following components connected to the I/O interface 1405: an input section 1406 including a keyboard, a mouse, and the like; an output section 1407 including a cathode ray tube (CRT), a liquid crystal display (LCD), and the like, as well as a speaker and the like; a storage section 1408 including a hard disk and the like; and a communication section 1409 including a network interface card such as a LAN card or a modem. The communication section 1409 performs communication processing via a network such as the Internet. A drive 1410 is also connected to the I/O interface 1405 as needed. A removable medium 1411, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 1410 as needed, so that a computer program read therefrom is installed into the storage section 1408 as needed.
According to an embodiment of the present disclosure, the method flow according to the embodiments of the present disclosure may be implemented as a computer software program. For example, an embodiment of the present disclosure includes a computer program product, which includes a computer program carried on a computer-readable storage medium, the computer program containing program code for performing the method illustrated in the flowchart. In such an embodiment, the computer program may be downloaded and installed from a network through the communication section 1409, and/or installed from the removable medium 1411. When the computer program is executed by the processor 1401, the above-described functions defined in the system of the embodiments of the present disclosure are performed. According to the embodiments of the present disclosure, the systems, devices, apparatuses, modules, units, and the like described above may be implemented by computer program modules.
The present disclosure further provides a computer-readable storage medium. The computer-readable storage medium may be included in the device/apparatus/system described in the above embodiments, or may exist alone without being assembled into the device/apparatus/system. The computer-readable storage medium carries one or more programs which, when executed, implement the method according to the embodiments of the present disclosure.
According to an embodiment of the present disclosure, the computer-readable storage medium may be a non-volatile computer-readable storage medium, which may include, but is not limited to: a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above. In the present disclosure, a computer-readable storage medium may be any tangible medium that contains or stores a program that can be used by or in combination with an instruction execution system, apparatus, or device.
For example, according to an embodiment of the present disclosure, the computer-readable storage medium may include the ROM 1402 and/or the RAM 1403 described above, and/or one or more memories other than the ROM 1402 and the RAM 1403.
An embodiment of the present disclosure further includes a computer program product, which includes a computer program containing program code for executing the method provided by the embodiments of the present disclosure; when the computer program product runs on an electronic device, the program code is used to cause the electronic device to implement the image processing method provided by the embodiments of the present disclosure.
When the computer program is executed by the processor 1401, the above-described functions defined in the system/apparatus of the embodiments of the present disclosure are performed. According to the embodiments of the present disclosure, the systems, apparatuses, modules, units, and the like described above may be implemented by computer program modules.
In one embodiment, the computer program may rely on a tangible storage medium such as an optical storage device or a magnetic storage device. In another embodiment, the computer program may also be transmitted and distributed in the form of a signal over a network medium, downloaded and installed through the communication section 1409, and/or installed from the removable medium 1411. The program code contained in the computer program may be transmitted using any appropriate network medium, including but not limited to wireless, wired, and the like, or any suitable combination of the above.
According to the embodiments of the present disclosure, the program code for executing the computer program provided by the embodiments of the present disclosure may be written in any combination of one or more programming languages; specifically, high-level procedural and/or object-oriented programming languages, and/or assembly/machine languages, may be used to implement these computing programs. Programming languages include, but are not limited to, Java, C++, Python, the "C" language, and similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, partly on a remote computing device, or entirely on a remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or may be connected to an external computing device (for example, through the Internet by using an Internet service provider).
The flowcharts and block diagrams in the accompanying drawings illustrate the architectures, functions, and operations of possible implementations of systems, methods, and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowcharts or block diagrams may represent a module, a program segment, or a portion of code, which contains one or more executable instructions for implementing the specified logical functions. It should also be noted that, in some alternative implementations, the functions noted in the blocks may occur in an order different from that noted in the figures. For example, two blocks shown in succession may in fact be executed substantially in parallel, or they may sometimes be executed in the reverse order, depending on the functions involved. It should also be noted that each block of the block diagrams or flowcharts, and combinations of blocks in the block diagrams or flowcharts, can be implemented by a dedicated hardware-based system that performs the specified functions or operations, or by a combination of dedicated hardware and computer instructions. Those skilled in the art will appreciate that the features recited in the various embodiments and/or claims of the present disclosure can be combined in various ways, even if such combinations are not expressly recited in the present disclosure. In particular, without departing from the spirit and teachings of the present disclosure, the features recited in the various embodiments and/or claims of the present disclosure can be combined in various ways. All such combinations fall within the scope of the present disclosure.
The embodiments of the present disclosure have been described above. However, these embodiments are for illustrative purposes only and are not intended to limit the scope of the present disclosure. Although the embodiments have been described separately above, this does not mean that the measures in the various embodiments cannot be used in combination to advantage. The scope of the present disclosure is defined by the appended claims and their equivalents. Without departing from the scope of the present disclosure, those skilled in the art can make various substitutions and modifications, all of which should fall within the scope of the present disclosure.

Claims (12)

1. An image processing method, comprising:
    acquiring a target image, wherein the target image includes a target object and a non-target object;
    performing image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively;
    determining the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and
    processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
2. The method according to claim 1, wherein performing image segmentation processing and depth estimation processing on the target image to obtain the predicted segmentation map and the predicted depth map of the target image, respectively, comprises:
    processing the target image by using an image processing model to obtain the predicted segmentation map and the predicted depth map of the target image, respectively, wherein the image processing model is obtained by training with training samples, and the training samples include a sample image and a depth label and a segmentation label of the sample image.
3. The method according to claim 2, wherein the image processing model includes a feature extraction network, an image segmentation network, and a depth estimation network; and
    processing the target image by using the image processing model to obtain the predicted segmentation map and the predicted depth map of the target image, respectively, comprises:
    processing the target image by using the feature extraction network to obtain a first intermediate feature map;
    processing the first intermediate feature map by using the image segmentation network to obtain a second intermediate feature map;
    processing the first intermediate feature map by using the depth estimation network to obtain a third intermediate feature map;
    generating a fourth intermediate feature map according to the second intermediate feature map and the third intermediate feature map;
    processing the fourth intermediate feature map by using the depth estimation network to obtain the predicted depth map of the target image; and
    processing the second intermediate feature map by using the image segmentation network to obtain the predicted segmentation map of the target image.
4. The method according to claim 2, wherein the image processing model being obtained by training with the training samples comprises:
    acquiring the training samples; and
    training a fully convolutional neural network model by using the training samples to obtain the image processing model.
5. The method according to claim 4, wherein the fully convolutional neural network model includes an initial feature extraction network, an initial image segmentation network, and an initial depth estimation network; and
    training the fully convolutional neural network model by using the training samples to obtain the image processing model comprises:
    processing the sample image by using the initial feature extraction network to obtain a fifth intermediate feature map;
    processing the fifth intermediate feature map by using the initial image segmentation network to obtain a sixth intermediate feature map;
    processing the fifth intermediate feature map by using the initial depth estimation network to obtain a seventh intermediate feature map;
    generating an eighth intermediate feature map according to the sixth intermediate feature map and the seventh intermediate feature map;
    processing the eighth intermediate feature map by using the initial depth estimation network to obtain a predicted depth map of the sample image;
    processing the sixth intermediate feature map by using the initial image segmentation network to obtain a predicted segmentation map of the sample image;
    inputting the depth label, the predicted depth map, the segmentation label, and the predicted segmentation map of the sample image into a loss function of the fully convolutional neural network model to obtain a loss result, and adjusting network parameters of the fully convolutional neural network model according to the loss result until the loss function converges; and
    using the trained fully convolutional neural network model as the image processing model.
6. The method according to claim 5, wherein performing image segmentation processing on the sample image to obtain the segmentation label of the sample image comprises:
    performing instance segmentation processing on the sample image to obtain an instance segmentation label of the sample image;
    obtaining a semantic segmentation label of the sample image according to the instance segmentation label of the sample image; and
    using the semantic segmentation label of the sample image as the segmentation label of the sample image.
7. The method according to claim 1, wherein processing the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain the predicted depth map of the target object comprises:
    setting pixel values at positions in the predicted depth map of the target image other than the position of the target object to a preset pixel value to obtain the predicted depth map of the target object.
8. The method according to any one of claims 1 to 7, wherein the image segmentation processing includes semantic segmentation processing or instance segmentation processing.
9. An image processing apparatus, comprising:
    an acquisition module configured to acquire a target image, wherein the target image includes a target object and a non-target object;
    a first processing module configured to perform image segmentation processing and depth estimation processing on the target image to obtain a predicted segmentation map and a predicted depth map of the target image, respectively;
    a determination module configured to determine the position of the target object in the predicted depth map of the target image according to the predicted segmentation map of the target object; and
    a second processing module configured to process the predicted depth map according to the position of the target object in the predicted depth map of the target image to obtain a predicted depth map of the target object.
10. An electronic device, comprising:
    one or more processors; and
    a memory configured to store one or more programs,
    wherein, when the one or more programs are executed by the one or more processors, the one or more processors are caused to implement the method according to any one of claims 1 to 8.
11. A computer-readable storage medium having executable instructions stored thereon which, when executed by a processor, cause the processor to implement the method according to any one of claims 1 to 8.
12. A computer program product comprising a computer program which, when executed by a processor, implements the method according to any one of claims 1 to 8.
PCT/CN2021/140683 2021-01-04 2021-12-23 Image processing method and apparatus, electronic device, medium, and computer program product WO2022143366A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110002321.5 2021-01-04
CN202110002321.5A CN113781493A (en) 2021-01-04 2021-01-04 Image processing method, image processing apparatus, electronic device, medium, and computer program product

Publications (1)

Publication Number Publication Date
WO2022143366A1

Family

ID=78835376

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/140683 WO2022143366A1 (en) 2021-01-04 2021-12-23 Image processing method and apparatus, electronic device, medium, and computer program product

Country Status (2)

Country Link
CN (1) CN113781493A (en)
WO (1) WO2022143366A1 (en)

Cited By (1)

Publication number Priority date Publication date Assignee Title
CN116597213A (en) * 2023-05-18 2023-08-15 北京百度网讯科技有限公司 Target detection method, training device, electronic equipment and storage medium

Families Citing this family (1)

Publication number Priority date Publication date Assignee Title
CN113781493A (en) * 2021-01-04 2021-12-10 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, medium, and computer program product

Citations (5)

Publication number Priority date Publication date Assignee Title
WO2012074361A1 (en) * 2010-12-03 2012-06-07 Mimos Berhad Method of image segmentation using intensity and depth information
CN109658413A (en) * 2018-12-12 2019-04-19 深圳前海达闼云端智能科技有限公司 A kind of method of robot target grasping body position detection
CN111311560A (en) * 2020-02-10 2020-06-19 中国铁道科学研究院集团有限公司基础设施检测研究所 Method and device for detecting state of steel rail fastener
CN111968129A (en) * 2020-07-15 2020-11-20 上海交通大学 Instant positioning and map construction system and method with semantic perception
CN113781493A (en) * 2021-01-04 2021-12-10 北京沃东天骏信息技术有限公司 Image processing method, image processing apparatus, electronic device, medium, and computer program product

Family Cites Families (8)

Publication number Priority date Publication date Assignee Title
JP2014238731A (en) * 2013-06-07 2014-12-18 株式会社ソニー・コンピュータエンタテインメント Image processor, image processing system, and image processing method
CN104346816B (en) * 2014-10-11 2017-04-19 京东方科技集团股份有限公司 Depth determining method and device and electronic equipment
US10019657B2 (en) * 2015-05-28 2018-07-10 Adobe Systems Incorporated Joint depth estimation and semantic segmentation from a single image
CN110969173B (en) * 2018-09-28 2023-10-24 杭州海康威视数字技术股份有限公司 Target classification method and device
US10846870B2 (en) * 2018-11-29 2020-11-24 Adobe Inc. Joint training technique for depth map generation
CN109785345A (en) * 2019-01-25 2019-05-21 中电健康云科技有限公司 Image partition method and device
CN110310229B (en) * 2019-06-28 2023-04-18 Oppo广东移动通信有限公司 Image processing method, image processing apparatus, terminal device, and readable storage medium
CN110782468B (en) * 2019-10-25 2023-04-07 北京达佳互联信息技术有限公司 Training method and device of image segmentation model and image segmentation method and device



Also Published As

Publication number Publication date
CN113781493A (en) 2021-12-10


Legal Events

121: EP - the EPO has been informed by WIPO that EP was designated in this application
    Ref document number: 21914103
    Country of ref document: EP
    Kind code of ref document: A1
NENP: Non-entry into the national phase
    Ref country code: DE
32PN: EP - public notification in the EP bulletin as the address of the addressee cannot be established
    Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.10.2023)