CN114972465A - Image target depth detection method and device, electronic equipment and storage medium - Google Patents

Image target depth detection method and device, electronic equipment and storage medium

Info

Publication number
CN114972465A
CN114972465A (application CN202210615889.9A)
Authority
CN
China
Prior art keywords
depth
characteristic
feature
image
feature map
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210615889.9A
Other languages
Chinese (zh)
Inventor
Wu Peng (武鹏)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiaomi Automobile Technology Co Ltd
Original Assignee
Xiaomi Automobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiaomi Automobile Technology Co Ltd filed Critical Xiaomi Automobile Technology Co Ltd
Priority to CN202210615889.9A
Publication of CN114972465A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/50 Depth or shape recovery
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 5/00 Image enhancement or restoration
    • G06T 5/50 Image enhancement or restoration by the use of more than one image, e.g. averaging, subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00 Image analysis
    • G06T 7/20 Analysis of motion
    • G06T 7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20076 Probabilistic image processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20088 Trinocular vision calculations; trifocal tensor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 2207/00 Indexing scheme for image analysis or image enhancement
    • G06T 2207/20 Special algorithmic details
    • G06T 2207/20212 Image combination
    • G06T 2207/20221 Image fusion; Image merging

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Image Analysis (AREA)

Abstract

The disclosure provides a method, an apparatus, a device and a storage medium for detecting image target depth, and relates to the technical field of automatic driving. The method comprises the following steps: acquiring first feature maps of a plurality of sizes of an image to be detected; extracting at least two first feature maps from the first feature maps of the plurality of sizes; weighting the feature values of the channels in each extracted first feature map based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map; performing feature fusion on the second feature maps to generate a third feature map; and performing depth detection on the third feature map to determine the target depth of the image to be detected. In this way, the fused third feature map expresses detail features more strongly, the accuracy of depth detection is improved, and the detection of the image target depth becomes more accurate and reliable.

Description

Image target depth detection method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of automatic driving technologies, and in particular, to a method and an apparatus for detecting an image target depth, an electronic device, and a storage medium.
Background
In automatic driving, monocular-vision-based target detection has gradually entered various detection scenes with the development of deep learning in recent years, but it still suffers from inaccurate perception of the target position caused by inaccurate estimation of the target depth.
Disclosure of Invention
The present disclosure is directed to solving, at least to some extent, one of the technical problems in the related art.
According to a first aspect of the present disclosure, a method for detecting an image target depth is provided, including:
acquiring first feature maps of a plurality of sizes of an image to be detected;
extracting at least two first feature maps from the first feature maps of the plurality of sizes;
weighting the feature values of the channels in each extracted first feature map based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map;
performing feature fusion on the second feature maps to generate a third feature map;
and performing depth detection on the third feature map to determine the target depth of the image to be detected.
According to a second aspect of the present disclosure, an apparatus for detecting an image target depth is provided, including:
the acquisition module is used for acquiring first feature maps of a plurality of sizes of an image to be detected;
the extraction module is used for extracting at least two first feature maps from the first feature maps of the plurality of sizes;
the first generation module is used for weighting the feature values of the channels in each extracted first feature map based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map;
the second generation module is used for performing feature fusion on the second feature maps to generate a third feature map;
and the determining module is used for performing depth detection on the third feature map to determine the target depth of the image to be detected.
An embodiment of a third aspect of the present disclosure provides a computer device, including: a memory, a processor, and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the method as set forth in the embodiment of the first aspect of the present disclosure.
A fourth aspect of the present disclosure is directed to a non-transitory computer-readable storage medium storing a computer program, which when executed by a processor implements the method as set forth in the first aspect of the present disclosure.
In a fifth aspect of the present disclosure, a vehicle is provided, which includes a computer device provided in the third aspect of the present disclosure.
The image target depth detection method, the image target depth detection device, the electronic equipment and the storage medium have the following beneficial effects:
in the embodiment of the disclosure, first feature maps of multiple sizes of an image to be detected are obtained, then at least two first feature maps are extracted from the first feature maps of multiple sizes, then feature values of each channel in each extracted first feature map are weighted based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map, then feature fusion is performed on each second feature map to generate a third feature map, and finally depth detection is performed on the third feature map to determine a target depth of the image to be detected. Therefore, the characteristic values of all channels of the characteristic diagrams with different sizes can be weighted through the channel attention mechanism, the characteristic information of effective characteristic channels can be more highlighted, and the invalid characteristic information is reduced, so that the characteristic expression degree of the second characteristic diagram is improved, in addition, the sizes of the second characteristic diagram are different due to the fact that the sizes of the first characteristic diagram are different, through the characteristic fusion of the second characteristic diagrams with different sizes, the expression capability of the detail characteristics of the fused third characteristic diagram can be stronger, the accuracy of depth detection is improved, and the detection of the image target depth is more accurate and reliable.
Additional aspects and advantages of the disclosure will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the disclosure.
Drawings
The foregoing and/or additional aspects and advantages of the present disclosure will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings of which:
fig. 1 is a schematic flowchart illustrating a method for detecting an image target depth according to an embodiment of the present disclosure;
fig. 2 is a schematic flowchart of a method for detecting an image target depth according to another embodiment of the present disclosure;
fig. 3 is a schematic flowchart of a method for detecting an image target depth according to another embodiment of the present disclosure;
fig. 4 is a schematic structural diagram of an apparatus for detecting depth of an image target according to an embodiment of the present disclosure;
fig. 5 is a block diagram of an electronic device for implementing the method for detecting the depth of an image target according to the embodiment of the present disclosure.
Detailed Description
Reference will now be made in detail to the embodiments of the present disclosure, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or elements having the same or similar functions throughout. The embodiments described below with reference to the drawings are exemplary and intended to be illustrative of the present disclosure, and should not be construed as limiting the present disclosure.
The method, apparatus, electronic device, and storage medium for detecting the depth of an image target according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
The method for detecting the depth of an image target provided by the present disclosure may be executed by the apparatus for detecting the depth of an image target provided by the present disclosure, and may also be executed by the electronic device provided by the present disclosure, where the electronic device may include, but is not limited to, a cloud device, a mobile device, a car server, and other hardware devices having various operating systems, touch screens, and/or display screens.
The detection method of the image target depth provided by the present disclosure is performed by the detection device of the image target depth provided by the present disclosure, which is hereinafter referred to as "device" for short.
Fig. 1 is a schematic flowchart of a method for detecting an image target depth according to an embodiment of the present disclosure.
As shown in fig. 1, the method for detecting the depth of the image target may include the following steps:
step 101, obtaining first characteristic maps of a plurality of sizes of an image to be detected.
The image to be detected is an image containing the target (object) to be detected. For example, in some vehicle automatic driving scenes, an image captured by a camera of the vehicle may be used as the image to be detected, and a vehicle, a pedestrian, or an obstacle may be used as the target to be detected, which is not limited herein. The image capturing device may be a monocular 3D camera, which is also not limited herein.
Alternatively, in scenes in which a terminal device captures and recognizes images, an image captured by a camera of the terminal device may be used as the image to be detected, and a face, scenery, or an animal may be used as the target to be detected.
The first feature map may be an image containing feature information obtained by performing feature recognition on the image to be detected, and may include features of each dimension of the object to be detected, such as a position coordinate, a heading angle, a depth of field, a size of the object, and the like, which are not limited herein. In the present disclosure, a plurality of first feature maps may be acquired, and the size of each first feature map may be different.
It should be noted that each first feature map has a corresponding size, namely the height, the width, and the number of channels of the first feature map.
For example, H, W, and C represent the height, width, and number of channels of the first feature map, respectively. Suppose 3 first feature maps are currently acquired, whose sizes are [H: 3, W: 4, C: 2], [H: 3, W: 4, C: 4] and [H: 3, W: 4, C: 6]; that is, the sizes of the acquired first feature maps may differ, which is not limited herein.
Optionally, the device may input the image to be detected to a pre-established feature pyramid neural network to obtain a first feature map output by a plurality of network layers of the feature pyramid neural network.
The feature pyramid network (FPN) is a convolutional neural network that can process the image to be detected as a plurality of images of different sizes, each computed independently.
It should be noted that the feature pyramid has a plurality of different network layers for predicting features, and thus the first feature map (feature map) output by each network layer is also different.
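As an illustrative sketch of step 101, the following PyTorch-style Python stands in for the pre-established feature pyramid neural network with a tiny hand-rolled backbone that outputs first feature maps at several sizes. The class name, layer widths, number of levels and strides are assumptions for illustration only, not the network defined by the disclosure.

```python
import torch
import torch.nn as nn

class TinyPyramidBackbone(nn.Module):
    def __init__(self, in_channels=3, widths=(32, 64, 128)):
        super().__init__()
        self.stages = nn.ModuleList()
        prev = in_channels
        for w in widths:
            # each stage halves the spatial size, so the outputs differ in scale
            self.stages.append(nn.Sequential(
                nn.Conv2d(prev, w, kernel_size=3, stride=2, padding=1),
                nn.BatchNorm2d(w),
                nn.ReLU(inplace=True),
            ))
            prev = w

    def forward(self, image):
        first_feature_maps = []
        x = image
        for stage in self.stages:
            x = stage(x)
            first_feature_maps.append(x)  # one "first feature map" per network layer
        return first_feature_maps

backbone = TinyPyramidBackbone()
image_to_detect = torch.randn(1, 3, 256, 256)      # dummy image to be detected
feature_maps = backbone(image_to_detect)           # e.g. 128x128, 64x64, 32x32 maps
print([tuple(f.shape) for f in feature_maps])
```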
Step 102, at least two first feature maps are extracted from the first feature maps with a plurality of sizes.
As a possible implementation manner, at least two first feature maps can be extracted from large to small based on the size order of the feature maps.
And 103, weighting the characteristic values of the channels in each extracted first characteristic diagram based on a channel attention mechanism to generate a second characteristic diagram corresponding to each extracted first characteristic diagram.
The second feature map may be a feature map obtained after the first feature map is weighted by the channel feature value.
Optionally, each extracted first feature map may be input in turn to an average pooling layer, a fully-connected layer, and a Sigmoid activation layer according to the channel attention mechanism to output the weight coefficients corresponding to each input first feature map, and the weight coefficients are then multiplied element-wise with the feature values of the channels in each extracted first feature map to generate the second feature map corresponding to each extracted first feature map.
It should be noted that the feature values of the channels may be the same or different, and the channels differ in importance. Weighting the first feature maps from different levels and different resolutions of the feature pyramid with the channel attention mechanism therefore yields better feature expression and improves the overall performance of the model.
As another possible implementation, lower network layers may be selected from the network layers of the feature pyramid network, and a network layer structure determined according to the channel attention mechanism may be added after at least two of these lower layers, so that convolution features (first feature maps) from different depths and different spatial resolutions are weighted, improving the representation capability of the model features. A minimal sketch of such a channel attention block is given below.
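The sketch below implements the weighting pipeline described above in PyTorch-style Python: global average pooling produces the 1x1xC descriptor, two fully-connected layers with a ReLU in between compute the weight coefficients, a Sigmoid limits them to (0, 1), and each channel of the first feature map is multiplied by its coefficient. The class name, the reduction ratio of 4, and the channel count are illustrative assumptions, not values fixed by the disclosure.

```python
import torch
import torch.nn as nn

class ChannelAttention(nn.Module):
    def __init__(self, channels, reduction=4):
        super().__init__()
        self.avg_pool = nn.AdaptiveAvgPool2d(1)          # HxWxC -> 1x1xC global descriptor
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),  # first layer reduces dimensionality
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),  # second layer restores C dimensions
            nn.Sigmoid(),                                # weight coefficients in (0, 1)
        )

    def forward(self, first_feature_map):
        b, c, _, _ = first_feature_map.shape
        weights = self.avg_pool(first_feature_map).view(b, c)
        weights = self.fc(weights).view(b, c, 1, 1)
        # element-wise multiplication: each channel's feature values are scaled by its weight
        return first_feature_map * weights               # the second feature map

attn = ChannelAttention(channels=64)
second_feature_map = attn(torch.randn(1, 64, 32, 32))
```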
And 104, performing feature fusion on each second feature map to generate a third feature map.
And the third characteristic diagram is a characteristic diagram obtained after the second characteristic diagrams are fused.
It should be noted that, since the sizes of the different second feature maps are different, the resolution and semantic information are also different. It can be understood that the second feature map with the smaller size has a larger receptive field in the convolutional neural network, and the semantic information representation capability is strong, while the geometric detail information representation capability of the second feature map with the larger size is strong, and the resolution is high. By carrying out feature fusion on the second feature maps with different sizes and different resolutions, the generated third feature map can contain more position and detail information, and good data support is provided for the subsequent depth detection.
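The disclosure does not fix the fusion operator for step 104, so the following sketch assumes one common choice: the smaller second feature maps are upsampled to the resolution of the largest one and the results are concatenated along the channel axis. The ordering assumption (largest map first) and the bilinear interpolation mode are illustrative.

```python
import torch
import torch.nn.functional as F

def fuse_feature_maps(second_feature_maps):
    # assume the first map is the largest (highest-resolution) and use it as the reference size
    target_size = second_feature_maps[0].shape[-2:]
    resized = [
        F.interpolate(fm, size=target_size, mode="bilinear", align_corners=False)
        for fm in second_feature_maps
    ]
    return torch.cat(resized, dim=1)  # third feature map with the combined channels

maps = [torch.randn(1, 32, 64, 64), torch.randn(1, 64, 32, 32)]
third_feature_map = fuse_feature_maps(maps)   # shape (1, 96, 64, 64)
```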
And 105, performing depth detection on the third characteristic map to determine the target depth of the image to be detected.
The target depth may be an estimated depth of the target to be detected, that is, a distance between the target to be detected and the shooting point.
It should be noted that distances between different pixel points corresponding to the object to be detected and the shooting point may be the same or different. It will be appreciated that by determining the depth of the target, the degree of perception of the target may be increased, as well as the accuracy of perception of the position of the target.
Optionally, the third feature map may be input into the trained depth prediction model to determine a target depth of the image to be detected. Alternatively, the third feature map may be input into another detection model to determine a target position, a target heading angle, a target size, a target type, and the like of the target to be detected, which is not limited herein.
It should be noted that, because the third feature map fuses feature information of the second feature maps with different scales, the third feature map can well improve accuracy and reliability of target detection, and is beneficial to obtaining accurate target depth.
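As a sketch of step 105, the third feature map can feed a depth head and, optionally, the other detection heads mentioned above. Only the per-pixel depth output is required by the method; the offset, size, heading and class heads, their layouts, and all channel counts are assumptions added for illustration.

```python
import torch
import torch.nn as nn

class DetectionHeads(nn.Module):
    def __init__(self, in_channels, num_depth_bins, num_classes):
        super().__init__()
        self.depth_head = nn.Conv2d(in_channels, num_depth_bins, kernel_size=1)  # per-pixel depth-bin logits
        self.offset_head = nn.Conv2d(in_channels, 2, kernel_size=1)              # target position offsets
        self.size_head = nn.Conv2d(in_channels, 3, kernel_size=1)                # target size
        self.heading_head = nn.Conv2d(in_channels, 1, kernel_size=1)             # target heading angle
        self.class_head = nn.Conv2d(in_channels, num_classes, kernel_size=1)     # target type

    def forward(self, third_feature_map):
        return {
            "depth_logits": self.depth_head(third_feature_map),
            "offset": self.offset_head(third_feature_map),
            "size": self.size_head(third_feature_map),
            "heading": self.heading_head(third_feature_map),
            "class_logits": self.class_head(third_feature_map),
        }

heads = DetectionHeads(in_channels=96, num_depth_bins=8, num_classes=3)
outputs = heads(torch.randn(1, 96, 64, 64))
```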
In the embodiment of the disclosure, first feature maps of multiple sizes of an image to be detected are obtained, then at least two first feature maps are extracted from the first feature maps of multiple sizes, then feature values of each channel in each extracted first feature map are weighted based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map, then feature fusion is performed on each second feature map to generate a third feature map, and finally depth detection is performed on the third feature map to determine a target depth of the image to be detected. Therefore, the characteristic values of all channels of the characteristic diagrams with different sizes are weighted through the channel attention mechanism, the characteristic information of effective characteristic channels can be more highlighted, the invalid characteristic information is reduced, the characteristic expression degree of the second characteristic diagram is improved, the sizes of the second characteristic diagram are different due to the fact that the first characteristic diagram is different in size, and through feature fusion of the characteristic diagrams with different sizes, the expression capability of the detailed characteristics of the fused third characteristic diagram can be stronger, the accuracy of depth detection is improved, detection of the image target depth is more accurate and reliable, and the perception degree of the target position is further improved.
Fig. 2 is a schematic flowchart of a method for detecting an image target depth according to another embodiment of the present disclosure.
As shown in fig. 2, the method for detecting the depth of the image target may include the following steps:
step 201, acquiring first characteristic maps of a plurality of sizes of an image to be detected.
At step 202, at least two first feature maps are extracted from the first feature maps with multiple sizes.
It should be noted that, for specific implementation manners of steps 201 and 202, reference may be made to the above embodiments, and details are not described herein.
Step 203, obtaining a feature vector corresponding to each feature channel in each extracted first feature map.
Specifically, the apparatus may first input each extracted first feature map into an average pooling layer, so as to convert the first feature map of size HxWxC into a 1x1xC feature map that represents the global information of the first feature map, and the 1x1xC feature map may then be fed into the fully-connected layers to obtain a C-dimensional feature vector.
H, W, and C may be the height, the width, and the number of channels of the first feature map, respectively, which is not limited herein.
Optionally, there may be two fully-connected layers: the first reduces the dimensionality of the feature vector, the result is activated by the ReLU activation function and sent to the second layer, and the second layer restores the feature vector to C dimensions.
It should be noted that the first feature map may include a plurality of feature channels, and thus the feature vector corresponding to each feature channel may also be different.
And 204, calculating a weight coefficient corresponding to each characteristic channel according to each characteristic vector.
As a possible implementation manner, the feature vector may be input into a sigmoid function (activation function) to compress the feature vector, so as to limit each element in the feature vector to a range from 0 to 1, and then a compressed vector value may be determined as a weight coefficient corresponding to a feature channel of the feature vector. It should be noted that the weighting coefficients of different feature channels may be the same or different.
The sigmoid function is an activation function, and can squeeze a large range of values into a small interval. It should be noted that, by calculating the weight of each feature channel, the importance of each feature channel can be measured, so that the weight coefficient corresponding to the important feature channel is larger.
And step 205, multiplying each weight coefficient by the characteristic value corresponding to each characteristic channel to generate a second characteristic map corresponding to each extracted first characteristic map.
The second feature map may be a feature map obtained after the first feature map is weighted by the feature channel.
For example, if there are 4 feature channels, and the corresponding feature values are x, y, z, and w, and the weight coefficients of the four feature channels are 0.2, 0.6, 0.15, and 0.15, each weight coefficient can be multiplied by the corresponding feature value of each feature channel, that is, 0.2x, 0.6y, 0.15z, and 0.15w can be obtained, which is not limited herein.
By multiplying the feature value of each feature channel by the corresponding weight coefficient, the feature information of the useful feature channel can be effectively highlighted, and the interference of invalid information can be suppressed, so that each feature of the obtained second feature map is more effective.
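A small numeric check of the example above, assuming a feature map with four channels (the values x, y, z, w at every location) and the weight coefficients 0.2, 0.6, 0.15 and 0.15; the weighting is a simple per-channel broadcast multiplication.

```python
import torch

first_feature_map = torch.randn(1, 4, 3, 3)                     # 4 channels: x, y, z, w at each location
weights = torch.tensor([0.2, 0.6, 0.15, 0.15]).view(1, 4, 1, 1)
second_feature_map = first_feature_map * weights                # each channel scaled by its weight coefficient
print(second_feature_map.shape)                                 # torch.Size([1, 4, 3, 3])
```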
And step 206, performing feature fusion on each second feature map to generate a third feature map.
And step 207, performing depth detection on the third feature map to determine the target depth of the image to be detected.
It should be noted that, for specific implementation manners of steps 206 and 207, reference may be made to the foregoing embodiments, which are not described herein again.
In the embodiment of the disclosure, first feature maps of multiple sizes of an image to be detected are obtained, then at least two first feature maps are extracted from the first feature maps of multiple sizes, then feature vectors corresponding to each feature channel in each extracted first feature map are obtained, then a weight coefficient corresponding to each feature channel is calculated according to each feature vector, then each weight coefficient is multiplied by a feature value corresponding to each feature channel to generate a second feature map corresponding to each extracted first feature map, then feature fusion is performed on each second feature map to generate a third feature map, and finally depth detection is performed on the third feature map to determine a target depth of the image to be detected. Therefore, by introducing a channel feature attention mechanism, useful feature channels can be highlighted, the weight of useless feature channels is reduced, and the obtained features are more effective.
Fig. 3 is a schematic flowchart of a method for detecting an image target depth according to another embodiment of the present disclosure.
As shown in fig. 3, the method for detecting the depth of the image target may include the following steps:
step 301, obtaining first feature maps of multiple sizes of an image to be detected.
Step 302, at least two first feature maps are extracted from the first feature maps with the plurality of sizes.
And step 303, weighting the feature values of the channels in each extracted first feature map based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map.
And step 304, performing feature fusion on each second feature map to generate a third feature map.
It should be noted that, for specific implementation manners of steps 301, 302, 303, and 304, reference may be made to the foregoing embodiments, which are not described herein again.
Step 305, obtaining a plurality of training images and a labeled depth value corresponding to each pixel point in each training image.
The training images may be images collected in advance from the same scene or from different scenes, and each scene may include one or more targets to be detected. The targets to be detected in the training images may be the same or different. The labeled depth value is used to indicate the actual depth of each pixel point, namely the ground truth (GT) depth.
The training image may be used to train the initial neural network, so that the initial neural network model may become a usable depth prediction model.
Step 306, predicting the probability that each pixel point falls into each preset depth range.
Specifically, a plurality of training images may be input into a pre-constructed initial neural network, so as to perform probability prediction on each pixel point falling into each preset depth range.
Optionally, the depth range perceived by the algorithm may be divided in advance into a plurality of equal depth ranges. For example, if the maximum depth perceived by the algorithm is m, the range may be divided evenly into m intervals, which is not limited herein.
It should be noted that the probabilities of different pixel points falling into the same depth range may be the same or different. And the probability that each pixel point falls into each depth range can be the same or different.
And 307, determining the sum of the products of the probability of each pixel point falling into each depth range and the depth value corresponding to the depth range as the predicted depth corresponding to each pixel point.
The predicted depth may be a depth estimation value for a pixel point.
For example, if the depth range is [2, 4], 3 may be determined as the depth value corresponding to the depth range.
Optionally, the predicted depth corresponding to each pixel point may be calculated by the following formula:
pred_d = Σ_{i=1}^{n} prob_i × depth_bin_i
where pred_d is the predicted depth, prob_i is the probability that the pixel point falls into the i-th depth range, n is the number of depth ranges, and depth_bin_i is the depth value corresponding to the i-th depth range.
For example, suppose there are 4 pixel points A, B, C and D, and the depth values corresponding to the depth ranges a, b, c and d are 2, 4, 6 and 8, respectively. The probabilities that A falls into the depth ranges a, b, c and d are 0.4, 0.3, 0.2 and 0.1; the probabilities for B are 0.1, 0.2, 0.3 and 0.4; the probabilities for C are 0.2, 0.3, 0.4 and 0.1; and the probabilities for D are 0.1, 0.4, 0.3 and 0.2. Then the predicted depth of pixel A is 0.4×2+0.3×4+0.2×6+0.1×8 = 4, the predicted depth of pixel B is 0.1×2+0.2×4+0.3×6+0.4×8 = 6, the predicted depth of pixel C is 0.2×2+0.3×4+0.4×6+0.1×8 = 4.8, and the predicted depth of pixel D is 0.1×2+0.4×4+0.3×6+0.2×8 = 5.2, which is not limited herein.
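The sketch below reproduces steps 306 and 307: per-bin probabilities are combined with each preset range's depth value to give the expected (predicted) depth. The use of softmax to turn raw network scores into probabilities is an assumption; the bin values 2, 4, 6, 8 and the probabilities for pixel A reproduce the worked example above.

```python
import torch

depth_bin_values = torch.tensor([2.0, 4.0, 6.0, 8.0])   # depth value of each preset depth range
bin_logits = torch.randn(4)                              # raw per-bin scores from the network (assumed)
probs = torch.softmax(bin_logits, dim=0)                 # probability of falling into each range
pred_depth = (probs * depth_bin_values).sum()            # sum of probability x bin depth value

# reproducing pixel A from the example above: 0.4x2 + 0.3x4 + 0.2x6 + 0.1x8 = 4
probs_a = torch.tensor([0.4, 0.3, 0.2, 0.1])
print((probs_a * depth_bin_values).sum())                # tensor(4.)
```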
And 308, correcting the initial neural network according to the difference between the prediction depth and the labeled depth value corresponding to each pixel point to generate a depth prediction model.
As a possible implementation manner, the difference between the predicted depth and the labeled depth corresponding to each pixel point may be calculated by the following formula:
L_depth = (1/m) × Σ_{j=1}^{m} |pred_d_j − GT_d_j| / k
and Ldepth is the difference between the predicted depth and the labeled depth value, m is the number of the pixel points in the second characteristic diagram, pred _ dm is the predicted depth value of the mth pixel point, and GT _ dm is the labeled depth value of the mth pixel point. Wherein k is a network uncertainty head estimation value, that is, a preset adjustment factor.
Specifically, the device may determine a correction gradient of the current initial neural network by a gradient descent method, that is, according to a difference between the predicted depth and the labeled depth corresponding to each pixel point, and then adjust each layer parameter of the initial neural network according to the correction gradient. And then, iteratively updating the initial neural network until the difference between the prediction depth and the labeled depth value corresponding to each pixel point is smaller than a preset threshold value, which indicates that the current initial neural network is an available model, i.e., the initial neural network can be used as a trained depth prediction model.
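The following is a hedged sketch of the correction in step 308. The exact form of the loss is not fully legible in the publication, so this example assumes a mean absolute difference between predicted and labeled depths scaled by the adjustment factor k, with the gradients used to update the network parameters; the labeled depth values are dummy numbers for illustration.

```python
import torch

def depth_loss(pred_depth, gt_depth, k=1.0):
    # mean over the pixel points of |pred_d - GT_d| / k (assumed form of L_depth)
    return (torch.abs(pred_depth - gt_depth) / k).mean()

pred = torch.tensor([4.0, 6.0, 4.8, 5.2], requires_grad=True)  # predicted depths of pixels A-D
gt = torch.tensor([4.2, 5.5, 5.0, 5.0])                        # dummy labeled depth values
loss = depth_loss(pred, gt)
loss.backward()   # gradient used to adjust the parameters of the initial neural network
```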
And 309, inputting the third feature map into the trained depth prediction model to obtain the target depth of the image to be detected.
It should be noted that, since the depth prediction model is a trained neural network model, after the third feature map is input into the depth prediction model, the depth of each pixel point corresponding to the third feature map, that is, the target depth corresponding to the image to be detected, may be calculated based on the depth prediction model.
In the embodiment of the disclosure, first feature maps of multiple sizes of an image to be detected are obtained, at least two first feature maps are extracted from the first feature maps of multiple sizes, the feature values of each channel in each extracted first feature map are weighted based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map, and the second feature maps are fused to generate a third feature map. A plurality of training images and the labeled depth value corresponding to each pixel point in each training image are obtained, the probability that each pixel point falls into each preset depth range is predicted, and the sum of the products of these probabilities and the depth values corresponding to the depth ranges is determined as the predicted depth of each pixel point. The initial neural network is then corrected according to the difference between the predicted depth and the labeled depth value of each pixel point to generate a depth prediction model, and the third feature map is input into the trained depth prediction model to obtain the target depth of the image to be detected. Therefore, by training the initial neural network, the predicted depth better matches the real depth, the learning difficulty of the neural network is reduced, and the precision of the depth prediction model is improved.
Fig. 4 is a schematic structural diagram of an apparatus for detecting image target depth according to an embodiment of the present disclosure.
As shown in fig. 4, the apparatus 400 for detecting the depth of the image target may include: an acquisition module 410, an extraction module 420, a first generation module 430, a second generation module 440, and a determination module 450.
An obtaining module 410, configured to obtain first feature maps of multiple sizes of an image to be detected;
an extracting module 420, configured to extract at least two first feature maps from the first feature maps of the plurality of sizes;
a first generating module 430, configured to weight feature values of each channel in each extracted first feature map based on a channel attention mechanism, so as to generate a second feature map corresponding to each extracted first feature map;
a second generating module 440, configured to perform feature fusion on each second feature map to generate a third feature map;
a determining module 450, configured to perform depth detection on the third feature map to determine a target depth of the image to be detected.
Optionally, the obtaining module is specifically configured to:
inputting an image to be detected into a pre-established characteristic pyramid neural network so as to obtain a first characteristic diagram output by a plurality of network layers of the characteristic pyramid neural network.
Optionally, the first generating module is specifically configured to:
Acquiring a feature vector corresponding to each feature channel in each extracted first feature map;
calculating a weight coefficient corresponding to each characteristic channel according to each characteristic vector;
and multiplying each weight coefficient by the characteristic value corresponding to each characteristic channel to generate a second characteristic map corresponding to each extracted first characteristic map.
Optionally, the determining module includes:
and the acquisition unit is used for inputting the third feature map into the trained depth prediction model so as to acquire the target depth of the image to be detected.
Optionally, the obtaining unit is further configured to:
acquiring a plurality of training images and a labeled depth value corresponding to each pixel point in each training image;
predicting the probability that each pixel point falls into each preset depth range;
determining the sum of the products of the probability of each pixel point falling into each depth range and the depth value corresponding to the depth interval as the predicted depth corresponding to each pixel point;
and correcting the initial neural network according to the difference between the predicted depth corresponding to each pixel point and the labeled depth value to generate a depth prediction model.
In the embodiment of the disclosure, first feature maps of multiple sizes of an image to be detected are obtained first, then at least two first feature maps are extracted from the first feature maps of multiple sizes, then feature values of each channel in each extracted first feature map are weighted based on a channel attention mechanism to generate a second feature map corresponding to each extracted first feature map, then feature fusion is performed on each second feature map to generate a third feature map, and finally depth detection is performed on the third feature map to determine a target depth of the image to be detected. Therefore, the characteristic values of all channels of the characteristic diagrams with different sizes are weighted through the channel attention mechanism, the characteristic information of effective characteristic channels can be more highlighted, the invalid characteristic information is reduced, the characteristic expression degree of the second characteristic diagram is improved, the sizes of the second characteristic diagram are different due to the fact that the first characteristic diagram is different in size, and through feature fusion of the characteristic diagrams with different sizes, the expression capability of the detailed characteristics of the fused third characteristic diagram can be stronger, the accuracy of depth detection is improved, and detection of the image target depth is more accurate and reliable.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 5 illustrates a schematic block diagram of an example electronic device 500 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 5, the device 500 comprises a computing unit 501 which may perform various suitable actions and processes according to a computer program stored in a read only memory 502 or a computer program loaded from a storage unit 508 into a random access memory 503. In the random access memory 503, various programs and data required for the operation of the device 500 can also be stored. The computing unit 501, the read only memory 502 and the random access memory 503 are connected to each other by a bus 504. An input/output interface 505 is also connected to bus 504.
A number of components in the device 500 are connected to the input/output interface 505, including: an input unit 506 such as a keyboard, a mouse, or the like; an output unit 507 such as various types of displays, speakers, and the like; a storage unit 508, such as a magnetic disk, optical disk, or the like; and a communication unit 509 such as a network card, modem, wireless communication transceiver, etc. The communication unit 509 allows the device 500 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 501 may be a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of the computing unit 501 include, but are not limited to, a central processing unit, a graphics processing unit, various dedicated artificial intelligence computing chips, various computing units running machine learning model algorithms, a digital signal processor, and any suitable processor, controller, microcontroller, or the like. The calculation unit 501 performs the respective methods and processes described above, such as a training method of a detection model of an image target depth. For example, in some embodiments, the training method of the detection model of image target depth may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as storage unit 508. In some embodiments, part or all of the computer program may be loaded and/or installed onto device 500 via read only memory 502 and/or communications unit 509. When the computer program is loaded into the random access memory 503 and executed by the calculation unit 501, one or more steps of the above described method of training a detection model of image target depths may be performed. Alternatively, in other embodiments, the calculation unit 501 may be configured by any other suitable means (e.g. by means of firmware) to perform a training method of the detection model of the image target depth.
Various implementations of the systems and techniques described here above may be realized in digital electronic circuitry, integrated circuitry, field programmable gate arrays, application specific integrated circuits, application specific standard products, systems on a chip, load programmable logic devices, computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special or general purpose, receiving data and instructions from, and transmitting data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a random access memory, a read-only memory, an erasable programmable read-only memory, an optical fiber, a portable compact disc read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a cathode ray tube (CRT) or liquid crystal display (LCD) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks, wide area networks, the internet, and blockchain networks.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or cloud host, which is a host product in a cloud computing service system that overcomes the defects of difficult management and weak service expansibility of traditional physical hosts and virtual private server (VPS) services. The server may also be a server of a distributed system, or a server incorporating a blockchain.
It should be understood that various forms of the flows shown above, reordering, adding or deleting steps, may be used. For example, the steps described in the present disclosure may be executed in parallel or sequentially or in different orders, and are not limited herein as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made, depending on design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (13)

1. A method for detecting the depth of an image target is characterized by comprising the following steps:
acquiring first characteristic maps of a plurality of sizes of an image to be detected;
extracting at least two first feature maps from the first feature maps with the plurality of sizes;
weighting the characteristic values of the channels in each extracted first characteristic diagram based on a channel attention mechanism to generate a second characteristic diagram corresponding to each extracted first characteristic diagram;
performing feature fusion on each second feature map to generate a third feature map;
and carrying out depth detection on the third characteristic map to determine the target depth of the image to be detected.
2. The method of claim 1, wherein the obtaining a first feature map of a plurality of sizes of an image to be detected comprises:
inputting an image to be detected into a pre-established characteristic pyramid neural network so as to obtain a first characteristic diagram output by a plurality of network layers of the characteristic pyramid neural network.
3. The method according to claim 1, wherein the weighting the feature values of the channels in each extracted first feature map based on the channel attention mechanism to generate the second feature map corresponding to each extracted first feature map comprises:
acquiring a feature vector corresponding to each feature channel in each extracted first feature map;
calculating a weight coefficient corresponding to each characteristic channel according to each characteristic vector;
and multiplying each weight coefficient by the characteristic value corresponding to each characteristic channel to generate a second characteristic map corresponding to each extracted first characteristic map.
4. The method according to claim 1, wherein the depth detection of the third feature map to determine the target depth of the image to be detected comprises:
and inputting the third feature map into a trained depth prediction model to obtain the target depth of the image to be detected.
5. The method of claim 4, further comprising, prior to said inputting the third feature map into the trained depth prediction model:
acquiring a plurality of training images and a labeled depth value corresponding to each pixel point in each training image;
predicting the probability that each pixel point falls into each preset depth range;
determining the sum of the products of the probability of each pixel point falling into each depth range and the depth value corresponding to the depth interval as the predicted depth corresponding to each pixel point;
and correcting the initial neural network according to the difference between the predicted depth corresponding to each pixel point and the labeled depth value to generate a depth prediction model.
6. An apparatus for detecting depth of an image object, comprising:
the acquisition module is used for acquiring first characteristic maps of a plurality of sizes of an image to be detected;
the extraction module is used for extracting at least two first feature maps from the first feature maps with the plurality of sizes;
the first generation module is used for weighting the characteristic values of all channels in each extracted first characteristic diagram based on a channel attention mechanism so as to generate a second characteristic diagram corresponding to each extracted first characteristic diagram;
the second generation module is used for carrying out feature fusion on each second feature map so as to generate a third feature map;
and the determining module is used for carrying out depth detection on the third characteristic map so as to determine the target depth of the image to be detected.
7. The apparatus of claim 6, wherein the obtaining module is specifically configured to:
inputting an image to be detected into a pre-established characteristic pyramid neural network so as to obtain a first characteristic diagram output by a plurality of network layers of the characteristic pyramid neural network.
8. The apparatus of claim 6, wherein the first generating module is specifically configured to:
Acquiring a feature vector corresponding to each feature channel in each extracted first feature map;
calculating a weight coefficient corresponding to each characteristic channel according to each characteristic vector;
and multiplying each weight coefficient by the characteristic value corresponding to each characteristic channel to generate a second characteristic map corresponding to each extracted first characteristic map.
9. The apparatus of claim 6, wherein the determining module comprises:
and the acquisition unit is used for inputting the third feature map into the trained depth prediction model so as to acquire the target depth of the image to be detected.
10. The apparatus of claim 9, wherein the obtaining unit is further configured to:
acquiring a plurality of training images and a labeled depth value corresponding to each pixel point in each training image;
predicting the probability that each pixel point falls into each preset depth range;
determining the sum of the products of the probability of each pixel point falling into each depth range and the depth value corresponding to the depth interval as the predicted depth corresponding to each pixel point;
and correcting the initial neural network according to the difference between the predicted depth corresponding to each pixel point and the labeled depth value to generate a depth prediction model.
11. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5.
12. A non-transitory computer readable storage medium having stored thereon computer instructions for causing the computer to perform the method of any one of claims 1-5.
13. A vehicle characterized in that it comprises an electronic device according to claim 11.
CN202210615889.9A 2022-05-31 2022-05-31 Image target depth detection method and device, electronic equipment and storage medium Pending CN114972465A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210615889.9A CN114972465A (en) 2022-05-31 2022-05-31 Image target depth detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210615889.9A CN114972465A (en) 2022-05-31 2022-05-31 Image target depth detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN114972465A true CN114972465A (en) 2022-08-30

Family

ID=82959547

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210615889.9A Pending CN114972465A (en) 2022-05-31 2022-05-31 Image target depth detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN114972465A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117420917A (en) * 2023-12-19 2024-01-19 Yantai University Virtual reality control method, system, equipment and medium based on hand skeleton
CN117420917B (en) * 2023-12-19 2024-03-08 Yantai University Virtual reality control method, system, equipment and medium based on hand skeleton

Similar Documents

Publication Publication Date Title
CN114186632B (en) Method, device, equipment and storage medium for training key point detection model
CN112560684B (en) Lane line detection method, lane line detection device, electronic equipment, storage medium and vehicle
CN111062263B (en) Method, apparatus, computer apparatus and storage medium for hand gesture estimation
CN113971751A (en) Training feature extraction model, and method and device for detecting similar images
CN112528974B (en) Distance measuring method and device, electronic equipment and readable storage medium
CN113420682A (en) Target detection method and device in vehicle-road cooperation and road side equipment
US20220172376A1 (en) Target Tracking Method and Device, and Electronic Apparatus
JP2022540101A (en) POSITIONING METHOD AND APPARATUS, ELECTRONIC DEVICE, COMPUTER-READABLE STORAGE MEDIUM
CN114419519B (en) Target object detection method and device, electronic equipment and storage medium
CN116188893A (en) Image detection model training and target detection method and device based on BEV
CN114972465A (en) Image target depth detection method and device, electronic equipment and storage medium
CN114169425A (en) Training target tracking model and target tracking method and device
CN113627298A (en) Training method of target detection model and method and device for detecting target object
CN111553342A (en) Visual positioning method and device, computer equipment and storage medium
CN113721240B (en) Target association method, device, electronic equipment and storage medium
CN115937950A (en) Multi-angle face data acquisition method, device, equipment and storage medium
CN115330851A (en) Monocular depth estimation method and device, electronic equipment, storage medium and vehicle
CN113610856A (en) Method and device for training image segmentation model and image segmentation
CN114417946A (en) Target detection method and device
CN113570607B (en) Target segmentation method and device and electronic equipment
CN117523428B (en) Ground target detection method and device based on aircraft platform
CN115049895B (en) Image attribute identification method, attribute identification model training method and device
CN114495042B (en) Target detection method and device
CN113312979B (en) Image processing method and device, electronic equipment, road side equipment and cloud control platform
CN115431968B (en) Vehicle controller, vehicle and vehicle control method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination