WO2023245321A1 - Image depth prediction method and apparatus, device, and storage medium - Google Patents


Info

Publication number
WO2023245321A1
WO2023245321A1 (PCT/CN2022/099713)
Authority
WO
WIPO (PCT)
Prior art keywords
depth
loss function
training
image
gradient
Prior art date
Application number
PCT/CN2022/099713
Other languages
French (fr)
Chinese (zh)
Inventor
倪鹏程 (Ni Pengcheng)
张亚森 (Zhang Yasen)
苏海军 (Su Haijun)
陈凌颖 (Chen Lingying)
Original Assignee
北京小米移动软件有限公司 (Beijing Xiaomi Mobile Software Co., Ltd.)
北京小米松果电子有限公司 (Beijing Xiaomi Pinecone Electronics Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Mobile Software Co., Ltd. (北京小米移动软件有限公司) and Beijing Xiaomi Pinecone Electronics Co., Ltd. (北京小米松果电子有限公司)
Priority to PCT/CN2022/099713 priority Critical patent/WO2023245321A1/en
Priority to CN202280004623.9A priority patent/CN117616457A/en
Publication of WO2023245321A1 publication Critical patent/WO2023245321A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00Image analysis
    • G06T7/50Depth or shape recovery

Definitions

  • the present disclosure relates to the field of computer vision technology, and in particular, to an image depth prediction method, device, equipment and storage medium.
  • the terminal device often needs to estimate the depth of an image in order to complete subsequent tasks. For example, when people take photos they often want the background blur (bokeh) that a dedicated camera produces, but a mobile phone lens cannot match a camera lens; the phone therefore needs to estimate the depth of the captured image so that areas of different depth in the image can be blurred accordingly. Similarly, a vehicle-mounted device can achieve autonomous-driving situation awareness by acquiring images and performing depth estimation.
  • the depth estimation schemes commonly used rely on networks with a large number of parameters and a heavy computational load.
  • such solutions are often deployed on large-scale computing equipment, such as large service devices or service clusters. Obviously, such solutions cannot be adapted to terminal devices.
  • the present disclosure provides an image depth prediction method, device, equipment and storage medium.
  • an image depth prediction method includes: acquiring an image to be processed; and inputting the image to be processed into a deep network model to predict the image depth of the image to be processed, the deep network model being composed of multiple layers of depthwise separable convolutions.
  • the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter; the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth.
  • the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth.
  • the depth structure loss parameter is used to characterize the difference between the label depths corresponding to different positions in the training image; the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
  • the target loss function is determined in the following manner: the target loss function is determined based on at least one of a first loss function, a second loss function, and a third loss function, where the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  • the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; and based on the absolute value of the error and the error weight parameter, the first loss function is determined.
  • the depth gradient error is determined in the following manner: estimated depth gradients in at least two directions are determined based on the estimated depth and a preset gradient function; label depth gradients in the same at least two directions are determined based on the label depth and the gradient function; and depth gradient errors in the at least two directions are determined from the estimated depth gradients and the label depth gradients in those directions. The second loss function is determined based on the depth gradient errors in the at least two directions.
  • the depth structure loss parameter is determined in the following manner: multiple training data groups are determined, where each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes training labels corresponding to the at least two pixels in the training image; for each training data group, the depth structure loss parameter is determined based on the label depths corresponding to the at least two pixels. The third loss function is determined in the following manner: for each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
  • determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; sampling the training image according to the gradient boundary to determine multiple training data groups.
  • the training images are determined in the following manner: at least one training image set in any scene is determined, where the images in the training image set are images that have been mask-processed for a set area; the training image is then determined from the at least one training image set.
  • an image depth prediction device includes: an acquisition module for acquiring an image to be processed; and a prediction module for inputting the image to be processed into a deep network model and predicting the image depth of the image to be processed, the deep network model being composed of multiple layers of depthwise separable convolutions.
  • the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter; the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference between the label depths corresponding to different positions in the training image.
  • the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
  • the target loss function is determined in the following manner: the target loss function is determined based on at least one of a first loss function, a second loss function, and a third loss function, where the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  • the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; and based on the absolute value of the error and the error weight parameter, the first loss function is determined.
  • the depth gradient error is determined in the following manner: estimated depth gradients in at least two directions are determined based on the estimated depth and a preset gradient function; label depth gradients in the same at least two directions are determined based on the label depth and the gradient function; and depth gradient errors in the at least two directions are determined from the estimated depth gradients and the label depth gradients in those directions. The second loss function is determined based on the depth gradient errors in the at least two directions.
  • the depth structure loss parameter is determined in the following manner: multiple training data groups are determined, where each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes training labels corresponding to the at least two pixels in the training image; for each training data group, the depth structure loss parameter is determined based on the label depths corresponding to the at least two pixels. The third loss function is determined in the following manner: for each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
  • determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; sampling the training image according to the gradient boundary to determine multiple training data groups.
  • the training images are determined in the following manner: at least one training image set in any scene is determined, where the images in the training image set are images that have been mask-processed for a set area; the training image is then determined from the at least one training image set.
  • an image depth prediction device includes: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to perform the method described in the first aspect or any embodiment of the first aspect.
  • a non-transitory computer-readable storage medium is provided, wherein when instructions in the storage medium are executed by a processor of a computer, the computer is enabled to execute the method described in the first aspect or any embodiment of the first aspect.
  • the adopted deep network model uses depthwise separable convolutions, thereby realizing the deployment of the deep network model on the terminal device, effectively saving running time and avoiding the situation where a large-scale model runs too slowly on the terminal device and is therefore unsuitable.
  • Figure 1 is a schematic diagram of a scene according to an exemplary embodiment.
  • Figure 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment.
  • Figure 3 is a schematic diagram of a fully supervised depth estimation process according to an exemplary embodiment.
  • Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment.
  • Figure 5 is a schematic structural diagram of a semantic perception unit according to an exemplary embodiment.
  • Figure 6 is a flow chart of an image depth prediction method according to an exemplary embodiment.
  • Figure 7 is a flow chart of a deep network model training method according to an exemplary embodiment.
  • Figure 8 is a flow chart of a method for determining a target loss function according to an exemplary embodiment.
  • Figure 9 is a flow chart of a method for determining a first loss function according to an exemplary embodiment.
  • Figure 10 is a flow chart of a method for determining a second loss function according to an exemplary embodiment.
  • Figure 11 is a flow chart of a method for determining a third loss function according to an exemplary embodiment.
  • Figure 12 is a schematic diagram of a depth map according to an exemplary embodiment.
  • Figure 13 is a schematic diagram of depth structure sampling according to an exemplary embodiment.
  • Figure 14 is a schematic diagram of random sampling according to an exemplary embodiment.
  • Figure 15 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.
  • Figure 16 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.
  • Figure 17 is a schematic diagram of a deep network model training device according to an exemplary embodiment.
  • the methods involved in the present disclosure can be applied to scenes of depth prediction and estimation of images.
  • users taking selfies with their mobile phones often hope that the person in the image remains sharp while the background is appropriately blurred.
  • what the user wants is for areas of different depth in the image to be processed accordingly: areas with smaller depth, that is, the person area, can remain sharp, while areas with greater depth, such as the background area, can be blurred accordingly.
  • the clarity of the image at the shooting location is related to the unique design of the lens.
  • the focal plane of the captured image is sharp, while regions away from the focal plane are blurred.
  • the blurring effect of images captured by mobile phone lenses cannot achieve the blurring effect of images captured by cameras.
  • some mobile phones can perform depth prediction on different areas in the image and layer the image based on the predicted depth information as guidance information.
  • the mobile phone can use different blur convolution kernels to blur different image layers, and finally fuse the blur results of different layers to obtain the final blurred image.
  • FIG. 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment. For example, a user takes a selfie with a mobile phone in a complex environment. The phone needs to perform depth prediction on the captured image (i.e., the leftmost image in Figure 2), and the image can then be layered based on the predicted depth information. For different layers, different blur convolution kernels can be used for blurring, and finally the blurring results of the different layers are fused to obtain the final blurred image (i.e., the rightmost image in Figure 2).
  • depth prediction solutions based on monocular cameras can use self-supervised depth estimation. It can be understood that depth prediction can also be called depth estimation.
  • a depth estimation network and a relative pose estimation network can be constructed, with continuous video frames used as network input. During the training process, the depth map and relative pose relationship of these frames can be estimated through calculation. Then, the mutual mapping relationship between the 3D (three-dimensional) and 2D scenes can be used to minimize the photometric reconstruction error, thereby optimizing the depth map and relative pose relationship.
  • only the trained depth estimation network is used to predict the depth of the input video and obtain the corresponding depth map.
  • monocular depth (MonoDepth) can be used as a basis to develop prediction of depth information from video sequences.
  • fully supervised depth estimation can be adopted, such as building a deep network that uses paired color pictures and depth pictures (i.e. labels) as input.
  • the trained deep network is obtained. For example, taking Big to Small (BTS) as a representative, input paired color images and depth images (i.e. labels) to complete the network training task.
  • the method can be to input a color picture, and obtain the depth estimation result through the feature encoder, feature aggregation network, and feature decoder.
  • the depth estimation result will be compared with the real data (ground truth, GT) depth to calculate the structural similarity index (SSIM) loss, and the parameters will be updated.
  • a color picture can be input, and then the feature map F of the input picture can be obtained after convolution.
  • based on this feature map F, convolutions at different scales can be performed through operations such as a full-image encoder, convolution, and atrous spatial pyramid pooling (ASPP) to obtain depth features of different dimensions.
  • the depth features obtained by convolution of different dimensions can be superimposed and convolved again, such as convolution x in Figure 3.
  • the feature y obtained by convolution x can be used to determine the depth map of each different level using ordinal regression.
  • the depth map of different depth intervals can be determined.
  • the depth interval can be divided into l_0, l_1, ..., l_(k-1), and so on.
  • monocular cameras and structured light can also be used.
  • structured light can be, for example, laser radar, millimeter wave radar, etc.
  • the monocular camera and structured light can be fixed using a rigid fixation method. Then, the monocular camera and structured light can be calibrated in the early stage, and then the depth result of the corresponding pixel position can be obtained through the triangulation ranging principle.
  • the present disclosure provides an image depth prediction method that uses a deep network model composed of multi-layer depth-separable convolutions to perform depth prediction on input images.
  • the deep network model is trained using a target loss function, which is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter.
  • the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth.
  • the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth.
  • the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image.
  • the estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
  • the deep network model used in this disclosure uses depthwise separable convolutions, thereby realizing the deployment of the deep network model on the terminal device, effectively saving running time and avoiding the situation where large-scale models run too slowly on the terminal device and are therefore unsuitable.
  • Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment.
  • the structure of the deep network model may include at least one encoder (encoder), a semantic perception unit and at least one decoder (decoder).
  • the network structure can also be a trained deep network model.
  • it can also be an untrained deep network model.
  • the structure of an untrained deep network model is the same as that of a trained deep network model, and training does not change the model structure.
  • the semantic perception unit can include multiple layers of depthwise separable convolutions. Therefore, compared with the ASPP in the related art, its computing efficiency is higher and the network structure has less redundancy.
  • the semantic perception unit of the present disclosure can also be considered as a more efficient ASPP, for example, it can be called efficient atrous spatial pyramid pooling (EASPP).
  • the trained deep network model can be called a deep network model
  • the untrained deep network model can be called an initial deep model
  • the encoder and decoder can be composed of depthwise separable convolutions.
  • EASPP can also include multiple layers of depthwise separable convolutions.
  • the number of encoders and decoders can be the same and correspond one to one. For example, in some cases there may be 5 encoders and therefore 5 decoders.
  • encoder 0 corresponds to decoder 0
  • encoder 1 corresponds to decoder 1
  • encoder 2 corresponds to decoder 2
  • encoder 3 corresponds to decoder 3
  • encoder 4 corresponds to decoder 4.
  • depth-separable convolutions may not be included for the first encoder and the first encoder's corresponding decoder, while the remaining encoders and corresponding decoders may include depth-separable convolutions.
  • the image to be processed can be input into a deep network model, for example, the image to be processed can be input into at least one encoder to extract depth feature data of the image to be processed.
  • a deep network model may include 5 encoders and 5 decoders.
  • the number of encoders and decoders can be more or less according to the actual situation, and can be set arbitrarily according to the actual situation, which is not limited by this disclosure. It can be understood that when the number of encoders and decoders is more than 5, it may reduce the model running speed and increase the size of the model.
  • each input may be a single image to be processed.
  • the image to be processed may be a color image, for example.
  • the training image can be input into encoder 0.
  • Encoder 0 can include two convolutional layers.
  • the convolution kernel size of each convolutional layer can be 3*3.
  • depth features with a channel dimension of 16 can be output.
  • the depth feature data output by encoder 0 can be input to the next encoder, that is, encoder 1.
  • encoder 1 the depth feature data output by encoder 0 can also be input to the corresponding decoder 0.
  • the first encoder may not have depth-separable convolutions, while the remaining encoders have depth-separable convolutions.
  • all encoders may have depth-separable convolutions, and the details may be adjusted according to actual conditions, and are not specifically limited in this disclosure.
  • downsampling layers and depthwise separable convolutional layers can be included.
  • the downsampling layer can use 2D max pooling (MaxPool2D) for downsampling.
  • here, 2D indicates that pooling is performed over the two spatial dimensions of the image, namely width and height.
  • for example, pooling can be performed with a 2×2 window.
  • Encoder 1 takes the output of Encoder 0 as input, and after MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 32. It can be seen that the depth feature data output by encoder 1 can be input to the next encoder, that is, encoder 2.
  • Encoder 2 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling.
  • Encoder 2 takes the output of Encoder 1 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 64. It can be seen that the depth feature data output by encoder 2 can be input to the next encoder, that is, encoder 3; it can also be input to the corresponding decoder 2.
  • Encoder 3 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling. Encoder 3 takes the output of Encoder 2 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 128. It can be seen that the depth feature data output by encoder 3 can be input to the next encoder, that is, encoder 4; it can also be input to the corresponding decoder 3.
  • Encoder 4 may include a downsampling layer and a depth-separable convolutional layer, and the downsampling layer may use MaxPool2D for downsampling.
  • Encoder 4 takes the output of Encoder 3 as input. After MaxPool2D downsampling and depth-separable convolutional layers for feature extraction, it can output depth features with a channel dimension of 192. It can be seen that the depth feature data output by the encoder 4 can be input to the semantic perception unit, that is, EASPP; it can also be input to the corresponding decoder 4. It can be understood that the channel dimensions of the depth feature data output by each encoder can be obtained by dimensionally upgrading the traditional convolutional layer included in each encoder. The specific number of dimensions can be adjusted arbitrarily according to the actual situation, and is not limited in this disclosure.
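The encoder stages described above can be summarized with a short sketch. The following PyTorch-style code is only an illustrative reading of the text, not the patented implementation: it shows MaxPool2D downsampling followed by a depthwise separable convolution (a 3×3 depthwise convolution with groups equal to the channel count, then a 1×1 pointwise convolution); the activation choice and bias settings are assumptions.

```python
# Hedged sketch of one encoder stage: MaxPool2D downsampling + depthwise separable
# convolution. Channel sizes (16 -> 32 -> 64 -> 128 -> 192) follow the description.
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, kernel_size=1, bias=False)
        self.act = nn.ReLU(inplace=True)  # activation is an assumption

    def forward(self, x):
        return self.act(self.pointwise(self.depthwise(x)))

class EncoderStage(nn.Module):
    """For example, encoder 1 maps 16-channel features to 32 channels."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.down = nn.MaxPool2d(kernel_size=2)           # 2x2 pooling window
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)

    def forward(self, x):
        return self.conv(self.down(x))
```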
  • the depth feature data output by the last encoder will be input to the semantic perception unit.
  • the semantic perception unit can perform semantic extraction based on the output of the last encoder and output deep semantic data containing rich semantic information.
  • a possible semantic perception unit structure is shown. It can be seen that the semantic perception unit can include at least one depthwise separable convolution and a fusion layer. Each depthwise separable convolution has an expansion (dilation) coefficient. The expansion coefficients corresponding to different depthwise separable convolutions in the semantic perception unit can be the same or different, and the expansion coefficient of each depthwise separable convolution can be adjusted arbitrarily according to the actual situation, which is not limited by this disclosure.
  • depth-separable convolution 1 takes the depth feature data output by the last encoder (ie, encoder 4 in Figure 4) as input.
  • the expansion coefficient of depth-separable convolution 1 can be 3.
  • after depthwise separable convolution 1 performs semantic extraction on the depth feature data output by the last encoder, the obtained depth semantic data is passed to the next depthwise separable convolution, namely depthwise separable convolution 2, and can also be passed to the fusion layer.
  • Depthwise separable convolution 2 takes as input the output of depthwise separable convolution 1 fused with the depth feature data output by the last encoder.
  • the expansion coefficient of depthwise separable convolution 2 can be 6.
  • after depthwise separable convolution 2 performs semantic extraction, the obtained depth semantic data is passed to the next depthwise separable convolution, namely depthwise separable convolution 3, and can also be passed to the fusion layer.
  • Depthwise separable convolution 3 takes both the output of depthwise separable convolution 2 and the input of depthwise separable convolution 2 as its input.
  • the expansion coefficient of depthwise separable convolution 3 can be 12. After depthwise separable convolution 3 performs semantic extraction on the input data, the obtained depth semantic data is passed to the next depthwise separable convolution, namely depthwise separable convolution 4, and can also be passed to the fusion layer.
  • Depthwise separable convolution 4 and depthwise separable convolution 5 are similar to depthwise separable convolution 3.
  • Depthwise separable convolution 4 takes both the output of depthwise separable convolution 3 and the input of depthwise separable convolution 2 as its input, and its expansion coefficient can be 18.
  • after depthwise separable convolution 4 performs semantic extraction on the input data, the obtained depth semantic data is passed to the next depthwise separable convolution, namely depthwise separable convolution 5, and can also be passed to the fusion layer.
  • Depthwise separable convolution 5 takes both the output of depthwise separable convolution 4 and the input of depthwise separable convolution 2 as its input. The expansion coefficient of depthwise separable convolution 5 can be 24, and after depthwise separable convolution 5 performs semantic extraction on the data, the obtained depth semantic data is input to the fusion layer.
  • the fusion layer fuses the output of each depthwise separable convolution with the depth feature data output by the last encoder (i.e., encoder 4 in Figure 4) to determine the final depth semantic data, which serves as the output of the semantic perception unit.
  • the semantic perception unit inputs the determined final depth semantic data into the decoder corresponding to the last encoder, which may be, for example, decoder 4 in Figure 4. It can be understood that the input of every depthwise separable convolution in the semantic perception unit other than the first one includes not only the output of the previous depthwise separable convolution but also the output of the first depthwise separable convolution or the output of the last encoder, which ensures that richer semantic information can be extracted.
  • the fusion layer in Figure 5 can represent a concat operation along the channel dimension of the data output by each depthwise separable convolution. It can be understood that the concat is performed on the same channel dimension; that is to say, the data processing in the semantic perception unit does not change the output channel dimension.
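Read together, the above paragraphs suggest the following rough sketch of the semantic perception unit (EASPP). It reuses the DepthwiseSeparableConv class from the encoder sketch above; the dilation coefficients 3, 6, 12, 18 and 24 follow the example in the text, while modeling the per-branch "fusion" of inputs as an element-wise addition and the final fusion layer as channel concatenation followed by a 1×1 projection are assumptions.

```python
# Hedged sketch of the semantic perception unit (EASPP); not the patented implementation.
import torch
import torch.nn as nn

class EASPP(nn.Module):
    def __init__(self, channels, dilations=(3, 6, 12, 18, 24)):
        super().__init__()
        # one dilated depthwise separable convolution per expansion coefficient
        self.convs = nn.ModuleList(
            [DepthwiseSeparableConv(channels, channels, dilation=d) for d in dilations]
        )
        # fusion layer: concat of encoder feature + all branch outputs, projected back
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, kernel_size=1)

    def forward(self, feat):
        outs = [feat]
        y1 = self.convs[0](feat)      # convolution 1 sees the encoder output
        outs.append(y1)
        conv2_in = y1 + feat          # "fusion" of conv 1 output with encoder output (assumed as addition)
        y = self.convs[1](conv2_in)   # convolution 2
        outs.append(y)
        for conv in self.convs[2:]:   # convolutions 3-5 also see convolution 2's input
            y = conv(y + conv2_in)
            outs.append(y)
        return self.fuse(torch.cat(outs, dim=1))
```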
  • Decoder 0 decodes based on the output of decoder 1 and the output of encoder 0, and outputs the depth estimation result of the input training image, that is, the depth estimation data corresponding to the training image.
  • the depth estimation data may be a depth map with a channel dimension of 1.
  • depthwise separable convolution can also be used in the decoder, corresponding to the encoder. Therefore, in some examples, the last decoder (i.e., the decoder corresponding to the first encoder) may also not contain depthwise separable convolutions.
  • the structure of the decoder can be similar to that of the encoder and can be used to perform the reverse operation of the encoder.
  • the last decoder in the deep network model (ie, decoder 0 in Figure 4) can output the image depth of the image to be trained.
  • for the deep network model in the training stage, that is, the initial depth model, the input can be training images, from which the estimated depth is obtained.
  • the estimated depth represents the image depth of the training image output after the training image passes through the initial depth model.
  • each training image can correspond to a label depth, and the label depth is used to represent the real image depth of the corresponding training image.
  • the initial depth model can calculate the loss function based on the estimated depth and label depth, and adjust each parameter in the initial depth model based on the loss function.
  • for example, stochastic gradient descent (SGD) can be used to adjust the parameters.
  • the trained deep network model can be obtained. It can be understood that the data processing process in the training phase is similar to that in the application phase. Therefore, for specific data processing in the training phase, reference can be made to the corresponding description of data processing in the application phase, which will not be described in detail in this disclosure.
  • the deep network model in this application uses depth-separable convolution, so it can adapt to the scene of the terminal device and be perfectly deployed on the terminal device, so that the terminal device can perform depth estimation of the image based on the trained deep network model.
  • FIG. 6 is a flow chart of an image depth prediction method according to an exemplary embodiment. As shown in Figure 6, this method can be run on the terminal device.
  • terminal equipment can also be called terminal, user equipment (User Equipment, UE), mobile station (Mobile Station, MS), mobile terminal (Mobile Terminal, MT), etc.
  • the terminal device can be a handheld device with wireless connection function, a vehicle-mounted device, etc.
  • some examples of terminal devices include smartphones, pocket personal computers (Pocket Personal Computer, PPC), personal digital assistants (Personal Digital Assistant, PDA), notebook computers, tablet computers, wearable devices, vehicle-mounted devices, etc.
  • in a vehicle-to-everything (V2X) communication system, the terminal device may also be a vehicle-mounted device. It should be understood that the embodiments of the present disclosure do not limit the specific technology or specific device form used by the terminal device.
  • the deep network model involved in this method can adopt the network structure described in Figure 4 and Figure 5 above.
  • the present disclosure can use a monocular vision system to implement a 2D image depth estimation task, the input of which is only a single color picture and the output is a depth map represented by grayscale values.
  • this method can also be extended to tasks such as computational photography and autonomous driving situation awareness.
  • the method may include the following steps:
  • step S11 the image to be processed is obtained.
  • the terminal device may obtain an image to be processed that requires depth prediction.
  • the image to be processed can be obtained from other devices through the network, or can be obtained by photographing the terminal device, or can be pre-stored on the terminal device, which is not limited by this disclosure.
  • the network may be implemented using Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Orthogonal Frequency-Division Multiple Access (OFDMA), Single-Carrier FDMA (SC-FDMA), Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), or other methods.
  • the network can be divided by generation into 2G, 3G, and 4G networks, or future evolved networks such as the fifth-generation wireless communication system (The 5th Generation Wireless Communication System, 5G) network; a 5G network can also be called New Radio (NR).
  • step S12 the image to be processed is input to the deep network model, and the image depth of the image to be processed is predicted.
  • the image to be processed obtained in S12 can be input into the deep network model to obtain the predicted image depth of the image to be processed.
  • the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, the depth gradient error, and the depth structure loss parameter.
  • the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth;
  • the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth;
  • the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image.
  • the deep network model used in this disclosure uses depthwise separable convolutions, thereby realizing the deployment of the deep network model on the terminal device, effectively saving running time and avoiding the situation where large-scale models run too slowly on the terminal device and are therefore unsuitable.
  • the training process can be as shown in Figure 7, for example.
  • Figure 7 is a flowchart of a deep network model training method according to an exemplary embodiment. This method can be run on a service device.
  • the service device may be a server or server cluster. Of course, in other examples, it can also be a server or server cluster running on a virtual machine.
  • the method may include the following steps:
  • step S21 a preconfigured training image set is obtained.
  • the training image set includes training images and label depths corresponding to the training images.
  • the service device may obtain a preconfigured set of training images.
  • the training image set may be pre-stored on the service device.
  • the training image set can also be stored in a database, and the service device obtains the training image set by connecting to the corresponding database.
  • the training image set includes training images and the label depth corresponding to the training images.
  • a large amount of dense depth data can be obtained through network data collection, optical flow estimation, binocular stereo matching, and prediction by a depth estimation teacher model, in order to generate training image sets. It can be understood that "dense" means that, through the above methods, a corresponding label depth is determined for each pixel in the training image.
  • the images in the training image set may be images that have been masked for a set area.
  • the setting area can be the background area of the image, such as the sky area, ocean area, etc.
  • taking the sky area as an example of the set area: training images involving outdoor scenes usually include the sky.
  • the color of some parts of the sky, clouds, etc. may have a corresponding impact on depth estimation, such as incorrect estimation of the depth of clouds. Therefore, a pre-trained sky segmentation model can also be used to segment the sky area to obtain the sky mask S_mask.
  • S_mask can be used to process the image containing the sky to mark the sky area with the corresponding label depth.
  • the effective area mask can be used to process the image obtained by binocular stereo matching to obtain a processed image.
  • V_mask can be used to represent the effective area mask. After the image is processed through V_mask, the resulting images are all valid area images, thus effectively preventing pixels not in the area (i.e., pixels in the invalid area) from participating in corresponding calculations in the subsequent training process, such as loss calculations. It can be understood that the image processed by V_mask can also be processed by S_mask. In other words, the effective area image may include annotated background areas. Afterwards, the masked image can be used as a training image for subsequent training.
  • the acquired training image set can include training images in any scene, so that the generalization ability of the trained deep network model is significantly improved.
  • different training image sets can be divided according to different times and different collection methods, for example {Data_1, Data_2, ..., Data_n}.
  • Data_n represents the n-th training image set.
  • different sampling weights can be set for different training image sets based on the amount of data contained in each training image set. It can be, for example, as shown in Equation 1,
  • p_j represents the sampling weight of the j-th training image set during training.
  • N() represents the counting statistics of samples in the training image set, which can be understood as the amount of data in the training image set.
  • the training image and the label depth corresponding to the training image can be obtained from the corresponding training image set based on the sampling weight calculated by Formula 1 (a sketch of such weighted sampling is given below). This ensures that the training process will not be biased towards a certain type of training image, giving the trained deep network model strong generalization ability so that it can be applied to depth estimation of images from a variety of different scenes.
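A minimal sketch of this weighted sampling is given below. Formula 1 is not reproduced above, so normalizing each set's sample count N(Data_j) by the total count is only an assumed form of the sampling weight p_j, and the dataset structure used here is likewise hypothetical.

```python
# Hedged sketch of sampling (training image, label depth) pairs across training image sets.
import random

def sampling_weights(datasets):
    counts = [len(d) for d in datasets]         # N(Data_j): amount of data in each set
    total = sum(counts)
    return [c / total for c in counts]          # assumed form of the sampling weight p_j

def sample_training_example(datasets):
    weights = sampling_weights(datasets)
    chosen = random.choices(datasets, weights=weights, k=1)[0]
    image, label_depth = random.choice(chosen)  # each entry is a (training image, label depth) pair
    return image, label_depth
```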
  • the images in the training image set can also be changed through data augmentation methods such as horizontal flipping of images, random cropping, and color changes, thereby expanding the amount of data in the data set to meet training needs.
  • step S22 the training image is input to the initial depth model, and the estimated depth corresponding to the training image is determined.
  • the service device can input the training images in the training image set obtained in S21 into an initial depth model composed of depth-separable convolutions for training, and obtain the estimated depth corresponding to the training images.
  • the training images in each training image set obtained based on the sampling weight may be sequentially input into the initial depth model for training.
  • after a training image is input into the initial depth model, the output result can be recorded as p, which is usually a value between 0 and 1.
  • p can be range-clipped (clip() denotes this cropping) and then multiplied by 255 to obtain the relative depth result D of each pixel.
  • the relative depth result D can be used as the estimated depth corresponding to the training image.
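As a small illustration of this step, the sketch below assumes the raw output p lies in [0, 1] and that clip() crops it to that range before scaling by 255; both assumptions follow the description rather than a reproduced formula.

```python
# Hedged sketch: raw network output p -> per-pixel relative depth D (the estimated depth).
import numpy as np

def to_relative_depth(p):
    p = np.clip(p, 0.0, 1.0)   # clip(): range cropping of the raw output (assumed bounds)
    return p * 255.0           # relative depth result D
```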
  • step S23 a loss function is used to adjust the initial depth model based on the estimated depth and the label depth.
  • the service device may calculate the loss function based on the estimated depth obtained in S22 and the label depth corresponding to the corresponding training image, and adjust the initial depth model based on the calculated loss function.
  • SGD can be used to adjust the initial depth model, that is, update the corresponding parameters in the initial depth model. It is understandable that the hyperparameters in the model will not be adjusted or updated during the training process.
  • step S24 until the loss function converges, the trained deep network model is obtained.
  • the trained deep network model can be obtained. It can be understood that since the deep network model is composed of depth-separable convolutions, it is suitable for deployment on terminal devices, so that the terminal devices can perform depth estimation of images based on the deep network model.
  • the deep network model obtained by training in this disclosure uses depthwise separable convolutions, thereby realizing the deployment of the deep network model on the terminal device, effectively saving running time and avoiding the situation where large models run too slowly on the terminal device and are therefore unsuitable.
  • step S31 the target loss function is determined based on at least one of the first loss function, the second loss function, and the third loss function.
  • the target loss function for adjusting the initial depth model may be determined based on at least one of the first loss function, the second loss function, and the third loss function.
  • the first loss function can be determined based on the error weight parameter
  • the second loss function can be determined based on the depth gradient error
  • the third loss function can be determined based on the depth structure loss parameter.
  • in the present disclosure, multiple loss functions can be combined to train the deep network model, thereby ensuring that the trained deep network model can identify image depth more accurately.
  • the first loss function involved in S31 can be determined through the following steps:
  • step S41 the absolute value of the error is determined based on the estimated depth and the tag depth.
  • the absolute value of the error can be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the corresponding training image. For example, it can be recorded as abs(D_pred - D_target), where abs() represents the absolute value, D_pred represents the estimated depth predicted by the initial depth model, and D_target represents the label depth corresponding to the corresponding training image.
  • step S42 the first loss function is determined based on the absolute value of the error and the error weight parameter.
  • the first loss function may be determined based on the absolute value of the error determined in step S41 and the error weight parameter W.
  • the error weight parameter W can be determined based on the estimated depth and label depth.
  • the error weight parameter W can be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the corresponding training image.
  • W can be calculated by formula 2.
  • pow(D_pred - D_target, 2) denotes the square of D_pred - D_target. It can be understood that this weight value imposes a greater loss weight on areas where the prediction error deviation is relatively large; therefore, Formula 2 can also be considered to express a focal attribute.
  • the first loss function can be calculated through Formula 3 and Formula 4.
  • the L_1 loss function can be calculated using the absolute value of the error
  • W · L_1 represents the L_1 loss function with the focal attribute.
  • when calculating the L_1 loss function, it can be calculated based on an effective area mask or a sky mask. For example, V_mask indicates whether each pixel in the effective area is valid (for example, 0 and 1 can be used to distinguish invalid and valid pixels), and abs(D_pred - D_target) represents the absolute value of the error at each pixel in that area.
  • calculation can also be performed pixel by pixel, which is not limited by this disclosure.
  • This disclosure introduces a weight attribute when determining the first loss function, so as to assign a higher loss weight to locations with a large prediction error deviation. The initial depth model can thus be better trained, making the recognition results of the trained deep network model more accurate.
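The first loss function can be summarized with the sketch below. The error weight W = pow(D_pred - D_target, 2) follows Formula 2 as described; Formulas 3 and 4 are not reproduced above, so masking with V_mask and averaging over the valid pixels is an assumed reduction, not the patented form.

```python
# Hedged sketch of the focal-weighted L1 (first) loss over the valid-area mask.
import numpy as np

def focal_l1_loss(d_pred, d_target, v_mask):
    abs_err = np.abs(d_pred - d_target)      # absolute value of the error
    w = (d_pred - d_target) ** 2             # error weight parameter W (focal attribute)
    valid = v_mask > 0
    return np.sum(w[valid] * abs_err[valid]) / max(int(valid.sum()), 1)
```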
  • the present disclosure can also adjust the initial depth model through one or more of the following loss functions, such as the second loss function and the third loss function.
  • the label depth may include label depth gradients in at least two directions.
  • the second loss function involved in S31 can be determined through the following steps:
  • step S51 estimated depth gradients in at least two directions are determined based on the estimated depth and using a preset gradient function.
  • the estimated depth predicted according to the initial depth model can be brought into a preset gradient function to determine the estimated depth gradient in at least two directions.
  • two different directions can usually be selected.
  • more directions or fewer directions can be selected, which is not limited by this disclosure.
  • the directions may be x and y directions.
  • the estimated depth gradients in the x and y directions can be calculated by Formula 5 and Formula 6.
  • here, D can be D_pred.
  • the gradient D_x in the x direction is calculated through Formula 5, and the gradient D_y in the y direction is calculated through Formula 6. Both can be expressed as convolutions of the data with the Sobel operator, using the matrices commonly associated with the Sobel operator in the x and y directions.
  • the matrix can be adjusted according to the actual situation, so that gradient data in more directions can be obtained.
  • the gradients involved in Figure 10 have different meanings from the gradients used in SGD for training.
  • the gradients involved in SGD refer to the gradient of the change of the loss function.
  • the gradient involved in Figure 10 refers to the gradient of depth changes in different areas in the training image.
  • step S52 label depth gradients in at least two directions are determined according to the label depth and the gradient function.
  • the label depth may include label depth gradients in at least two directions. Wherein, at least two directions involved in the label depth gradient are the same direction as at least two directions involved in the estimated depth gradient.
  • the label depth gradient in the x direction and the label depth gradient in the y direction can also be calculated based on Formula 5 and Formula 6, respectively.
  • step S53 depth gradient errors in at least two directions are determined based on estimated depth gradients in at least two directions and label depth gradients in at least two directions.
  • depth gradient errors in at least two directions may be determined based on the estimated depth gradients in at least two directions determined in S51 and the label depth gradients in at least two directions determined in S52. For example, for the x direction, the depth gradient error in the x direction can be calculated from the estimated depth gradient and the label depth gradient in the x direction; in the same way, the depth gradient error in the y direction can be calculated from the estimated depth gradient and the label depth gradient in the y direction.
  • step S54 a second loss function is determined based on depth gradient errors in at least two directions.
  • the second loss function may be determined based on the depth gradient errors in at least two directions determined in S53. Among them, the second loss function can also be called the gradient loss function.
  • the second loss function can be calculated by formula 7,
  • L_grad represents the second loss function.
  • it can be calculated based on an effective area mask or a sky mask, that is, by combining whether each pixel in the effective area mask or sky mask is valid with the absolute value of the gradient error of all pixels in the x direction and the absolute value of the gradient error of all pixels in the y direction.
  • calculation can also be performed pixel by pixel, which is not limited by this disclosure.
  • the present disclosure also introduces gradient loss, so that the trained deep network model can more accurately identify the gradient of the image, thereby better identifying the boundaries of different depths.
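The gradient (second) loss can be sketched as follows. The Sobel kernels are the commonly used ones mentioned above; Formula 7 is not reproduced, so summing the absolute gradient errors in the x and y directions and averaging over the valid mask is an assumption consistent with the description.

```python
# Hedged sketch of the gradient (second) loss using Sobel gradients in x and y.
import numpy as np
from scipy.ndimage import convolve

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def gradient_loss(d_pred, d_target, v_mask):
    gx_err = np.abs(convolve(d_pred, SOBEL_X) - convolve(d_target, SOBEL_X))  # error in x
    gy_err = np.abs(convolve(d_pred, SOBEL_Y) - convolve(d_target, SOBEL_Y))  # error in y
    valid = v_mask > 0
    return (gx_err[valid] + gy_err[valid]).sum() / max(int(valid.sum()), 1)
```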
  • the third loss function involved in S31 can be determined through the following steps:
  • step S61 multiple training data groups are determined for the training images.
  • the service device may determine multiple training data groups for each training image.
  • Each training data group may be composed of at least two pixels in the training image. Therefore, the label depth data of the training image may include depth label data corresponding to the corresponding pixel point.
  • determining multiple training data groups may include: using a preset gradient function to determine the gradient boundary of the training image. Then, multiple training data groups are determined by sampling on the training images according to the gradient boundaries. It can be understood that the process of determining multiple training data groups may be referred to as pair sampling.
  • pair sampling is performed based on gradient boundaries, which can also be called deep structure sampling.
  • the gradient of the training image can be calculated based on the sobel operator to determine which areas have large gradient changes.
  • the gradient threshold can be set in advance. When the gradient difference at different locations is greater than or equal to the gradient threshold, the depth gradient can be considered to have changed significantly, and the corresponding gradient boundary can be determined. It can be understood that for the second loss function described in Figure 10, the accuracy of this part of the gradient boundary calculation can be effectively adjusted. For example, as shown in Figure 11, assuming that the depth map corresponding to the training image is the depth map shown in Figure 12, the schematic diagram of pair sampling based on the gradient boundary can be as shown in Figure 13.
  • each pixel point sampled by the depth structure is located around the gradient boundary.
  • each training data set can be composed of two pixels in the training image. Then you can collect a pixel on one side of the gradient boundary, and then collect a pixel on the gradient boundary to form a training data group. Or you can also collect a pixel on one side of the gradient boundary and then collect a pixel on the other side of the gradient boundary to form a training data group. In the above manner, multiple training data groups can be determined.
  • the present disclosure determines the training data group based on the gradient boundary, which can make the distinction between different depths more accurate in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.
  • determining multiple training data groups may also include: randomly sampling on the training images to determine multiple training data groups.
  • the present disclosure can also perform random pair sampling on the training image, that is, random sampling. Multiple training groups are thereby determined. For example, as shown in Figure 14, it can be seen that the multiple training data groups obtained by random sampling are evenly distributed, thus effectively retaining pixels corresponding to areas with slow gradient changes.
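The two sampling strategies can be sketched together as follows, reusing the SOBEL_X and SOBEL_Y kernels from the gradient-loss sketch above. The gradient threshold, the neighborhood radius around the boundary, and the pair counts are illustrative assumptions, not values given in the text.

```python
# Hedged sketch of pair sampling: depth-structure sampling near gradient boundaries,
# and uniform random sampling over the whole image.
import numpy as np
from scipy.ndimage import convolve

def structure_pairs(label_depth, grad_threshold, num_pairs):
    gx = convolve(label_depth, SOBEL_X)
    gy = convolve(label_depth, SOBEL_Y)
    boundary = np.argwhere(np.hypot(gx, gy) >= grad_threshold)     # gradient boundary pixels
    anchors = boundary[np.random.randint(len(boundary), size=num_pairs)]
    offsets = np.random.randint(-3, 4, size=(num_pairs, 2))        # nearby partner pixels (assumed radius)
    partners = np.clip(anchors + offsets, 0, np.array(label_depth.shape) - 1)
    return list(zip(map(tuple, anchors), map(tuple, partners)))

def random_pairs(shape, num_pairs):
    pts = np.random.randint(0, shape, size=(num_pairs, 2, 2))      # uniformly distributed pixel pairs
    return [(tuple(p[0]), tuple(p[1])) for p in pts]
```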
  • step S62 for each training data group, determine the depth structure loss parameter based on the label depth corresponding to at least two pixels.
  • the service device can determine the depth structure loss parameter for each training data group based on the label depth corresponding to at least two pixels in the training data group. In some examples, if the training data set includes two pixels, the depth structure loss parameter can be determined based on the label depth corresponding to the two pixels in the training data set.
  • the depth structure loss parameter can be calculated using Equation 8.
  • a and b respectively represent two pixels in the training data group.
  • step S63 for each training data group, a third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to at least two pixels.
  • the service device may determine the third loss function based on the depth structure loss parameter determined in S62 and the estimated depths corresponding to the at least two pixels included in the training data group.
  • the third loss function can also be called the depth map structure sampling loss function.
  • the third loss function can be calculated through Formula 9.
  • L_pair represents the third loss function, which contains an exponential (e-power) term. It can be seen that in some examples the third loss function is calculated using a ranking loss.
  • the present disclosure can also determine the third loss function through training data groups of two pixels each, so that in the subsequent training process different depth areas in the image can be sampled more accurately, thereby making the recognition results of the trained deep network model more accurate.
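Equations 8 and 9 are not reproduced above, so the sketch below assumes the common ranking-loss form: the depth structure loss parameter encodes which pixel of the pair has the larger label depth, and the pair loss penalizes predicted depths whose ordering disagrees through an exponential term. Both the sign-based parameter and the squared-difference handling of near-equal pairs are assumptions.

```python
# Hedged sketch of the depth-structure (third) loss for one sampled pixel pair (a, b).
import numpy as np

def structure_parameter(label_a, label_b, margin=0.0):
    if label_a - label_b > margin:
        return 1.0      # pixel a is labeled deeper than pixel b
    if label_b - label_a > margin:
        return -1.0     # pixel b is labeled deeper than pixel a
    return 0.0          # labels are (nearly) equal

def pair_ranking_loss(pred_a, pred_b, psi):
    if psi == 0.0:
        return (pred_a - pred_b) ** 2                      # assumed handling of equal-depth pairs
    return np.log(1.0 + np.exp(-psi * (pred_a - pred_b)))  # ranking loss with an e-power term
```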
  • the target loss function can be determined based on at least one of the above-mentioned first loss function, second loss function, and third loss function.
  • Formula 10 provides a method for determining the target loss function.
  • where L total represents the target loss function, λ 1 is the weighting coefficient of the first loss function, λ 2 is the weighting coefficient of the second loss function, and λ 3 is the weighting coefficient of the third loss function.
  • the specific values of λ 1 , λ 2 and λ 3 can be set according to actual conditions, and are not limited in this disclosure. Obviously, the specific values of λ 1 , λ 2 and λ 3 can be adjusted so that one or more of the first loss function, the second loss function and the third loss function are used to obtain the target loss function for training the deep network model.
  • the first loss function, the second loss function and the third loss function serve different purposes in adjusting the model; a hedged reconstruction of their weighted combination is sketched below.
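  • Formula 10 itself is not reproduced in this text. Based on the surrounding description (a weighted combination of the three losses, each with its own weighting coefficient), it can be reconstructed in the following hedged form, where the weight symbols are assumptions:

```latex
% Hedged reconstruction of Formula 10
L_{\mathrm{total}} = \lambda_1 \, L_{\mathrm{first}} + \lambda_2 \, L_{\mathrm{second}} + \lambda_3 \, L_{\mathrm{third}}
```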
  • before the input image, which may be an image to be processed or a training image, is fed into the deep network model, the method may further include: normalizing the input image to obtain a normalized input image. The normalized input image is then fed into the deep network model.
  • the normalization can be performed using the mean and variance of the color channels of the input image, for example as shown in Equation 11.
  • where m represents the mean, v represents the variance, I represents the input image, and I norm represents the normalized input image.
  • the color channels of I can be arranged according to BGR. It can be understood that B represents the blue channel, G represents the green channel, and R represents the red channel.
  • m can take the values (0.485, 0.456, 0.506), and v can take the values (0.229, 0.224, 0.225). It can be understood that the above m and v are merely exemplary values; in other examples, any values can be set according to the actual situation, which is not limited in this disclosure.
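  • As an illustration of this normalization, a per-channel sketch consistent with the description of Equation 11 is given below. Since Equation 11 itself is not reproduced here, the elementwise form (I - m) / v and the initial scaling to [0, 1] are assumptions; the channel statistics are simply the exemplary values quoted above.

```python
import numpy as np

# exemplary per-channel statistics quoted in the description, channels in BGR order
MEAN = np.array([0.485, 0.456, 0.506], dtype=np.float32)
VAR = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_input(image_bgr):
    """Normalize an HxWx3 BGR image before feeding it to the deep network model."""
    image = image_bgr.astype(np.float32) / 255.0   # bring pixel values into [0, 1]
    return (image - MEAN) / VAR                    # assumed elementwise form of Equation 11
```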
  • the label depths corresponding to the training images can also be normalized. Since the training image sets come from a wide range of sources, the scale and shift of the label depths corresponding to the training images in different training image sets may be inconsistent; therefore, the label depths corresponding to the training images can be normalized. Of course, it is understandable that this method can also normalize the estimated depth predicted by the initial depth model.
  • the median label depth can be calculated by Equation 12.
  • where D t represents the median value of D, D represents the label depth corresponding to each pixel in the effective area, and M represents the number of effective pixels in the effective area V mask . It is understood that V mask can also be replaced by S mask .
  • the normalized label depth can be determined based on D t determined in Equation 12 and D s determined in Equation 13, that is, as shown in Equation 14.
  • the present disclosure can also normalize the inputs of the model, ensuring that training is carried out on the same scale during the training process, thereby making the recognition results of the trained deep network model more accurate.
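  • Equations 12 to 14 are likewise not reproduced in this text. A hedged sketch of a median-and-scale normalization consistent with the description (D_t as the median over the effective area, D_s as a scale term, and the normalized depth computed from the two) might look as follows; the exact definition of D_s in Equation 13 is an assumption.

```python
import numpy as np

def normalize_label_depth(depth, valid_mask):
    """Normalize a label depth map over the effective area V_mask (a boolean mask)."""
    valid = depth[valid_mask]
    d_t = np.median(valid)                        # D_t (Equation 12): median over the effective area
    d_s = np.mean(np.abs(valid - d_t)) + 1e-8     # assumed D_s (Equation 13): mean absolute deviation
    normalized = np.zeros_like(depth, dtype=np.float32)
    normalized[valid_mask] = (valid - d_t) / d_s  # assumed form of Equation 14
    return normalized
```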
  • color pictures in jpg image format may be used as training pictures.
  • depth images in png image format can be used as depth label data corresponding to training images.
  • if training is performed based on pair training groups, in some examples, about 800,000 training data groups, that is, pair groups, can be used.
  • the deep network model used in this disclosure has depthwise separable convolutions, thereby realizing the deployment of the deep network model on terminal devices, effectively saving running time, and avoiding the situation where large models take a long time to run on terminal devices and do not adapt to them.
  • embodiments of the present disclosure also provide an image depth prediction device.
  • the image depth prediction device provided by the embodiments of the present disclosure includes hardware structures and/or software modules corresponding to each function. Combined with the units and algorithm steps of each example disclosed in the embodiments of the present disclosure, the embodiments of the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or computer software driving the hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered to go beyond the scope of the technical solutions of the embodiments of the present disclosure.
  • Figure 15 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.
  • the device 100 may include: an acquisition module 101, used to acquire an image to be processed; a prediction module 102, used to input the image to be processed into a deep network model, and predict the image depth of the image to be processed.
  • the deep network model is composed of multi-layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, depth gradient error, and depth structure loss parameter; the error The weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; The estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
  • the deep network model used in this disclosure has depthwise separable convolutions, thereby realizing the deployment of the deep network model on terminal devices, effectively saving running time, and avoiding the situation where large models take a long time to run on terminal devices and do not adapt to them.
  • the target loss function is determined in the following manner: the target loss function is determined based on at least one of the first loss function, the second loss function, and the third loss function; wherein the first loss function is based on The error weight parameter is determined, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  • the present disclosure can be combined with multiple loss functions to train a deep network model, thereby ensuring that the trained deep network model can more accurately identify the depth of the image.
  • the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; and based on the absolute value of the error and the error weight parameter, the first loss function is determined.
  • This disclosure introduces a weight attribute when determining the first loss function, so as to assign a higher loss weight to locations with large prediction error deviations, so that the initial depth model can be better trained, making the recognition results of the trained deep network model more accurate.
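  • As a hedged sketch of such a weighted absolute-error loss (the exact expression of the first loss function is not reproduced here), the error weight parameter could, for example, grow with the absolute error so that locations with large prediction deviations contribute more; the weighting scheme below is purely illustrative.

```python
import torch

def weighted_abs_error_loss(estimated, label, valid_mask):
    """First-loss-style term: absolute depth error scaled by an error-dependent weight."""
    abs_err = torch.abs(estimated - label)[valid_mask]
    weight = 1.0 + abs_err.detach()   # illustrative error weight: larger deviation, larger weight
    return (weight * abs_err).mean()
```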
  • the depth gradient error is determined in the following manner: based on the estimated depth and a preset gradient function, the estimated depth gradient in at least two directions is determined; based on the label depth and the gradient function, the label depth gradient in at least two directions is determined; and the depth gradient error in at least two directions is determined according to the estimated depth gradient in at least two directions and the label depth gradient in at least two directions. The second loss function is determined in the following way: the second loss function is determined based on the depth gradient error in the at least two directions.
  • the present disclosure also introduces gradient loss, so that the trained deep network model can more accurately identify the gradient of the image, thereby better identifying the boundaries of different depths.
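  • A minimal sketch of a two-direction depth gradient loss consistent with this description is shown below; the finite-difference gradient function and the L1 comparison of gradients are assumptions rather than the exact formula used in the present disclosure.

```python
import torch

def gradient_loss(estimated, label):
    """Second-loss-style term: compare depth gradients along the x and y directions."""
    def grads(d):
        gx = d[:, 1:] - d[:, :-1]   # horizontal finite differences
        gy = d[1:, :] - d[:-1, :]   # vertical finite differences
        return gx, gy

    est_gx, est_gy = grads(estimated)
    lab_gx, lab_gy = grads(label)
    return torch.mean(torch.abs(est_gx - lab_gx)) + torch.mean(torch.abs(est_gy - lab_gy))
```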
  • the depth structure loss parameters are determined in the following manner: multiple training data groups are determined, wherein the training data groups include at least two pixels, and the at least two pixels are pixels of the training image, and the label The depth includes training labels corresponding to at least two pixels in the training image; for each training data group, the depth structure loss parameters are determined based on the depth of the labels corresponding to at least two pixels; the third loss function is determined in the following way: for each A training data group is formed, and a third loss function is determined based on the depth structure loss parameter and the estimated depth corresponding to at least two pixels.
  • the present disclosure can also determine the third loss function through training data groups of two pixels each, so that different depth regions in the image are sampled more accurately during the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.
  • determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; sampling the training image according to the gradient boundary to determine multiple training data groups.
  • the present disclosure determines the training data group based on the gradient boundary, which can make the distinction between different depths more accurate in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.
  • the training images are determined in the following manner: at least one training image set in any scene is determined, where the images in the training image set are images after mask processing of a set area; and training images are determined from the at least one training image set according to preconfigured sampling weights.
  • the present disclosure enables the trained deep network model to have stronger generalization capabilities and can achieve depth prediction in any scenario. And by masking the set area, the processed image is more conducive to model training and improves the prediction accuracy after model training.
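  • As an illustration of drawing training images from several masked training image sets using preconfigured sampling weights, a minimal sketch is given below; the two-stage draw (set first, then image) and all names are assumptions for illustration.

```python
import numpy as np

def pick_training_image(image_sets, sampling_weights, rng=None):
    """Choose one training image: first pick an image set according to its preconfigured
    sampling weight, then pick an image uniformly from within that set."""
    rng = rng or np.random.default_rng()
    weights = np.asarray(sampling_weights, dtype=np.float64)
    set_idx = rng.choice(len(image_sets), p=weights / weights.sum())
    chosen_set = image_sets[set_idx]
    return chosen_set[rng.integers(len(chosen_set))]
```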
  • FIG. 16 is a block diagram of an image depth prediction device 200 according to an exemplary embodiment.
  • the device 200 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, a RedCap terminal and other terminal devices.
  • device 200 may include one or more of the following components: processing component 202, memory 204, power component 206, multimedia component 208, audio component 210, input/output (I/O) interface 212, sensor component 214, and Communication component 216.
  • Processing component 202 generally controls the overall operations of device 200, such as operations associated with display, phone calls, data communications, camera operations, and recording operations.
  • the processing component 202 may include one or more processors 220 to execute instructions to complete all or part of the steps of the above method.
  • processing component 202 may include one or more modules that facilitate interaction between processing component 202 and other components.
  • processing component 202 may include a multimedia module to facilitate interaction between multimedia component 208 and processing component 202.
  • Memory 204 is configured to store various types of data to support operations at device 200 . Examples of such data include instructions for any application or method operating on device 200, contact data, phonebook data, messages, pictures, videos, etc.
  • Memory 204 may be implemented by any type of volatile or non-volatile storage device, or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, or a magnetic or optical disk.
  • Power component 206 provides power to various components of device 200 .
  • Power components 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to device 200 .
  • Multimedia component 208 includes a screen that provides an output interface between the device 200 and the user.
  • the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user.
  • the touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide action.
  • multimedia component 208 includes a front-facing camera and/or a rear-facing camera.
  • the front camera and/or the rear camera can receive external multimedia data.
  • Each front-facing camera and rear-facing camera can be a fixed optical lens system or have a focal length and optical zoom capabilities.
  • Audio component 210 is configured to output and/or input audio signals.
  • audio component 210 includes a microphone (MIC) configured to receive external audio signals when device 200 is in operating modes, such as call mode, recording mode, and speech recognition mode. The received audio signals may be further stored in memory 204 or sent via communications component 216 .
  • audio component 210 also includes a speaker for outputting audio signals.
  • the I/O interface 212 provides an interface between the processing component 202 and a peripheral interface module, which may be a keyboard, a click wheel, a button, etc. These buttons may include, but are not limited to: Home button, Volume buttons, Start button, and Lock button.
  • Sensor component 214 includes one or more sensors for providing various aspects of status assessment for device 200 .
  • the sensor component 214 can detect the open/closed state of the device 200 and the relative positioning of components, such as the display and keypad of the device 200; the sensor component 214 can also detect a change in position of the device 200 or a component of the device 200, the presence or absence of user contact with the device 200, the orientation or acceleration/deceleration of the device 200, and temperature changes of the device 200.
  • Sensor assembly 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact.
  • Sensor assembly 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications.
  • the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
  • Communication component 216 is configured to facilitate wired or wireless communications between device 200 and other devices.
  • Device 200 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G or 5G, or a combination thereof.
  • the communication component 216 receives broadcast signals or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communications component 216 also includes a near field communications (NFC) module to facilitate short-range communications.
  • the NFC module can be implemented based on radio frequency identification (RFID) technology, infrared data association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology and other technologies.
  • device 200 may be implemented by one or more application specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors or other electronic components, for executing the above method.
  • a non-transitory computer-readable storage medium including instructions, such as the memory 204 including instructions, is also provided; the instructions can be executed by the processor 220 of the device 200 to complete the above method.
  • the non-transitory computer-readable storage medium may be ROM, random access memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
  • FIG. 17 is a schematic diagram of a deep network model training device 300 according to an exemplary embodiment.
  • device 300 may be provided as a server. It can be understood that the device 300 can be used to implement training of a deep network model.
  • device 300 includes processing component 322 , which further includes one or more processors, and memory resources, represented by memory 332 , for storing instructions, such as application programs, executable by processing component 322 .
  • the application program stored in memory 332 may include one or more modules, each of which corresponds to a set of instructions.
  • the processing component 322 is configured to execute instructions to perform the process method of training the deep network model in the above method.
  • Device 300 may also include a power supply component 326 configured to perform power management of device 300, a wired or wireless network interface 350 configured to connect device 300 to a network, and an input-output (I/O) interface 358.
  • Device 300 may operate based on an operating system stored in memory 332, such as Windows ServerTM, Mac OS XTM, UnixTM, LinuxTM, FreeBSDTM or the like.
  • the depth estimation network comprehensively designed in this disclosure is lightweight and highly accurate, and can be deployed in handheld device scenarios.
  • an efficient semantic information acquisition module EASPP is designed in the depth estimation network to capture semantic information with a smaller amount of parameters and calculations.
  • the present disclosure also uses a multi-dimensional depth loss function to improve the quality of depth prediction at object edges.
  • the above-mentioned solution of the present disclosure solves the problem that the depth estimation model takes a long time in the handheld device scene, and can be deployed on the mobile phone.
  • the floating-point model can run in about 150 ms on the Qualcomm 7325 platform.
  • the present disclosure can achieve depth estimation in any scenario and has stronger generalization capabilities.
  • the present disclosure can also accurately predict the depth of portrait edges, green plant boundaries, etc., and meet the requirements for depth estimation results when shooting blurred scenes.
  • “plurality” in this disclosure refers to two or more, and other quantifiers are similar.
  • “And/or” describes the relationship between related objects, indicating that there can be three relationships.
  • A and/or B can mean: A exists alone, A and B exist simultaneously, or B exists alone.
  • the character “/” generally indicates that the related objects are in an “or” relationship.
  • the singular forms "a", "said" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
  • first, second, etc. are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish information of the same type from each other and do not imply a specific order or importance. In fact, expressions such as “first” and “second” can be used interchangeably.
  • first information may also be called second information, and similarly, the second information may also be called first information.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Processing (AREA)
  • Image Analysis (AREA)

Abstract

The present invention relates to an image depth prediction method and apparatus, a device, and a storage medium. The method comprises: obtaining an image to be processed; and inputting the image to be processed into a deep network model to predict the image depth of the image to be processed, wherein the deep network model consists of multiple depthwise separable convolution layers, and the deep network model is trained by means of a target loss function determined on the basis of at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter. The deep network model used in the present invention has depthwise separable convolutions; thus, deployment of the deep network model on a terminal device is implemented, the running time is effectively reduced, and the situation where large models take a long time to run on terminal devices and do not adapt to terminal devices is avoided.

Description

An image depth prediction method, device, equipment and storage medium

Technical Field
本公开涉及计算机视觉技术领域,尤其涉及一种图像深度预测方法、装置、设备及存储介质。The present disclosure relates to the field of computer vision technology, and in particular, to an image depth prediction method, device, equipment and storage medium.
Background
随着科技水平的快速发展,终端设备成为了日常生活中人们不可或缺的物品。在一些情况下,终端设备需要对一张图片进行深度估计,从而完成后续任务。例如,人们在进行拍照时,往往想要达到相机拍照时产生的虚化。但手机上的镜头无法和相机镜头相比,此时需要手机对拍照的图片进行相应的深度估计,从而可以针对图片中不同深度的区域进行相应的虚化。又或者是车载设备通过获取图像并进行深度估计,从而可以实现自动驾驶态势的感知。With the rapid development of science and technology, terminal equipment has become an indispensable item in people's daily lives. In some cases, the terminal device needs to estimate the depth of a picture to complete subsequent tasks. For example, when people take pictures, they often want to achieve the blur produced by the camera when taking pictures. However, the lens on the mobile phone cannot be compared with the camera lens. At this time, the mobile phone needs to estimate the depth of the picture taken, so that the areas of different depths in the picture can be blurred accordingly. Or the vehicle-mounted device can realize autonomous driving situation perception by acquiring images and performing depth estimation.
相关技术中,所使用的深度估计方案通常会采用参数量、计算量较大的网络。此类方案往往会部署在大型计算设备上,例如大型服务设备、服务集群等。显然,此类方案无法适配终端设备。In related technologies, the depth estimation scheme used usually uses a network with a large number of parameters and a large amount of calculation. Such solutions are often deployed on large-scale computing equipment, such as large-scale service equipment, service clusters, etc. Obviously, such a solution cannot be adapted to terminal equipment.
Summary of the Invention
为克服相关技术中存在的问题,本公开提供了一种图像深度预测方法、装置、设备及存储介质。In order to overcome problems existing in related technologies, the present disclosure provides an image depth prediction method, device, equipment and storage medium.
根据本公开实施例的第一方面,提供了一种图像深度预测方法,方法包括:获取待处理图像;将待处理图像输入至深度网络模型,预测待处理图像的图像深度,深度网络模型由多层深度可分离卷积构成;其中,深度网络模型采用目标损失函数训练得到,目标损失函数基于误差权重参数、深度梯度误差以及深度结构损失参数中的至少一项确定;误差权重参数用于表征估计深度和标签深度之间差异的权重,深度梯度误差用于表征估计深度和标签深度之间梯度差异,深度结构损失参数用于表征训练图像中不同位置对应的标签深度差异;估计深度为训练阶段深度网络模型基于训练图像确定的,标签深度与训练图像相对应。According to a first aspect of an embodiment of the present disclosure, an image depth prediction method is provided. The method includes: acquiring an image to be processed; inputting the image to be processed into a deep network model, and predicting the image depth of the image to be processed. The deep network model is composed of multiple It consists of layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, depth gradient error, and depth structure loss parameter; the error weight parameter is used to characterize the estimation The weight of the difference between depth and label depth. The depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth. The depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; the estimated depth is the depth of the training stage. The network model is determined based on the training images, and the label depth corresponds to the training images.
In one implementation, the target loss function is determined in the following manner: the target loss function is determined based on at least one of the first loss function, the second loss function and the third loss function, wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
In one implementation, the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; and based on the absolute value of the error and the error weight parameter, the first loss function is determined.
In one implementation, the depth gradient error is determined in the following manner: based on the estimated depth and a preset gradient function, the estimated depth gradient in at least two directions is determined; based on the label depth and the gradient function, the label depth gradient in at least two directions is determined; and the depth gradient error in at least two directions is determined according to the estimated depth gradient in at least two directions and the label depth gradient in at least two directions. The second loss function is determined in the following manner: the second loss function is determined based on the depth gradient error in the at least two directions.
In one implementation, the depth structure loss parameter is determined in the following manner: multiple training data groups are determined, wherein each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes the training labels corresponding to the at least two pixels in the training image; for each training data group, the depth structure loss parameter is determined based on the label depths corresponding to the at least two pixels. The third loss function is determined in the following manner: for each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
In one implementation, determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; and sampling the training image according to the gradient boundary to determine the multiple training data groups.
In one implementation, the training images are determined in the following manner: at least one training image set in any scene is determined, where the images in the training image set are images after mask processing of a set area; and training images are determined from the at least one training image set according to preconfigured sampling weights.
According to a second aspect of the embodiments of the present disclosure, an image depth prediction apparatus is provided. The apparatus includes: an acquisition module, configured to acquire an image to be processed; and a prediction module, configured to input the image to be processed into a deep network model and predict the image depth of the image to be processed, the deep network model being composed of multi-layer depthwise separable convolutions. The deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter; the error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; the estimated depth is determined by the deep network model based on the training image in the training stage, and the label depth corresponds to the training image.
In one implementation, the target loss function is determined in the following manner: the target loss function is determined based on at least one of the first loss function, the second loss function and the third loss function, wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
In one implementation, the first loss function is determined in the following manner: based on the estimated depth and the label depth, the absolute value of the error between the estimated depth and the label depth is determined; and based on the absolute value of the error and the error weight parameter, the first loss function is determined.
In one implementation, the depth gradient error is determined in the following manner: based on the estimated depth and a preset gradient function, the estimated depth gradient in at least two directions is determined; based on the label depth and the gradient function, the label depth gradient in at least two directions is determined; and the depth gradient error in at least two directions is determined according to the estimated depth gradient in at least two directions and the label depth gradient in at least two directions. The second loss function is determined in the following manner: the second loss function is determined based on the depth gradient error in the at least two directions.
In one implementation, the depth structure loss parameter is determined in the following manner: multiple training data groups are determined, wherein each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes the training labels corresponding to the at least two pixels in the training image; for each training data group, the depth structure loss parameter is determined based on the label depths corresponding to the at least two pixels. The third loss function is determined in the following manner: for each training data group, the third loss function is determined based on the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
In one implementation, determining multiple training data groups includes: using a preset gradient function to determine the gradient boundary of the training image; and sampling the training image according to the gradient boundary to determine the multiple training data groups.
In one implementation, the training images are determined in the following manner: at least one training image set in any scene is determined, where the images in the training image set are images after mask processing of a set area; and training images are determined from the at least one training image set according to preconfigured sampling weights.
According to a third aspect of the embodiments of the present disclosure, an image depth prediction device is provided, including: a processor; and a memory for storing instructions executable by the processor; wherein the processor is configured to execute the method in the first aspect or any one of the implementations of the first aspect.
According to a fourth aspect of the embodiments of the present disclosure, a non-transitory computer-readable storage medium is provided. When instructions in the storage medium are executed by a processor of a computer, the computer is enabled to execute the method described in the first aspect or any one of the implementations of the first aspect.
The technical solutions provided by the embodiments of the present disclosure may include the following beneficial effects: the adopted deep network model has depthwise separable convolutions, thereby realizing the deployment of the deep network model on terminal devices, effectively saving running time, and avoiding the situation where large models take a long time to run on terminal devices and do not adapt to them.
It should be understood that the foregoing general description and the following detailed description are exemplary and explanatory only, and do not limit the present disclosure.
Description of the Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and, together with the description, serve to explain the principles of the disclosure.
Figure 1 is a schematic diagram of a scene according to an exemplary embodiment.
Figure 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment.
Figure 3 is a schematic diagram of a fully supervised depth estimation process according to an exemplary embodiment.
Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment.
Figure 5 is a schematic structural diagram of a semantic perception unit according to an exemplary embodiment.
Figure 6 is a flow chart of an image depth prediction method according to an exemplary embodiment.
Figure 7 is a flow chart of a deep network model training method according to an exemplary embodiment.
Figure 8 is a flow chart of a method for determining a target loss function according to an exemplary embodiment.
Figure 9 is a flow chart of a method for determining a first loss function according to an exemplary embodiment.
Figure 10 is a flow chart of a method for determining a second loss function according to an exemplary embodiment.
Figure 11 is a flow chart of a method for determining a third loss function according to an exemplary embodiment.
Figure 12 is a schematic diagram of a depth map according to an exemplary embodiment.
Figure 13 is a schematic diagram of depth structure sampling according to an exemplary embodiment.
Figure 14 is a schematic diagram of random sampling according to an exemplary embodiment.
Figure 15 is a schematic diagram of an image depth prediction apparatus according to an exemplary embodiment.
Figure 16 is a schematic diagram of an image depth prediction device according to an exemplary embodiment.
Figure 17 is a schematic diagram of a deep network model training device according to an exemplary embodiment.
Detailed Description
Exemplary embodiments will be described in detail herein, examples of which are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings refer to the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present disclosure.
The methods involved in the present disclosure can be applied to scenes in which depth prediction or estimation is performed on images. For example, in the scene shown in Figure 1, a user takes a selfie with a mobile phone and often hopes that the person in the resulting image is relatively clear while the background is appropriately blurred. Obviously, what the user wants is for areas of different depths in the image to be processed accordingly: areas with a smaller depth, that is, the person area, remain clear, while areas with a greater depth, such as the background area, are blurred accordingly.
However, under normal circumstances, whether the captured image is clear is related to the unique design of the lens: the focal plane of the captured image is clear while the regions on either side of it are blurred. Due to the limitations of the lens design on mobile phones, the blurring effect of images captured by a mobile phone lens cannot reach that of images captured by a camera. In this case, some mobile phones can perform depth prediction on different areas of the image and layer the image using the predicted depth information as guidance. The mobile phone can blur different image layers with different blur convolution kernels, and finally fuse the blurring results of the different layers to obtain the final blurred image.
For example, Figure 2 is a schematic diagram of a blurred image based on depth prediction according to an exemplary embodiment. For instance, a user uses a mobile phone to take a selfie in a complex environment. In this case, depth prediction needs to be performed on the image captured by the mobile phone (that is, the leftmost image in Figure 2), and the image can then be layered according to the predicted depth information. Different layers can be blurred using different blur convolutions, and the blurring results of the different layers are finally fused to obtain the final blurred image (that is, the rightmost image in Figure 2).
Under normal circumstances, it can be considered that a user taking pictures with a mobile phone is taking pictures with a monocular camera.
In some related solutions, depth prediction based on a monocular camera can use self-supervised depth estimation; it can be understood that depth prediction can also be called depth estimation. For example, a depth estimation network and a relative pose estimation network can be constructed, and consecutive video frames are used as network input. During training, the depth maps and relative pose relationships of these frames can be estimated through calculation. Then, the mutual mapping relationship between the 3-dimensional (3D) and 2D scenes can be used to minimize the photometric reconstruction error, thereby optimizing the depth map and the relative pose relationship. In actual use, only the trained depth estimation network is used to perform depth prediction on the input video and obtain the corresponding depth map. For example, monocular depth (MonoDepth) can be used as a basis to develop prediction of depth information from video sequences.
Of course, in other related solutions, fully supervised depth estimation can be adopted, for example by building a deep network that uses paired color pictures and depth pictures (that is, labels) as input. The network parameters are updated by applying a loss function to the depth map predicted from the color picture and the depth picture serving as the label, thereby obtaining the trained deep network. For example, taking big-to-small (BTS) as a representative, paired color images and depth pictures (that is, labels) are input to complete the network training task. During training, a color picture is input, and the depth estimation result is obtained through a feature encoder, a feature aggregation network and a feature decoder. The depth estimation result is then used, together with the ground truth (GT) depth, to compute a structural similarity index (SSIM) loss, and the parameters are updated. In use, as shown in Figure 3, a color picture is input, and the feature map F of the input picture is obtained after convolution. Based on this feature map F, convolution can be performed in different dimensions through operations such as a full-image encoder, convolution and atrous spatial pyramid pooling (ASPP) to obtain depth features of different dimensions. The depth features obtained by convolution in the different dimensions can then be superimposed and convolved again, as with convolution x in Figure 3. Afterwards, the feature y obtained after convolution x can be used to determine depth maps at different levels by means of ordinal regression; for example, depth maps for different depth intervals can be determined, where the depth intervals can be divided into l 0 , l 1 , ..., l k-1 and so on, so that the depth maps corresponding to the different depth intervals are finally superimposed to obtain the final output depth map.
In still other related solutions, a monocular camera can be used together with structured light, where the structured light can be, for example, lidar or millimeter-wave radar. The monocular camera and the structured light can be rigidly fixed together. The monocular camera and the structured light can then be calibrated in advance, after which the depth result of the corresponding pixel position can be obtained through the triangulation ranging principle.
Of course, the specific implementation processes of the different related solutions above can be implemented with reference to existing methods, and will not be described in detail in this disclosure.
However, in the related solutions mentioned above, in order to improve the effect of depth prediction as much as possible, networks with large numbers of parameters and large amounts of computation are usually selected, and operators that are not suitable for handheld terminal devices are used, which makes it very difficult to deploy and run such deep networks on handheld terminal devices.
Therefore, the present disclosure provides an image depth prediction method that uses a deep network model composed of multi-layer depthwise separable convolutions to perform depth prediction on an input image. The deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error and a depth structure loss parameter. The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image. The estimated depth is determined by the deep network model based on the training image in the training stage, and the label depth corresponds to the training image. The deep network model used in the present disclosure has depthwise separable convolutions, thereby realizing the deployment of the deep network model on terminal devices, effectively saving running time, and avoiding the situation where large models take a long time to run on terminal devices and do not adapt to them.
Next, the solutions involved in the present disclosure will be described in detail with reference to the accompanying drawings.
Figure 4 is a schematic structural diagram of a deep network model according to an exemplary embodiment. It can be seen that the structure of the deep network model may include at least one encoder, a semantic perception unit and at least one decoder. It can be understood that this network structure can be a trained deep network model or an untrained deep network model; that is, an untrained deep network model has the same structure as a trained one, and training does not change the model structure. The semantic perception unit can include multiple layers of depthwise separable convolutions; therefore, compared with the ASPP in the related art, its computational efficiency is higher and the redundancy of the network structure is smaller. In some examples, the semantic perception unit of the present disclosure can also be regarded as a more efficient ASPP, which may be called, for example, efficient atrous spatial pyramid pooling (EASPP).
It is worth noting that, in this disclosure, a trained deep network model may be called a deep network model, and an untrained deep network model may be called an initial depth model.
In some examples, the encoders and decoders can be composed of depthwise separable convolutions, and the EASPP can also include multiple layers of depthwise separable convolutions. The number of encoders and decoders can be the same, with a one-to-one correspondence. For example, in some examples there may be 5 encoders and correspondingly 5 decoders: encoder 0, encoder 1, encoder 2, encoder 3 and encoder 4, and decoder 0, decoder 1, decoder 2, decoder 3 and decoder 4, where encoder 0 corresponds to decoder 0, encoder 1 corresponds to decoder 1, encoder 2 corresponds to decoder 2, encoder 3 corresponds to decoder 3, and encoder 4 corresponds to decoder 4. In some examples, the first encoder and the decoder corresponding to the first encoder may not include depthwise separable convolutions, while the remaining encoders and their corresponding decoders may include depthwise separable convolutions.
In some examples, the image to be processed can be input into the deep network model; for example, it can be input into at least one encoder to extract depth feature data of the image to be processed. In some examples, the deep network model can include 5 encoders and 5 decoders. In other examples, the number of encoders and decoders can be larger or smaller and can be set arbitrarily according to the actual situation, which is not limited in this disclosure. It can be understood that when there are more than 5 encoders and decoders, the model running speed may be reduced and the model size increased, whereas when there are fewer than 5 encoders and decoders, the feature extraction from the input image is not deep enough, so that deep-level feature information is ignored and the model prediction effect becomes worse. Therefore, this disclosure is described using an example in which the deep network model includes 5 encoders and 5 decoders.
In some examples, each input can be a single image to be processed, which may be, for example, a color image. First, the training image can be input into encoder 0, which can include two convolutional layers, each with a 3*3 convolution kernel. After the training image is convolved by the two convolutional layers, depth features with a channel dimension of 16 can be output. As can be seen from Figure 4, the depth feature data output by encoder 0 can be input to the next encoder, that is, encoder 1, and can also be input to the corresponding decoder 0. It can be understood that when there are multiple encoders, the first encoder may not have depthwise separable convolutions while the remaining encoders do; of course, in some examples all encoders may have depthwise separable convolutions, which can be adjusted according to the actual situation and is not specifically limited in this disclosure.
Encoder 1 may include a downsampling layer and a depthwise separable convolutional layer. The downsampling layer may use 2D max pooling (MaxPool2D) for downsampling, where 2D refers to the width and height dimensions of the image during pooling. In some examples, pooling may be performed over a 2×2 window. Encoder 1 takes the output of encoder 0 as input; after MaxPool2D downsampling and feature extraction by the depthwise separable convolutional layer, it may output depth features with a channel dimension of 32. The depth feature data output by encoder 1 may be input to the next encoder, i.e., encoder 2, and may also be input to the corresponding decoder 1. Encoder 2, encoder 3 and encoder 4 are similar to encoder 1. For example, encoder 2 may include a MaxPool2D downsampling layer and a depthwise separable convolutional layer; it takes the output of encoder 1 as input and may output depth features with a channel dimension of 64, which may be input to encoder 3 and to the corresponding decoder 2. Encoder 3 may likewise include a MaxPool2D downsampling layer and a depthwise separable convolutional layer; it takes the output of encoder 2 as input and may output depth features with a channel dimension of 128, which may be input to encoder 4 and to the corresponding decoder 3. Encoder 4 may include a MaxPool2D downsampling layer and a depthwise separable convolutional layer; it takes the output of encoder 3 as input and may output depth features with a channel dimension of 192, which may be input to the semantic perception unit (EASPP) and to the corresponding decoder 4. It can be understood that the channel dimension of the depth feature data output by each encoder may be raised by a conventional convolutional layer contained in that encoder. The specific number of dimensions can be adjusted arbitrarily according to the actual situation, and is not limited in this disclosure.
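Purely as an illustration, the encoder stages described above could be sketched in PyTorch roughly as follows. The module names (EncoderStem, EncoderBlock, DepthwiseSeparableConv), the BatchNorm/ReLU placement and anything beyond the stated kernel sizes, pooling and channel dimensions (16, 32, 64, 128, 192) are assumptions for readability, not details taken from this disclosure.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """3x3 depthwise convolution followed by a 1x1 pointwise convolution."""
    def __init__(self, in_ch, out_ch, dilation=1):
        super().__init__()
        self.depthwise = nn.Conv2d(in_ch, in_ch, 3, padding=dilation,
                                   dilation=dilation, groups=in_ch, bias=False)
        self.pointwise = nn.Conv2d(in_ch, out_ch, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_ch)   # normalization choice is an assumption
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))

class EncoderStem(nn.Module):
    """Encoder 0: two plain 3x3 convolutions, output channel dimension 16."""
    def __init__(self, in_ch=3, out_ch=16):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.conv(x)

class EncoderBlock(nn.Module):
    """Encoders 1-4: MaxPool2D downsampling + depthwise separable convolution."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.pool = nn.MaxPool2d(kernel_size=2)
        self.conv = DepthwiseSeparableConv(in_ch, out_ch)

    def forward(self, x):
        return self.conv(self.pool(x))
```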
It can be understood that, in some examples, no matter how many encoders are used, the depth feature data output by the last encoder will be input to the semantic perception unit.
The semantic perception unit may perform semantic extraction on the output of the last encoder and output depth semantic data containing rich semantic information. Figure 5 shows one possible structure of the semantic perception unit. As can be seen, the semantic perception unit may include at least one depthwise separable convolution and a fusion layer, where each depthwise separable convolution has a dilation coefficient. The dilation coefficients of different depthwise separable convolutions in the semantic perception unit may be the same or different; the specific dilation coefficient of each depthwise separable convolution may be adjusted arbitrarily according to the actual situation, and is not limited in this disclosure.
In some examples, five depthwise separable convolutions may be included. Depthwise separable convolution 1 takes the depth feature data output by the last encoder (i.e., encoder 4 in Figure 4) as input; its dilation coefficient may be 3. After performing semantic extraction on that data, it passes the resulting depth semantic data to the next depthwise separable convolution, i.e., depthwise separable convolution 2, and may also pass it to the fusion layer. Depthwise separable convolution 2 takes as input the fusion of the output of depthwise separable convolution 1 with the depth feature data output by the last encoder; its dilation coefficient may be 6. After semantic extraction on the fused data, it passes the resulting depth semantic data to depthwise separable convolution 3 and may also pass it to the fusion layer. Depthwise separable convolution 3 takes both the output of depthwise separable convolution 2 and the input of depthwise separable convolution 2 as input; its dilation coefficient may be 12. After semantic extraction, it passes the resulting depth semantic data to depthwise separable convolution 4 and may also pass it to the fusion layer. Depthwise separable convolutions 4 and 5 are similar to depthwise separable convolution 3. Depthwise separable convolution 4 takes both the output of depthwise separable convolution 3 and the input of depthwise separable convolution 2 as input; its dilation coefficient may be 18. After semantic extraction, it passes the resulting depth semantic data to depthwise separable convolution 5 and may also pass it to the fusion layer. Depthwise separable convolution 5 takes both the output of depthwise separable convolution 4 and the input of depthwise separable convolution 2 as input; its dilation coefficient may be 24. After semantic extraction, it inputs the resulting depth semantic data to the fusion layer. The fusion layer fuses the outputs of all the depthwise separable convolutions together with the depth feature data output by the last encoder (i.e., encoder 4 in Figure 4) to determine the final depth semantic data, which serves as the output of the semantic perception unit. The semantic perception unit inputs the final depth semantic data into the decoder corresponding to the last encoder, for example decoder 4 in Figure 4. It can be understood that the input of each depthwise separable convolution in the semantic perception unit other than the first one includes not only the output of the previous depthwise separable convolution but also the output of the first depthwise separable convolution or the output of the last encoder, which ensures that richer semantic information can be extracted. The fusion layer in Figure 5 may represent a concatenation (concat) operation over the channel dimension of the output data of each depthwise separable convolution. It can be understood that this concat denotes a concat operation on the same channel dimension; that is, the data processing in the semantic perception unit does not change the channel dimension of the output.
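Continuing the sketch above (and reusing the DepthwiseSeparableConv class from it), the semantic perception unit could look roughly as follows. The exact skip pattern between branches, the concatenation widths and the final 1*1 projection are simplified assumptions; only the dilation coefficients 3, 6, 12, 18, 24 and the concat-based fusion come from the description.

```python
import torch
import torch.nn as nn

class EASPP(nn.Module):
    """Rough sketch of the semantic perception unit: a chain of dilated
    depthwise separable convolutions whose outputs are fused by channel concat."""
    def __init__(self, channels=192, dilations=(3, 6, 12, 18, 24)):
        super().__init__()
        self.branches = nn.ModuleList()
        in_ch = channels
        for d in dilations:
            # branch 0 sees only the encoder output; later branches also see
            # the previous branch output (concatenated along channels)
            self.branches.append(DepthwiseSeparableConv(in_ch, channels, dilation=d))
            in_ch = 2 * channels
        # fusion: concat all branch outputs with the encoder features, project back
        self.fuse = nn.Conv2d(channels * (len(dilations) + 1), channels, 1)

    def forward(self, x):
        outs, prev = [], None
        for i, branch in enumerate(self.branches):
            inp = x if i == 0 else torch.cat([prev, x], dim=1)
            prev = branch(inp)
            outs.append(prev)
        return self.fuse(torch.cat(outs + [x], dim=1))
```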
Returning to Figure 4, decoder 4 decodes based on the output of the semantic perception unit and the output of encoder 4. Decoder 4 may be configured with a stride of 8, where stride is a hyperparameter of decoder 4. Decoder 4 outputs a feature map with stride=8 and a channel dimension of 128, and transmits this feature map to the next decoder, i.e., decoder 3. Decoder 3 decodes based on the output of decoder 4 and the output of encoder 3; its stride may be set to 4, and it outputs a feature map with stride=4 and a channel dimension of 64, which is transmitted to decoder 2. Decoder 2 decodes based on the output of decoder 3 and the output of encoder 2; its stride may be set to 2, and it outputs a feature map with stride=2 and a channel dimension of 32, which is transmitted to decoder 1. Decoder 1 decodes based on the output of decoder 2 and the output of encoder 1; its stride may be set to 1, and it outputs a feature map with stride=1 and a channel dimension of 16, which is transmitted to the next decoder, i.e., decoder 0.
Decoder 0 decodes based on the output of decoder 1 and the output of encoder 0, and outputs the depth estimation result for the input training image, i.e., the depth estimation data corresponding to the training image. The depth estimation data may be a depth map with a channel dimension of 1.
It can be understood that depthwise separable convolutions may also be used in the decoders, corresponding to the encoders. Therefore, in some examples, the last decoder (i.e., the decoder corresponding to the first encoder) may also not contain depthwise separable convolutions. The structure of a decoder may be similar to that of an encoder and may be used to perform the reverse operation of the encoder.
The last decoder in the deep network model (i.e., decoder 0 in Figure 4) can then output the image depth of the training image.
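A decoder stage could be sketched as below (again reusing DepthwiseSeparableConv): each decoder upsamples the feature map from the previous decoder, fuses it with the skip feature from the corresponding encoder, and refines it. The bilinear upsampling, concatenation-based fusion and sigmoid output head are assumptions; the description only states that the decoder mirrors the encoder and outputs a single-channel depth map.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DecoderBlock(nn.Module):
    """Sketch of one decoder stage: upsample, fuse with the encoder skip
    feature, then refine with a depthwise separable convolution."""
    def __init__(self, in_ch, skip_ch, out_ch):
        super().__init__()
        self.refine = DepthwiseSeparableConv(in_ch + skip_ch, out_ch)

    def forward(self, x, skip):
        x = F.interpolate(x, size=skip.shape[-2:], mode="bilinear",
                          align_corners=False)
        return self.refine(torch.cat([x, skip], dim=1))

class DepthHead(nn.Module):
    """Final projection to a single-channel depth map."""
    def __init__(self, in_ch=16):
        super().__init__()
        self.proj = nn.Conv2d(in_ch, 1, kernel_size=3, padding=1)

    def forward(self, x):
        return torch.sigmoid(self.proj(x))  # p in [0, 1], later scaled to depth
```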
In some examples, for the deep network model in the training stage, i.e., the initial depth model, the input may be a training image and the output an estimated depth. It can be understood that the estimated depth represents the image depth of the training image output after the training image passes through the initial depth model. In the training stage, each training image may have a corresponding label depth, which represents the true image depth of that training image. During each training step, the initial depth model may compute a loss function based on the estimated depth and the label depth, and adjust the parameters of the initial depth model based on this loss function. For example, stochastic gradient descent (SGD) may be used for gradient backpropagation to update the parameters of the initial depth model.
When the loss function converges, the trained deep network model can be obtained. It can be understood that the data processing in the training stage is similar to that in the application stage; therefore, for the details of data processing in the training stage, reference may be made to the corresponding description of data processing in the application stage, which will not be repeated in this disclosure.
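A minimal training-step sketch under these assumptions (the model, data loader and loss names are placeholders, not APIs defined in this disclosure):

```python
import torch

def train(model, loader, target_loss_fn, epochs=10, lr=1e-2):
    """Sketch: SGD-based training of the initial depth model.
    target_loss_fn(estimated, label) stands for the target loss described
    in this disclosure; its exact composition is configured elsewhere."""
    optimizer = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        for image, label_depth in loader:
            estimated = model(image)                  # forward pass
            loss = target_loss_fn(estimated, label_depth)
            optimizer.zero_grad()
            loss.backward()                           # gradient backpropagation
            optimizer.step()                          # update parameters
    return model
```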
Because the deep network model in this application uses depthwise separable convolutions, it can be adapted to terminal-device scenarios and deployed directly on terminal devices, so that the terminal device can perform depth estimation on images based on the trained deep network model.
The above process will be described in more detail below with reference to the accompanying drawings.
Figure 6 is a flow chart of an image depth prediction method according to an exemplary embodiment. As shown in Figure 6, the method may run on a terminal device. A terminal device may also be called a terminal, user equipment (User Equipment, UE), mobile station (Mobile Station, MS), mobile terminal (Mobile Terminal, MT), etc., and is a device that provides voice and/or data connectivity to users. For example, the terminal device may be a handheld device with a wireless connection function, a vehicle-mounted device, etc. At present, examples of terminal devices include smartphones, pocket personal computers (Pocket Personal Computer, PPC), palmtop computers, personal digital assistants (Personal Digital Assistant, PDA), notebook computers, tablet computers, wearable devices, or vehicle-mounted devices. In addition, in a vehicle-to-everything (V2X) communication system, the terminal device may also be a vehicle-mounted device. It should be understood that the embodiments of the present disclosure do not limit the specific technology and specific device form adopted by the terminal device.
It can be understood that the deep network model involved in this method may adopt the network structure described above with reference to Figure 4 and Figure 5. In some examples, the present disclosure can use a monocular vision system to implement a 2D image depth estimation task, whose input is only a single color picture and whose output is a depth map represented by grayscale values. In some examples, the method can also be extended to tasks such as computational photography and autonomous driving situation awareness.
Therefore, the method may include the following steps:
In step S11, the image to be processed is obtained.
In some examples, the terminal device may obtain the image to be processed whose depth needs to be predicted. The image to be processed may be obtained from another device over a network, may be captured by the terminal device, or may be pre-stored on the terminal device, which is not limited by this disclosure.
In some examples, the network may be implemented using Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Time Division Multiple Access (TDMA), Frequency Division Multiple Access (FDMA), Orthogonal Frequency-Division Multiple Access (OFDMA), Single Carrier FDMA (SC-FDMA), Carrier Sense Multiple Access with Collision Avoidance, or other schemes. According to factors such as capacity, rate and latency, networks may be classified into 2G (Generation) networks, 3G networks, 4G networks, or future evolved networks such as the fifth generation wireless communication system (The 5th Generation Wireless Communication System, 5G) network; a 5G network may also be called New Radio (NR).
In step S12, the image to be processed is input to the deep network model, and the image depth of the image to be processed is predicted.
In some examples, the image to be processed obtained in S11 may be input into the deep network model to obtain the predicted image depth of the image to be processed.
The deep network model is trained using a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter. The error weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth; the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth; the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image.
The deep network model adopted in the present disclosure has depthwise separable convolutions, which enables deployment of the deep network model on terminal devices, effectively saves running time, and avoids the situation where a large model is time-consuming to run on, or ill-suited to, a terminal device.
In some embodiments, for the deep network model involved in Figures 4 to 6, the training process may be, for example, as shown in Figure 7. Figure 7 is a flow chart of a deep network model training method according to an exemplary embodiment. The method may run on a service device. In some examples, the service device may be a server or a server cluster. Of course, in other examples, it may also be a server or server cluster running on a virtual machine.
Therefore, the method may include the following steps:
In step S21, a preconfigured training image set is obtained, where the training image set includes training images and the label depths corresponding to the training images.
In some examples, the service device may obtain a preconfigured training image set. The training image set may be pre-stored on the service device, or the training image set may be stored in a database and the service device obtains the training image set by connecting to the corresponding database. The training image set includes training images and the label depths corresponding to the training images.
In some examples, to generate the training image set, a large amount of dense depth data can be obtained through network data collection, optical flow estimation, binocular stereo matching, and prediction by a depth estimation teacher model. It can be understood that "dense" means that a corresponding label depth can be determined for every pixel in a training image through the above methods.
In some examples, the images in the training image set may be images that have been mask-processed for a set region. For example, the set region may be a background region of the image, such as a sky region or an ocean region. Taking the sky region as an example, training images involving outdoor scenes usually contain sky. The colors of parts of the sky, clouds, etc. may affect depth estimation, for example causing the depth of clouds to be estimated incorrectly. Therefore, a pre-trained sky segmentation model may be used to segment the sky region to obtain a sky mask S_mask. Afterwards, S_mask can be used to process images containing sky so as to annotate the sky region with the corresponding label depth; for example, the depth of the S_mask-processed region in the image may be set to the maximum value, indicating the farthest depth. Annotating the sky portion of the training images improves the accuracy of depth estimation for sky regions during depth prediction. As another example, when binocular stereo matching is used to obtain the training image set, some invalid regions may exist; therefore, a valid-region mask may be used to process the images obtained by binocular stereo matching to obtain processed images. V_mask may be used to denote the valid-region mask. After an image is processed with V_mask, the resulting image contains only the valid region, which effectively prevents pixels outside that region (i.e., pixels in invalid regions) from participating in the corresponding calculations in the subsequent training process, such as loss calculations. It can be understood that an image processed with V_mask may also be processed with S_mask; that is, the valid-region image may include an annotated background region. Afterwards, the masked images can be used as training images for subsequent training.
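Purely as an illustration, the mask handling could look like the following; the array layout and the use of 255 as the farthest depth value are assumptions consistent with the relative depth range used later in this description.

```python
import numpy as np

def apply_masks(label_depth, sky_mask=None, valid_mask=None, max_depth=255.0):
    """Sketch of the mask handling described above.
    sky_mask / valid_mask are boolean arrays of the same shape as label_depth."""
    depth = label_depth.astype(np.float32).copy()
    if sky_mask is not None:
        # sky pixels are annotated with the maximum value, i.e. the farthest depth
        depth[sky_mask] = max_depth
    if valid_mask is None:
        valid_mask = np.ones_like(depth, dtype=bool)
    # pixels outside the valid region are excluded from later loss calculations
    return depth, valid_mask
```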
In still other examples, since current depth prediction solutions usually focus on a relatively fixed scene, the performance of the trained model drops significantly when it leaves that scene, i.e., its generalization ability is poor. Therefore, the obtained training image set may include training images from arbitrary scenes, so that the generalization ability of the trained deep network model is significantly improved.
Therefore, in some examples, different training image sets can be divided according to different times and different collection methods, for example {Data_1, Data_2, ..., Data_n}, where Data_n represents the n-th training image set. Afterwards, different sampling weights can be set for different training image sets based on the amount of data each set contains, for example as shown in Formula 1:
p_j = N(Data_j) / (N(Data_1) + N(Data_2) + ... + N(Data_n))      ......Formula 1
where p_j represents the sampling weight of the j-th training image set during training, and N() represents the count of samples in a training image set, which can be understood as the amount of data in that set.
In some examples, training images and their corresponding label depths can be obtained from the corresponding training image sets based on the sampling weights calculated by Formula 1. This ensures that the training process is not biased towards a certain type of training image, so that the trained deep network model has strong generalization ability and can be applied to depth estimation for images of many different scenes.
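A sketch of such weighted sampling across sets is given below; the proportional form of the weights follows the reconstruction of Formula 1 above and should be treated as an assumption.

```python
import random

def sampling_weights(datasets):
    """datasets: list of training image sets (each a list of samples).
    Returns one sampling weight per set, proportional to its size (Formula 1)."""
    counts = [len(d) for d in datasets]
    total = sum(counts)
    return [c / total for c in counts]

def sample_training_image(datasets, weights):
    """Pick a set according to its weight, then a training sample from that set."""
    chosen_set = random.choices(datasets, weights=weights, k=1)[0]
    return random.choice(chosen_set)
```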
In some examples, the images in the training image set may also be varied through data augmentation methods such as horizontal flipping, random cropping and color changes, thereby expanding the amount of data in the data set to meet training needs.
In step S22, the training image is input to the initial depth model, and the estimated depth corresponding to the training image is determined.
In some examples, the service device may input the training images of the training image set obtained in S21 into an initial depth model composed of depthwise separable convolutions for training, and obtain the estimated depths corresponding to the training images. In some examples, the training images obtained from the various training image sets based on the sampling weights may be input into the initial depth model sequentially for training.
In some examples, for a training image input to the initial depth model, the output result may be denoted as p. Afterwards, p can be range-clipped and then multiplied by 255 to obtain the estimated depth corresponding to the training image. The estimated depth may be written as D_pred = clip(p, 0, 1)*255, where clip() denotes clipping. It can be understood that the output result p is usually a value between 0 and 1; in order to observe the output depth more conveniently, the relative depth result D of each pixel can be obtained by clipping and then multiplying by 255. This relative depth result D can then be used as the estimated depth. The specific calculation process can be implemented with reference to existing methods and will not be repeated in this disclosure.
In step S23, a loss function is used to adjust the initial depth model based on the estimated depth and the label depth.
In some examples, the service device may calculate the loss function based on the estimated depth obtained in S22 and the label depth corresponding to the respective training image, and adjust the initial depth model based on the calculated loss function.
In some examples, SGD may be used to complete the adjustment of the initial depth model, i.e., to update the corresponding parameters in the initial depth model. It can be understood that the hyperparameters in the model are not adjusted or updated during the training process.
In step S24, training continues until the loss function converges, and the trained deep network model is obtained.
In some examples, when the loss function in S23 converges, the trained deep network model can be obtained. It can be understood that, since the deep network model is composed of depthwise separable convolutions, it is suitable for deployment on terminal devices, so that a terminal device can perform depth estimation on images based on the deep network model.
The deep network model obtained by training in the present disclosure has depthwise separable convolutions, which enables deployment of the deep network model on terminal devices, effectively saves running time, and avoids the situation where a large model is time-consuming to run on, or ill-suited to, a terminal device.
In some embodiments, for example as shown in Figure 8, the loss function involved in Figures 6 and 7 can be determined through the following steps:
In step S31, the target loss function is determined based on at least one of a first loss function, a second loss function, and a third loss function.
In some examples, the target loss function used to adjust the initial depth model may be determined based on at least one of the first loss function, the second loss function and the third loss function, where the first loss function may be determined based on the error weight parameter, the second loss function may be determined based on the depth gradient error, and the third loss function may be determined based on the depth structure loss parameter.
The present disclosure can combine multiple loss functions for training the deep network model, thereby ensuring that the trained deep network model can identify image depth more accurately.
In some embodiments, as shown in Figure 9, the first loss function involved in S31 can be determined through the following steps:
In step S41, the absolute value of the error is determined based on the estimated depth and the label depth.
In some examples, the error absolute value can be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the respective training image. For example, it can be written as abs(D_pred - D_target), where abs() denotes the absolute value, D_pred denotes the estimated depth predicted by the initial depth model, and D_target denotes the label depth corresponding to the respective training image.
In step S42, the first loss function is determined based on the error absolute value and the error weight parameter.
In some examples, the first loss function may be determined based on the error absolute value determined in S41 and the error weight parameter W.
The error weight parameter W may be determined based on the estimated depth and the label depth. For example, W may be determined based on the estimated depth predicted by the initial depth model and the label depth corresponding to the respective training image, and may be calculated by Formula 2:
W = pow(D_pred - D_target, 2)      ......Formula 2
where pow(D_pred - D_target, 2) denotes raising D_pred - D_target to the power of 2. It can be understood that this weight imposes a larger loss weight on regions where the prediction error deviation is relatively large; therefore, Formula 2 can also be considered to express a focal property.
Therefore, in some examples, once the error absolute value and the error weight parameter W have been obtained, the first loss function can be calculated through Formula 3 and Formula 4, for example:
L_focal-L1 = α·W·L_1 + β·L_1      ......Formula 3
L_1 = V_mask·abs(D_pred - D_target)      ......Formula 4
where L_focal-L1 denotes the first loss function, which may also be called the focal-L_1 loss function, and α and β are preset weight coefficients. As can be seen, the L_1 loss function can be calculated using the error absolute value, and W·L_1 denotes the L_1 loss function with the focal property. In some examples, when calculating the L_1 loss function, the calculation may be based on a valid-region mask or a sky mask; for example, V_mask indicates whether each pixel in the valid region is valid, where 0 and 1 may be used to distinguish invalid from valid, and abs(D_pred - D_target) denotes the error absolute value of each pixel in the same region. Of course, in some examples, the calculation may also be performed pixel by pixel, which is not limited by this disclosure.
The specific calculation of the L_1 loss function can be implemented with reference to existing methods and will not be repeated in this disclosure.
The present disclosure introduces a weight property when determining the first loss function, so that positions with larger prediction error deviations are given a higher loss weight, allowing the initial depth model to be trained better and making the recognition results of the trained deep network model more accurate.
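A sketch of the focal-L_1 loss is shown below; the combination alpha*W*L_1 + beta*L_1 follows the reconstruction of Formula 3 above and is an assumption, since only W, L_1, alpha and beta themselves are named in the description.

```python
import torch

def focal_l1_loss(d_pred, d_target, valid_mask, alpha=1.0, beta=1.0):
    """Sketch of the first (focal-L1) loss.
    valid_mask: 1 for valid pixels, 0 for invalid pixels (V_mask / S_mask)."""
    abs_err = torch.abs(d_pred - d_target)
    w = (d_pred - d_target) ** 2            # Formula 2: focal weight
    l1 = valid_mask * abs_err               # Formula 4: masked L1 term
    # assumed combination of the weighted and unweighted terms (Formula 3)
    loss = alpha * w * l1 + beta * l1
    return loss.sum() / valid_mask.sum().clamp(min=1)
```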
In order to ensure that the depth results obtained by the depth estimation scheme have clearer contours, less noise and better expressiveness, and to satisfy scenarios in which downstream tasks have high requirements on depth detail (such as computational photography), the present disclosure may also adjust the initial depth model through one or more of the following loss functions, for example the second loss function and the third loss function.
In some embodiments, the label depth may include label depth gradients in at least two directions. For example, as shown in Figure 10, the second loss function involved in S31 can be determined through the following steps:
In step S51, estimated depth gradients in at least two directions are determined based on the estimated depth and a preset gradient function.
In some examples, the estimated depth predicted by the initial depth model can be substituted into a preset gradient function to determine estimated depth gradients in at least two directions. In some examples, two different directions are typically selected; of course, in other examples, more or fewer directions may be selected, which is not limited by this disclosure.
In one example, the directions may be the x and y directions. The estimated depth gradients in the x and y directions can then be calculated through Formula 5 and Formula 6:
D_x = D ⊗ S_x      ......Formula 5
D_y = D ⊗ S_y      ......Formula 6
It can be understood that D may be D_pred; the gradient in the x direction obtained through Formula 5 may then be denoted D_x^pred, and the gradient in the y direction obtained through Formula 6 may be denoted D_y^pred. Here, ⊗ denotes convolving the data with the Sobel operator, and S_x and S_y are the matrices commonly used by the Sobel operator, for example S_x = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]] and S_y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]. Of course, the matrices may be adjusted according to the actual situation so as to obtain gradient data in more directions.
It can be understood that the gradients involved in Figure 10 have a different meaning from the gradients used by SGD during training: the gradients involved in SGD refer to the gradient of the change of the loss function, whereas the gradients involved in Figure 10 refer to the gradient of the depth change across different regions of the training image.
In step S52, label depth gradients in at least two directions are determined based on the label depth and the gradient function.
In some examples, the label depth may include label depth gradients in at least two directions, where the at least two directions involved in the label depth gradients are the same as the at least two directions involved in the estimated depth gradients. Of course, in other examples, the label depth gradient in the x direction, denoted D_x^target, and the label depth gradient in the y direction, denoted D_y^target, can also be calculated based on Formula 5 and Formula 6, respectively.
In step S53, depth gradient errors in at least two directions are determined based on the estimated depth gradients in at least two directions and the label depth gradients in at least two directions.
In some examples, the depth gradient errors in at least two directions can be determined based on the estimated depth gradients in at least two directions determined in S51 and the label depth gradients in at least two directions determined in S52. For example, in the x direction, the depth gradient error can be calculated from D_x^pred and D_x^target and may be written as G_x = D_x^pred - D_x^target. Similarly, in the y direction, the depth gradient error can be calculated from D_y^pred and D_y^target and may be written as G_y = D_y^pred - D_y^target.
In step S54, a second loss function is determined based on the depth gradient errors in at least two directions.
In some examples, the second loss function may be determined based on the depth gradient errors in the at least two directions determined in S53. The second loss function may also be called the gradient loss function.
In some examples, the second loss function can be calculated, for example, through Formula 7:
L_grad = V_mask·(abs(G_x) + abs(G_y))      ......Formula 7
where L_grad denotes the second loss function. In some examples, L_grad may be calculated based on a valid-region mask or a sky mask, that is, by combining whether each pixel within the valid-region mask or sky mask is valid with the absolute values of the gradient errors of all pixels in the x direction and in the y direction within that mask. Of course, in some examples, the calculation may also be performed pixel by pixel, which is not limited by this disclosure.
The present disclosure further introduces a gradient loss, so that the trained deep network model can identify the gradient of the image more accurately and thus better identify the boundaries between different depths.
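A sketch of the gradient loss with Sobel filtering is shown below; the masked mean reduction follows the reconstruction of Formula 7 above and is an assumption.

```python
import torch
import torch.nn.functional as F

SOBEL_X = torch.tensor([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]]).view(1, 1, 3, 3)
SOBEL_Y = torch.tensor([[-1., -2., -1.], [0., 0., 0.], [1., 2., 1.]]).view(1, 1, 3, 3)

def depth_gradients(d):
    """d: depth map of shape (N, 1, H, W). Returns Sobel gradients in x and y."""
    gx = F.conv2d(d, SOBEL_X.to(d), padding=1)
    gy = F.conv2d(d, SOBEL_Y.to(d), padding=1)
    return gx, gy

def gradient_loss(d_pred, d_target, valid_mask):
    """Sketch of the second (gradient) loss: masked absolute difference of
    the Sobel gradients of the estimated and label depths (Formulas 5-7)."""
    gx_p, gy_p = depth_gradients(d_pred)
    gx_t, gy_t = depth_gradients(d_target)
    err = torch.abs(gx_p - gx_t) + torch.abs(gy_p - gy_t)
    return (valid_mask * err).sum() / valid_mask.sum().clamp(min=1)
```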
In some embodiments, for example as shown in Figure 11, the third loss function involved in S31 can be determined through the following steps:
In step S61, multiple training data groups are determined for the training image.
In some examples, the service device may determine multiple training data groups for each training image, where each training data group may be composed of at least two pixels of the training image. Accordingly, the label depth data of the training image may include the depth label data corresponding to those pixels.
In some embodiments, determining the multiple training data groups may include: determining the gradient boundary of the training image using a preset gradient function, and then sampling on the training image according to the gradient boundary to determine the multiple training data groups. It can be understood that the process of determining multiple training data groups may be called pair sampling.
In some examples, pair sampling based on gradient boundaries may also be called depth structure sampling. For example, the gradient of the training image can be calculated based on the Sobel operator to determine which regions have large gradient changes. A gradient threshold can be set in advance; when the gradient difference between different positions is greater than or equal to the gradient threshold, the depth gradient can be considered to change significantly, and the corresponding gradient boundary can be determined. It can be understood that, for the second loss function described with reference to Figure 10, the accuracy of this gradient boundary calculation can be adjusted effectively. For example, as shown in Figure 11, assuming that the depth map corresponding to the training image is the depth map shown in Figure 12, a schematic diagram of pair sampling based on the gradient boundary can be as shown in Figure 13. It can be seen that the pixels sampled by depth structure sampling are located around the gradient boundary. In some examples, each training data group may be composed of two pixels of the training image: one pixel may be collected on one side of the gradient boundary and another pixel on the gradient boundary itself to form a training data group, or one pixel may be collected on one side of the gradient boundary and another pixel on the other side of the gradient boundary to form a training data group. In this way, multiple training data groups can be determined.
The present disclosure determines the training data groups according to the gradient boundary, which allows different depths to be distinguished more accurately in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.
In some embodiments, determining the multiple training data groups may also include: performing random sampling on the training image to determine multiple training data groups.
In some examples, in order to avoid training too much on the gradient boundary portion during the training process while neglecting portions where the gradient changes slowly, the present disclosure may also perform random pair sampling, i.e., random sampling, on the training image to determine multiple training data groups. For example, as shown in Figure 14, the multiple training data groups obtained by random sampling are evenly distributed, which effectively preserves the pixels corresponding to regions with slowly changing gradients. Determining training data groups in this way allows different depths to be distinguished more accurately in the subsequent training process, thereby making the recognition results of the trained deep network model more accurate.
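The two pair-sampling strategies could be sketched as follows; the gradient threshold, the neighborhood from which the second pixel is drawn and the use of a simple finite-difference gradient in place of the Sobel operator are assumptions for illustration.

```python
import numpy as np

def boundary_pairs(label_depth, grad_threshold=10.0, num_pairs=1000, rng=None):
    """Depth structure sampling sketch: sample pixel pairs around positions
    where the gradient magnitude of the label depth exceeds a threshold."""
    rng = rng or np.random.default_rng()
    gy, gx = np.gradient(label_depth.astype(np.float32))  # finite-difference stand-in
    boundary = np.argwhere(np.hypot(gx, gy) >= grad_threshold)
    pairs = []
    if len(boundary) == 0:
        return pairs
    h, w = label_depth.shape
    picks = rng.choice(len(boundary), size=min(num_pairs, len(boundary)), replace=False)
    for (y, x) in boundary[picks]:
        # second pixel taken from a small neighborhood near the boundary pixel
        y2 = np.clip(y + rng.integers(-5, 6), 0, h - 1)
        x2 = np.clip(x + rng.integers(-5, 6), 0, w - 1)
        pairs.append(((y, x), (y2, x2)))
    return pairs

def random_pairs(label_depth, num_pairs=1000, rng=None):
    """Random pair sampling sketch: uniformly distributed pixel pairs."""
    rng = rng or np.random.default_rng()
    h, w = label_depth.shape
    ys = rng.integers(0, h, size=(num_pairs, 2))
    xs = rng.integers(0, w, size=(num_pairs, 2))
    return [((ys[i, 0], xs[i, 0]), (ys[i, 1], xs[i, 1])) for i in range(num_pairs)]
```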
Returning to Figure 11, after S61 is executed, the following steps can be continued.
In step S62, for each training data group, a depth structure loss parameter is determined based on the label depths corresponding to the at least two pixels.
In some examples, for each training data group, the service device may determine the depth structure loss parameter based on the label depths corresponding to the at least two pixels in that training data group. In some examples, if the training data group includes two pixels, the depth structure loss parameter can be determined based on the label depths corresponding to those two pixels.
The depth structure loss parameter may be denoted ρ and may be calculated from the label depths of the two pixels in the training data group through Formula 8, for example:
ρ = sign(D_target(a) - D_target(b))      ......Formula 8
where a and b respectively denote the two pixels in the training data group.
In step S63, for each training data group, a third loss function is determined based on the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
In some examples, the service device may determine the third loss function based on the depth structure loss parameter ρ determined in S62 and the estimated depths corresponding to the at least two pixels included in the training data group. The third loss function may also be called the depth map structure sampling loss function. For example, the third loss function may be calculated through Formula 9:
L_pair = log(1 + e^(-ρ·(D_pred(a) - D_pred(b))))      ......Formula 9
where L_pair denotes the third loss function and e^(-ρ·(D_pred(a) - D_pred(b))) denotes e raised to the power -ρ·(D_pred(a) - D_pred(b)). As can be seen, in some examples, the third loss function uses a ranking loss for the loss calculation.
The present disclosure may also determine the third loss function through training groups of pixel pairs, so that in the subsequent training process different depth regions in the image can be sampled more accurately, making the recognition results of the trained deep network model more accurate.
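A sketch of such a pairwise ranking loss is given below; the sign-based ordinal label and the log(1 + exp(.)) form follow the reconstructions of Formula 8 and Formula 9 above and are assumptions, not formulas quoted from the original publication.

```python
import torch

def pair_ranking_loss(d_pred, d_target, pairs):
    """Sketch of the third (depth structure sampling) loss over pixel pairs.
    pairs: list of ((ya, xa), (yb, xb)) pixel coordinate pairs."""
    losses = []
    for (ya, xa), (yb, xb) in pairs:
        rho = torch.sign(d_target[ya, xa] - d_target[yb, xb])   # assumed Formula 8
        diff = d_pred[ya, xa] - d_pred[yb, xb]
        # assumed ranking form of Formula 9
        losses.append(torch.log1p(torch.exp(-rho * diff)))
    return torch.stack(losses).mean()
```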
In some examples, the target loss function can be determined based on at least one of the above first loss function, second loss function and third loss function. For example, Formula 10 gives one way of determining the target loss function:
L_total = λ_1·L_focal-L1 + λ_2·L_grad + λ_3·L_pair      ......Formula 10
where L_total denotes the target loss function, the parameter λ_1 is the weighting coefficient of the first loss function, λ_2 is the weighting coefficient of the second loss function, and λ_3 is the weighting coefficient of the third loss function. It can be understood that the specific values of λ_1, λ_2 and λ_3 can be set according to the actual situation, and are not limited by this disclosure. Clearly, by adjusting the specific values of λ_1, λ_2 and λ_3, one or more of the first loss function, the second loss function and the third loss function can be used to obtain the target loss function for training the deep network model.
It can be understood that the first loss function, the second loss function and the third loss function adjust the model for different purposes. The more loss functions are used when determining the target loss function, the higher the recognition accuracy of the deep network model trained with that target loss function.
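Combining the loss sketches above, the target loss of Formula 10 could be assembled as follows (the lambda values are arbitrary placeholders):

```python
def target_loss(l_focal_l1, l_grad, l_pair, lambda1=1.0, lambda2=0.5, lambda3=0.5):
    """Formula 10 sketch: weighted combination of the three loss terms.
    The lambda defaults here are placeholders, not values from the disclosure."""
    return lambda1 * l_focal_l1 + lambda2 * l_grad + lambda3 * l_pair
```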
In some embodiments, for the input image fed into the deep network model (the input image may be, for example, the image to be processed or a training image), the processing may include: normalizing the input image to obtain a normalized input image, and then inputting the normalized input image into the deep network model.
In some examples, the normalization of the input image may be performed using the mean and variance of the color channels of the input image, for example as shown in Formula 11:
I_norm = (I - m)/v      ......Formula 11
where m denotes the mean, v denotes the variance, I denotes the input image, and I_norm denotes the normalized input image. In some examples, the color channels of I may be arranged as BGR, where B denotes the blue channel, G the green channel and R the red channel. In some examples, m may take the value (0.485, 0.456, 0.506) and v may take the value (0.229, 0.224, 0.225). It can be understood that the above m and v are only an exemplary description of one set of values; in other examples, any values may be taken according to the actual situation, and this disclosure does not limit them.
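For example, with numpy; the BGR ordering and the example values of m and v come from the description above, while the assumption that pixel values are already scaled to [0, 1] is added for illustration:

```python
import numpy as np

def normalize_image(image_bgr,
                    m=(0.485, 0.456, 0.506),
                    v=(0.229, 0.224, 0.225)):
    """Sketch of Formula 11: per-channel normalization of a BGR image
    whose pixel values are assumed to already lie in [0, 1]."""
    img = image_bgr.astype(np.float32)
    return (img - np.array(m, dtype=np.float32)) / np.array(v, dtype=np.float32)
```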
In some examples, in the training stage, the label depths corresponding to the training images may also be normalized. Since the training image sets come from a wide range of sources, the scale and shift of the label depths corresponding to training images in different training image sets may be inconsistent; therefore, the label depths corresponding to the training images can be normalized. Of course, it can be understood that the estimated depth predicted by the initial depth model may also be normalized in this way.
For example, the median of the label depth can be calculated through Formula 12:
D_t = median(D)      ......Formula 12
where D_t denotes the median of D, and D denotes the label depth corresponding to each pixel in the valid region.
Then, the mean of the label depth after removing the median can be determined through Formula 13:
D_s = (1/M)·Σ_{V_mask} abs(D - D_t)      ......Formula 13
where M denotes the number of valid pixels in the valid region V_mask. It can be understood that V_mask may also be replaced by S_mask.
Afterwards, the normalized label depth can be determined based on D_t determined by Formula 12 and D_s in Formula 13, as shown in Formula 14:
D_norm = (D - D_t)/D_s      ......Formula 14
The present disclosure may also normalize the images input to the model, which ensures that training can be completed in the same dimension during the training process, thereby making the recognition results of the trained deep network model more accurate.
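A sketch of this label-depth normalization under the reconstruction of Formulas 12 to 14 (the mean-absolute-deviation form of Formula 13 is an assumption):

```python
import numpy as np

def normalize_label_depth(label_depth, valid_mask, eps=1e-6):
    """Sketch of Formulas 12-14: shift by the median and scale by the mean
    absolute deviation, both computed over the valid region only."""
    d = label_depth[valid_mask]
    d_t = np.median(d)                        # Formula 12
    d_s = np.mean(np.abs(d - d_t))            # Formula 13 (assumed form)
    return (label_depth - d_t) / (d_s + eps)  # Formula 14
```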
In some embodiments, during the actual training process, color pictures in the jpg image format may be used as training pictures. Meanwhile, depth pictures in the png image format may be used as the depth label data corresponding to the training images, and the png picture format may also be used for the sky segmentation pictures. If training is performed based on pair training groups, then in some examples about 800,000 training data groups, i.e., pair groups, may be used.
The deep network model adopted in the present disclosure has depthwise separable convolutions, which enables deployment of the deep network model on terminal devices, effectively saves running time, and avoids the situation where a large model is time-consuming to run on, or ill-suited to, a terminal device.
It should be noted that, as those skilled in the art can understand, the various implementations/embodiments mentioned above in the embodiments of the present disclosure may be used in combination with the foregoing embodiments or may be used independently. Whether used alone or in combination with the foregoing embodiments, the implementation principles are similar. In the implementation of the present disclosure, some embodiments are described in terms of implementations used together. Of course, those skilled in the art can understand that such illustrations do not limit the embodiments of the present disclosure.
Based on the same concept, embodiments of the present disclosure also provide an image depth prediction apparatus.
It can be understood that, in order to implement the above functions, the image depth prediction apparatus provided by the embodiments of the present disclosure includes hardware structures and/or software modules corresponding to each function. In combination with the units and algorithm steps of each example disclosed in the embodiments of the present disclosure, the embodiments of the present disclosure can be implemented in the form of hardware or a combination of hardware and computer software. Whether a function is performed by hardware or by computer software driving hardware depends on the specific application and design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the technical solutions of the embodiments of the present disclosure.
图15是根据一示例性实施例示出的一种深度网络模型训练装置的示意图。参照图15,该装置100可以是,该装置100可以包括:获取模块101,用于获取待处理图像;预测模块102,用于将待处理图像输入至深度网络模型,预测待处理图像的图像深度,深度网络模型由多层深度可分离卷积构成;其中,深度网络模型采用目标损失函数训练得到,目标损失函数基于误差权重参数、深度梯度误差以及深度结构损失参数中的至少一项确定;误差权重参数用于表征估计深度和标签深度之间差异的权重,深度梯度误差用于表征估计深度和标签深度之间梯度差异,深度结构损失参数用于表征训练图像中不同位置对应的标签深度差异;估计深度为训练阶段深度网络模型基于训练图像确定的,标签深度与训练图像相对应。Figure 15 is a schematic diagram of a deep network model training device according to an exemplary embodiment. Referring to Figure 15, the device 100 may include: an acquisition module 101, used to acquire an image to be processed; a prediction module 102, used to input the image to be processed into a deep network model, and predict the image depth of the image to be processed. , the deep network model is composed of multi-layer depth separable convolutions; among them, the deep network model is trained using a target loss function, and the target loss function is determined based on at least one of the error weight parameter, depth gradient error, and depth structure loss parameter; the error The weight parameter is used to characterize the weight of the difference between the estimated depth and the label depth, the depth gradient error is used to characterize the gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize the difference in label depth corresponding to different positions in the training image; The estimated depth is determined by the deep network model in the training stage based on the training image, and the label depth corresponds to the training image.
The deep network model employed in the present disclosure is built from depthwise separable convolutions, which makes it possible to deploy the deep network model on a terminal device, effectively reduces the running time, and avoids the situation where a large model is slow to run on, or ill-suited to, a terminal device.
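By way of a non-limiting illustration only (this sketch is not part of the disclosed embodiments), one possible depthwise separable convolution block of the kind referred to above could be written in PyTorch as follows; the channel counts, kernel size, normalization, and activation shown are assumptions made for the example rather than features taken from the disclosure.

```python
import torch.nn as nn

class DepthwiseSeparableConv(nn.Module):
    """A depthwise convolution followed by a pointwise (1x1) convolution."""
    def __init__(self, in_channels, out_channels, kernel_size=3, stride=1):
        super().__init__()
        # Depthwise step: one filter per input channel (groups=in_channels).
        self.depthwise = nn.Conv2d(in_channels, in_channels, kernel_size,
                                   stride=stride, padding=kernel_size // 2,
                                   groups=in_channels, bias=False)
        # Pointwise step: a 1x1 convolution that mixes channels.
        self.pointwise = nn.Conv2d(in_channels, out_channels, 1, bias=False)
        self.bn = nn.BatchNorm2d(out_channels)
        self.act = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.act(self.bn(self.pointwise(self.depthwise(x))))
```

Stacking such blocks keeps the parameter count and computation far below those of standard convolutions with the same receptive field, which is what makes deployment on a terminal device feasible.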
In a possible implementation, the target loss function is determined as follows: the target loss function is determined according to at least one of a first loss function, a second loss function, and a third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
The present disclosure can combine multiple loss functions for training the deep network model, thereby ensuring that the trained deep network model can identify image depth more accurately.
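For illustration only, the assembly of a target loss from any subset of the individual loss functions might be sketched as follows; the function name, the callable interface, and the idea of a weighted sum are assumptions, since the disclosure only states that the target loss function is determined according to at least one of the three losses.

```python
def target_loss(pred_depth, gt_depth, loss_terms, weights):
    """Weighted sum over any subset of the individual loss functions.

    loss_terms: iterable of callables taking (pred_depth, gt_depth)
    weights:    iterable of scalar weights, one per term (assumed values)
    """
    return sum(w * term(pred_depth, gt_depth)
               for term, w in zip(loss_terms, weights))
```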
In a possible implementation, the first loss function is determined as follows: an absolute error between the estimated depth and the label depth is determined according to the estimated depth and the label depth; the first loss function is determined according to the absolute error and the error weight parameter.
The present disclosure introduces a weight attribute when determining the first loss function, so that positions with larger prediction errors are assigned a higher loss weight. The initial depth model can thereby be trained better, making the recognition results of the trained deep network model more accurate.
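A hedged, non-limiting sketch of such a first loss is given below; the specific weighting rule (scaling by the normalized error magnitude) is an assumption introduced for the example and not the formula of the disclosure.

```python
import torch

def weighted_abs_loss(pred_depth, gt_depth, eps=1e-6):
    """First-loss sketch: L1 error re-weighted so that positions with a
    larger deviation contribute more to the loss."""
    abs_err = torch.abs(pred_depth - gt_depth)
    # Hypothetical weight: errors above the mean receive a larger weight.
    weight = 1.0 + abs_err / (abs_err.mean() + eps)
    return (weight.detach() * abs_err).mean()
```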
In a possible implementation, the depth gradient error is determined as follows: estimated depth gradients in at least two directions are determined according to the estimated depth and a preset gradient function; label depth gradients in the at least two directions are determined according to the label depth and the gradient function; and depth gradient errors in the at least two directions are determined according to the estimated depth gradients in the at least two directions and the label depth gradients in the at least two directions. The second loss function is determined as follows: the second loss function is determined according to the depth gradient errors in the at least two directions.
The present disclosure further introduces a gradient loss, so that the trained deep network model can identify image gradients more accurately and can therefore better distinguish boundaries between regions of different depth.
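The second loss over two directions might look like the following sketch, assuming simple horizontal and vertical finite differences as the preset gradient function; the disclosure does not fix the gradient operator, so this choice is an assumption.

```python
import torch

def gradient_loss(pred_depth, gt_depth):
    """Second-loss sketch: compares depth gradients along two directions."""
    def grads(d):
        gx = d[..., :, 1:] - d[..., :, :-1]   # horizontal finite difference
        gy = d[..., 1:, :] - d[..., :-1, :]   # vertical finite difference
        return gx, gy

    pgx, pgy = grads(pred_depth)
    tgx, tgy = grads(gt_depth)
    return torch.abs(pgx - tgx).mean() + torch.abs(pgy - tgy).mean()
```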
In a possible implementation, the depth structure loss parameter is determined as follows: multiple training data groups are determined, wherein each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth includes training labels corresponding to the at least two pixels in the training image; for each training data group, the depth structure loss parameter is determined according to the label depths corresponding to the at least two pixels. The third loss function is determined as follows: for each training data group, the third loss function is determined according to the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
The present disclosure can also determine the third loss function using training groups formed of pixel pairs, so that in the subsequent training process regions of different depth in the image can be sampled more accurately, making the recognition results of the trained deep network model more accurate.
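One possible, non-limiting sketch of such a pair-based third loss is shown below; the ordinal (ranking) formulation, the ratio threshold, and the handling of roughly equal depths are assumptions made for the example.

```python
import torch
import torch.nn.functional as F

def structure_loss(pred_a, pred_b, gt_a, gt_b, ratio=1.02, eps=1e-6):
    """Third-loss sketch for a batch of sampled pixel pairs (a, b)."""
    # Depth structure parameter: +1 if a is clearly deeper than b,
    # -1 if b is clearly deeper than a, 0 if roughly equal.
    label = torch.zeros_like(gt_a)
    label[gt_a / (gt_b + eps) > ratio] = 1.0
    label[gt_b / (gt_a + eps) > ratio] = -1.0

    diff = pred_a - pred_b
    ranking = F.softplus(-label * diff)   # log(1 + exp(.)), penalizes wrong ordering
    equal = diff.abs()                    # pulls roughly-equal pairs together
    return torch.where(label != 0, ranking, equal).mean()
```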
In a possible implementation, determining the multiple training data groups includes: determining gradient boundaries of the training image using a preset gradient function; and sampling the training image according to the gradient boundaries to determine the multiple training data groups.
The present disclosure determines the training data groups according to gradient boundaries, which makes the distinction between different depths more accurate in the subsequent training process, and in turn makes the recognition results of the trained deep network model more accurate.
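A hedged sketch of sampling pixel pairs around the gradient boundaries is given below; the finite-difference gradient, the threshold, and the number of pairs are assumptions rather than values taken from the disclosure.

```python
import torch

def sample_pairs_near_edges(gt_depth, num_pairs=512, grad_thresh=0.05):
    """Pairs each sampled boundary pixel of the label depth with a random pixel."""
    d = gt_depth.squeeze()                         # (H, W) label depth map
    edge = torch.zeros_like(d, dtype=torch.bool)
    edge[:, :-1] |= (d[:, 1:] - d[:, :-1]).abs() > grad_thresh
    edge[:-1, :] |= (d[1:, :] - d[:-1, :]).abs() > grad_thresh

    h, w = d.shape
    rand_idx = torch.stack((torch.randint(0, h, (num_pairs,)),
                            torch.randint(0, w, (num_pairs,))), dim=1)
    edge_idx = edge.nonzero(as_tuple=False)
    if edge_idx.shape[0] == 0:                     # no strong boundary found
        edge_idx = rand_idx
    pick = torch.randint(0, edge_idx.shape[0], (num_pairs,))
    return edge_idx[pick], rand_idx                # (boundary pixel, random pixel)
```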
In a possible implementation, the training image is determined as follows: at least one training image set under arbitrary scenes is determined, the images in the training image set being images in which a set region has been masked; and the training image is determined from the at least one training image set according to preconfigured sampling weights.
By acquiring training images from multiple scenes, the present disclosure gives the trained deep network model stronger generalization ability, enabling depth prediction in arbitrary scenes. In addition, masking the set region makes the processed images more suitable for model training and improves the prediction accuracy of the trained model.
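A minimal, non-limiting sketch of drawing one training image from several scene datasets with preconfigured sampling weights is shown below; the data structures are assumptions, and the images are assumed to have been masked beforehand.

```python
import random

def pick_training_image(datasets, weights):
    """Draws one dataset according to its weight, then one image from it."""
    dataset = random.choices(datasets, weights=weights, k=1)[0]
    return random.choice(dataset)
```

In a PyTorch data pipeline, torch.utils.data.WeightedRandomSampler offers a comparable per-sample weighting mechanism.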
Regarding the apparatus 100 in the above embodiments, the specific manner in which each module performs operations has been described in detail in the embodiments of the related methods, and will not be elaborated here.
Fig. 16 is a block diagram of an image depth prediction device 200 according to an exemplary embodiment. For example, the device 200 may be a terminal device such as a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, a fitness device, a personal digital assistant, or a RedCap terminal.
Referring to Fig. 16, the device 200 may include one or more of the following components: a processing component 202, a memory 204, a power component 206, a multimedia component 208, an audio component 210, an input/output (I/O) interface 212, a sensor component 214, and a communication component 216.
The processing component 202 generally controls the overall operation of the device 200, such as operations associated with display, telephone calls, data communication, camera operation, and recording operation. The processing component 202 may include one or more processors 220 to execute instructions to complete all or part of the steps of the above method. In addition, the processing component 202 may include one or more modules to facilitate interaction between the processing component 202 and other components. For example, the processing component 202 may include a multimedia module to facilitate interaction between the multimedia component 208 and the processing component 202.
The memory 204 is configured to store various types of data to support operation of the device 200. Examples of such data include instructions for any application or method operating on the device 200, contact data, phonebook data, messages, pictures, videos, and the like. The memory 204 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
The power component 206 provides power to the various components of the device 200. The power component 206 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device 200.
The multimedia component 208 includes a screen providing an output interface between the device 200 and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or swipe action, but also detect the duration and pressure associated with the touch or swipe action. In some embodiments, the multimedia component 208 includes a front camera and/or a rear camera. When the device 200 is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or have focal length and optical zoom capability.
The audio component 210 is configured to output and/or input audio signals. For example, the audio component 210 includes a microphone (MIC) configured to receive external audio signals when the device 200 is in an operating mode, such as a call mode, a recording mode, or a speech recognition mode. The received audio signals may be further stored in the memory 204 or sent via the communication component 216. In some embodiments, the audio component 210 also includes a speaker for outputting audio signals.
The I/O interface 212 provides an interface between the processing component 202 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, volume buttons, a start button, and a lock button.
The sensor component 214 includes one or more sensors for providing status assessments of various aspects of the device 200. For example, the sensor component 214 can detect the on/off state of the device 200 and the relative positioning of components, such as the display and keypad of the device 200. The sensor component 214 can also detect a change in position of the device 200 or a component of the device 200, the presence or absence of user contact with the device 200, the orientation or acceleration/deceleration of the device 200, and temperature changes of the device 200. The sensor component 214 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 214 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 214 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 216 is configured to facilitate wired or wireless communication between the device 200 and other devices. The device 200 may access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, or 5G, or a combination thereof. In an exemplary embodiment, the communication component 216 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 216 also includes a near field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the device 200 may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic elements for executing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions, such as the memory 204 including instructions, is also provided; the instructions can be executed by the processor 220 of the device 200 to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
Fig. 17 is a schematic diagram of a deep network model training device 300 according to an exemplary embodiment. For example, the device 300 may be provided as a server. It can be understood that the device 300 can be used to train the deep network model. Referring to Fig. 17, the device 300 includes a processing component 322, which further includes one or more processors, and memory resources, represented by a memory 332, for storing instructions executable by the processing component 322, such as application programs. The application programs stored in the memory 332 may include one or more modules, each corresponding to a set of instructions. In addition, the processing component 322 is configured to execute the instructions to carry out the process of training the deep network model in the above method.
The device 300 may also include a power component 326 configured to perform power management of the device 300, a wired or wireless network interface 350 configured to connect the device 300 to a network, and an input/output (I/O) interface 358. The device 300 may operate based on an operating system stored in the memory 332, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, or the like.
The depth estimation network designed in the present disclosure is lightweight and highly accurate, and can be deployed in handheld device scenarios.
Further, a large number of depth datasets are collected, and different sampling weights are set for these datasets of different sizes to balance the data distribution. A normalization operation is applied to data with different scales and shifts so that training can be completed in the same dimension. Meanwhile, an efficient semantic information acquisition module, EASPP, is designed in the depth estimation network to capture semantic information with fewer parameters and less computation. The present disclosure also uses a multi-dimensional depth loss function to improve the quality of depth prediction at object edges.
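A hedged sketch of one possible normalization for depth labels with differing scale and shift is given below; the use of the per-image median as the shift and the mean absolute deviation as the scale is an assumption introduced for illustration, not the normalization actually specified by the disclosure.

```python
import torch

def normalize_depth(d, eps=1e-6):
    """Maps per-image depth (N, 1, H, W) to a common scale and shift."""
    shift = d.flatten(1).median(dim=1).values.view(-1, 1, 1, 1)
    scale = (d - shift).abs().flatten(1).mean(dim=1).view(-1, 1, 1, 1)
    return (d - shift) / (scale + eps)
```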
The above solution of the present disclosure solves the problem that depth estimation models are time-consuming in handheld device scenarios and enables deployment on mobile phones; the floating-point model can reach 150 ms on the Qualcomm 7325 platform. Meanwhile, the present disclosure can achieve depth estimation in arbitrary scenes and has stronger generalization ability. The present disclosure can also accurately predict depth at portrait edges, plant boundaries, and the like, meeting the requirements of bokeh (background blur) shooting scenarios for depth estimation results.
It can be further understood that in the present disclosure, "multiple" refers to two or more, and other quantifiers are similar. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, A and/or B may indicate three cases: A exists alone, A and B exist simultaneously, and B exists alone. The character "/" generally indicates that the associated objects have an "or" relationship. The singular forms "a", "the", and "said" are also intended to include the plural forms, unless the context clearly indicates otherwise.
It can be further understood that the terms "first", "second", and the like are used to describe various information, but such information should not be limited to these terms. These terms are only used to distinguish information of the same type from one another and do not indicate a particular order or degree of importance. In fact, the expressions "first", "second", and the like are fully interchangeable. For example, without departing from the scope of the present disclosure, the first information may also be referred to as the second information, and similarly, the second information may also be referred to as the first information.
It can be further understood that although the operations in the embodiments of the present disclosure are described in a specific order in the drawings, this should not be understood as requiring that the operations be performed in the specific order shown or in a serial order, or that all of the operations shown must be performed to obtain the desired results. In certain circumstances, multitasking and parallel processing may be advantageous.
Other embodiments of the present disclosure will readily occur to those skilled in the art after considering the specification and practicing the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present disclosure that follow the general principles of the present disclosure and include common knowledge or customary technical means in the technical field not disclosed in the present disclosure.
It should be understood that the present disclosure is not limited to the precise structures described above and shown in the accompanying drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present disclosure is limited only by the appended claims.

Claims (16)

  1. An image depth prediction method, characterized in that the method comprises:
    acquiring an image to be processed;
    inputting the image to be processed into a deep network model and predicting an image depth of the image to be processed, the deep network model being composed of multiple layers of depthwise separable convolutions;
    wherein the deep network model is obtained by training with a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter;
    the error weight parameter is used to characterize a weight of a difference between an estimated depth and a label depth, the depth gradient error is used to characterize a gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize label depth differences corresponding to different positions in a training image; the estimated depth is determined by the deep network model based on the training image during a training stage, and the label depth corresponds to the training image.
  2. The method according to claim 1, characterized in that the target loss function is determined as follows:
    determining the target loss function according to at least one of a first loss function, a second loss function, and a third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  3. The method according to claim 2, characterized in that the first loss function is determined as follows:
    determining an absolute error between the estimated depth and the label depth according to the estimated depth and the label depth;
    determining the first loss function according to the absolute error and the error weight parameter.
  4. The method according to claim 2 or 3, characterized in that the depth gradient error is determined as follows:
    determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;
    determining label depth gradients in the at least two directions according to the label depth and the gradient function;
    determining the depth gradient errors in the at least two directions according to the estimated depth gradients in the at least two directions and the label depth gradients in the at least two directions;
    the second loss function is determined as follows:
    determining the second loss function according to the depth gradient errors in the at least two directions.
  5. The method according to any one of claims 2 to 4, characterized in that the depth structure loss parameter is determined as follows:
    determining multiple training data groups, wherein each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth comprises training labels corresponding to the at least two pixels in the training image;
    for each training data group, determining the depth structure loss parameter according to the label depths corresponding to the at least two pixels;
    the third loss function is determined as follows:
    for each training data group, determining the third loss function according to the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
  6. The method according to claim 5, characterized in that determining the multiple training data groups comprises:
    determining gradient boundaries of the training image using a preset gradient function;
    sampling the training image according to the gradient boundaries to determine the multiple training data groups.
  7. The method according to any one of claims 1 to 6, characterized in that the training image is determined as follows:
    determining at least one training image set under arbitrary scenes, the images in the training image set being images in which a set region has been masked;
    determining the training image from the at least one training image set according to preconfigured sampling weights.
  8. An image depth prediction apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to acquire an image to be processed;
    a prediction module, configured to input the image to be processed into a deep network model and predict an image depth of the image to be processed, the deep network model being composed of multiple layers of depthwise separable convolutions;
    wherein the deep network model is obtained by training with a target loss function, and the target loss function is determined based on at least one of an error weight parameter, a depth gradient error, and a depth structure loss parameter;
    the error weight parameter is used to characterize a weight of a difference between an estimated depth and a label depth, the depth gradient error is used to characterize a gradient difference between the estimated depth and the label depth, and the depth structure loss parameter is used to characterize label depth differences corresponding to different positions in a training image; the estimated depth is determined by the deep network model based on the training image during a training stage, and the label depth corresponds to the training image.
  9. The apparatus according to claim 8, characterized in that the target loss function is determined as follows:
    determining the target loss function according to at least one of a first loss function, a second loss function, and a third loss function; wherein the first loss function is determined based on the error weight parameter, the second loss function is determined based on the depth gradient error, and the third loss function is determined based on the depth structure loss parameter.
  10. The apparatus according to claim 9, characterized in that the first loss function is determined as follows:
    determining an absolute error between the estimated depth and the label depth according to the estimated depth and the label depth;
    determining the first loss function according to the absolute error and the error weight parameter.
  11. The apparatus according to claim 9 or 10, characterized in that the depth gradient error is determined as follows:
    determining estimated depth gradients in at least two directions according to the estimated depth and a preset gradient function;
    determining label depth gradients in the at least two directions according to the label depth and the gradient function;
    determining the depth gradient errors in the at least two directions according to the estimated depth gradients in the at least two directions and the label depth gradients in the at least two directions;
    the second loss function is determined as follows:
    determining the second loss function according to the depth gradient errors in the at least two directions.
  12. The apparatus according to any one of claims 9 to 11, characterized in that the depth structure loss parameter is determined as follows:
    determining multiple training data groups, wherein each training data group contains at least two pixels, the at least two pixels are pixels of the training image, and the label depth comprises training labels corresponding to the at least two pixels in the training image;
    for each training data group, determining the depth structure loss parameter according to the label depths corresponding to the at least two pixels;
    the third loss function is determined as follows:
    for each training data group, determining the third loss function according to the depth structure loss parameter and the estimated depths corresponding to the at least two pixels.
  13. The apparatus according to claim 12, characterized in that determining the multiple training data groups comprises:
    determining gradient boundaries of the training image using a preset gradient function;
    sampling the training image according to the gradient boundaries to determine the multiple training data groups.
  14. The apparatus according to any one of claims 8 to 13, characterized in that the training image is determined as follows:
    determining at least one training image set under arbitrary scenes, the images in the training image set being images in which a set region has been masked;
    determining the training image from the at least one training image set according to preconfigured sampling weights.
  15. An image depth prediction device, characterized by comprising:
    a processor; and
    a memory for storing instructions executable by the processor;
    wherein the processor is configured to execute the method according to any one of claims 1 to 7.
  16. A non-transitory computer-readable storage medium, wherein when instructions in the storage medium are executed by a processor of a computer, the computer is enabled to execute the method according to any one of claims 1 to 7.
PCT/CN2022/099713 2022-06-20 2022-06-20 Image depth prediction method and apparatus, device, and storage medium WO2023245321A1 (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2022/099713 WO2023245321A1 (en) 2022-06-20 2022-06-20 Image depth prediction method and apparatus, device, and storage medium
CN202280004623.9A CN117616457A (en) 2022-06-20 2022-06-20 Image depth prediction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2022/099713 WO2023245321A1 (en) 2022-06-20 2022-06-20 Image depth prediction method and apparatus, device, and storage medium

Publications (1)

Publication Number Publication Date
WO2023245321A1 true WO2023245321A1 (en) 2023-12-28

Family

ID=89378959

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/099713 WO2023245321A1 (en) 2022-06-20 2022-06-20 Image depth prediction method and apparatus, device, and storage medium

Country Status (2)

Country Link
CN (1) CN117616457A (en)
WO (1) WO2023245321A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109146944A (en) * 2018-10-30 2019-01-04 浙江科技学院 A kind of space or depth perception estimation method based on the revoluble long-pending neural network of depth
CN110738697A (en) * 2019-10-10 2020-01-31 福州大学 Monocular depth estimation method based on deep learning
WO2020234984A1 (en) * 2019-05-21 2020-11-26 日本電気株式会社 Learning device, learning method, computer program, and recording medium
CN112288788A (en) * 2020-10-12 2021-01-29 南京邮电大学 Monocular image depth estimation method
CN114078149A (en) * 2020-08-21 2022-02-22 深圳市万普拉斯科技有限公司 Image estimation method, electronic equipment and storage medium
CN114170286A (en) * 2021-11-04 2022-03-11 西安理工大学 Monocular depth estimation method based on unsupervised depth learning

Also Published As

Publication number Publication date
CN117616457A (en) 2024-02-27

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 202280004623.9

Country of ref document: CN

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22947102

Country of ref document: EP

Kind code of ref document: A1