CN116152584A - Image processing method, device and storage medium


Info

Publication number: CN116152584A
Application number: CN202111355214.7A
Authority: CN (China)
Prior art keywords: image, depth, processed, terminal equipment, depth information
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 王国毅, 周俊伟, 刘小伟, 陈兵
Current assignee: Honor Device Co Ltd
Original assignee: Honor Device Co Ltd
Application filed by Honor Device Co Ltd


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/50: Depth or shape recovery
    • G06T 19/00: Manipulating 3D models or images for computer graphics
    • G06T 19/006: Mixed reality


Abstract

The embodiments of the present application provide an image processing method, an image processing device and a storage medium, and relate to the technical field of image processing. The method includes: a terminal device acquires an image to be processed by using an image acquisition device, the image to be processed being a two-dimensional image; the terminal device inputs the image to be processed into a depth estimation model, where the depth estimation model is used to make the depth information of the image to be processed converge to a plurality of target depth values respectively; the target depth values correspond one-to-one to intervals in the image to be processed; each interval is related to the depth range of the objects in the image to be processed, and different intervals have different depth ranges; the terminal device then outputs the depth information of the image to be processed by using the depth estimation model. Because the target depth values of different intervals in the image to be processed are different, and the depth estimation model converges the depth information of the image to be processed to a plurality of different target depth values of different intervals, the depth information of the image to be processed predicted by the depth estimation model is closer to the actual value.

Description

Image processing method, device and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image processing method, an image processing device, and a storage medium.
Background
Terminal devices that use augmented reality (augmented reality, AR) technology may also be referred to as AR devices. An AR device may acquire depth information of an image representing a real scene using a monocular depth estimation method, fuse a virtual object with the image based on the acquired depth information, and display the fused image. The virtual object may be an object represented by an image, for example an animal, a building, a person or a landscape. The monocular depth estimation method is a method for predicting the depth information of each pixel in an image from the image itself. The depth information is the distance between an object in the real scene and the image acquisition device that captures the real scene.
Currently, the depth information of an image acquired by an AR device through a monocular depth estimation method has poor accuracy. As a result, when the AR device fuses a virtual object with the image based on the acquired depth information, problems may occur such as the position of the virtual object deviating from a preset position and/or an inaccurate occlusion boundary between the virtual object and a real object.
Disclosure of Invention
The embodiment of the application provides an image processing method, an image processing device and a storage medium, relates to the technical field of image processing, and is beneficial to improving the accuracy of monocular depth estimation.
In a first aspect, an embodiment of the present application provides an image processing method, including: a terminal device acquires an image to be processed by using an image acquisition device, the image to be processed being a two-dimensional image; the terminal device inputs the image to be processed into a depth estimation model, where the depth estimation model is used to make the depth information of the image to be processed converge to a plurality of target depth values respectively; the target depth values correspond one-to-one to intervals in the image to be processed; each interval is related to the depth range of the objects in the image to be processed, and different intervals have different depth ranges; and the terminal device acquires the depth information of the image to be processed by using the depth estimation model.
In the embodiments of the present application, when the depth estimation model performs depth prediction on an image, the depth information of the image to be processed converges to a plurality of target depth values respectively, one target depth value corresponds to one interval in the image, and different intervals have different depth ranges. As a result, the depth information of the pixels in the image does not collapse to a single average value; the depth information obtained by the depth estimation model for both far and near pixels is closer to the true value, which improves the accuracy of the terminal device when it estimates the depth information of an image with a monocular depth estimation method.
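Illustratively (and not as part of the claimed method), the following PyTorch sketch shows one common way in which per-pixel depth can be made to converge toward per-interval target depth values instead of a single average: interval probabilities are combined with interval midpoints. The bin count, tensor shapes and the use of softmax are assumptions for illustration only.

```python
import torch
import torch.nn.functional as F

def depth_from_intervals(logits: torch.Tensor, bin_edges: torch.Tensor) -> torch.Tensor:
    """Combine per-interval probabilities with per-interval target depth values.

    logits:    (N, K, H, W) scores of each pixel belonging to each of K intervals.
    bin_edges: (N, K + 1) depth-range boundaries of the K intervals (increasing).
    Returns a (N, 1, H, W) depth map.
    """
    probs = F.softmax(logits, dim=1)                        # probability of each interval per pixel
    centers = 0.5 * (bin_edges[:, :-1] + bin_edges[:, 1:])  # target depth value = interval midpoint
    # Weighted sum over intervals: depth converges toward the centers of likely intervals.
    depth = torch.einsum("nkhw,nk->nhw", probs, centers)
    return depth.unsqueeze(1)

# Example with assumed shapes: 1 image, 64 intervals over a 0-10 m depth range.
logits = torch.randn(1, 64, 480, 640)
edges = torch.linspace(0.0, 10.0, 65).unsqueeze(0)
print(depth_from_intervals(logits, edges).shape)  # torch.Size([1, 1, 480, 640])
```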
In a possible implementation manner, the depth estimation model is obtained by iteratively training a neural network model according to a plurality of sample data and a loss function until convergence; the sample data includes a sample image, labeled depth information of the sample image and a labeled interval number of the sample image; the input of the neural network model includes the sample image; the output of the neural network model includes predicted depth information of the sample image and predicted depth characteristic values of the intervals in the sample image; the loss function is used for representing the similarity between the labeled depth information and the predicted depth information, and the similarity between the predicted depth characteristic values and the labeled depth characteristic values; the labeled depth characteristic values include a plurality of first depth values; the first depth values correspond one-to-one to first intervals in the sample image; the first intervals are obtained by dividing the sample image according to the labeled interval number.
In another possible implementation manner, the neural network model includes a feature extraction model and an interval division network; the input of the feature extraction model includes the sample image; the output of the feature extraction model includes a feature map of the sample image; the feature map is used for representing pixel-level features of the sample image; the input of the interval division network is the feature map; the output of the interval division network includes an attention feature map and an interval division number; the attention feature map includes depth-related features in the sample image; the predicted depth information is obtained by inputting the interval division number and the attention feature map into other layers of the neural network model; the other layers are the neural network layers other than the feature extraction model and the interval division network in the neural network model; the predicted depth characteristic values include a plurality of second depth values; the second depth values correspond one-to-one to second intervals in the sample image; the second intervals are obtained by dividing the sample image according to the interval division number.
In another possible implementation, the feature extraction model includes an encoding structure and a decoding structure; the input of the coding structure is a sample image; the output of the coding structure is a feature vector used for representing semantic information of the sample image; the input of the decoding structure is a feature vector; the output of the decoding structure is a feature map; the encoding structure and the decoding structure are connected using a jump connection. The jump connection may combine the pixel level feature map with high level semantics to reduce the disappearance of features.
In another possible implementation, the interval-dividing network is a network model based on an attention mechanism.
In another possible implementation manner, the method further includes: the terminal device acquires a plurality of sample data; the terminal device inputs the sample image into the neural network model to obtain predicted depth information of the sample image; the terminal device obtains a predicted depth map of the sample image according to the predicted depth information, and obtains a labeled depth map of the sample image according to the labeled depth information; the terminal device determines a first loss between a first normal vector in the predicted depth map and a corresponding second normal vector in the labeled depth map; the terminal device determines a second loss according to the labeled depth information and the predicted depth information of the sample image; the terminal device determines a third loss according to the plurality of first depth values in the sample image and the plurality of second depth values in the sample image; the terminal device performs weighted summation on the first loss, the second loss and the third loss to obtain a total loss; and the terminal device iteratively updates the weight parameters of the neural network model according to the total loss to obtain the depth estimation model. In this way, training of the depth estimation model focuses not only on the loss between the predicted depth information and the labeled depth information, but also on the loss between the normal vectors in the predicted depth map and the labeled depth map and the loss between the depth values of each interval, so that the depth information obtained by the trained depth estimation model is closer to the true value.
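Illustratively, the following PyTorch sketch shows one possible form of the three losses and their weighted summation described above; the finite-difference normal computation, the L1 distances, the assumption of equal interval counts and the weights are illustrative choices, not the exact formulation of the embodiments.

```python
import torch
import torch.nn.functional as F

def normals_from_depth(depth: torch.Tensor) -> torch.Tensor:
    """Approximate surface normals of a (N, 1, H, W) depth map via finite differences."""
    dzdx = F.pad(depth[:, :, :, 1:] - depth[:, :, :, :-1], (0, 1, 0, 0))
    dzdy = F.pad(depth[:, :, 1:, :] - depth[:, :, :-1, :], (0, 0, 0, 1))
    n = torch.cat([-dzdx, -dzdy, torch.ones_like(depth)], dim=1)
    return F.normalize(n, dim=1)

def total_loss(pred_depth, gt_depth, pred_centers, gt_centers, w1=1.0, w2=1.0, w3=0.5):
    # First loss: discrepancy between normal vectors of the predicted and labeled depth maps.
    cos = (normals_from_depth(pred_depth) * normals_from_depth(gt_depth)).sum(dim=1)
    loss_normal = (1.0 - cos).mean()
    # Second loss: discrepancy between predicted and labeled depth information.
    loss_depth = F.l1_loss(pred_depth, gt_depth)
    # Third loss: discrepancy between predicted and labeled per-interval depth values
    # (assumed here to have the same number of intervals, for illustration).
    loss_bins = F.l1_loss(pred_centers, gt_centers)
    # Weighted summation gives the total loss used to update the network weights.
    return w1 * loss_normal + w2 * loss_depth + w3 * loss_bins
```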
In another possible implementation, the first depth value is the intermediate value of a first depth range, the first depth range being the depth range corresponding to the first interval; the second depth value is the intermediate value of a second depth range, the second depth range being the depth range corresponding to the second interval. The intermediate value of a depth range better represents the average of that range, so the depth information predicted for each interval of the image is closer to the true value.
In another possible implementation manner, before the terminal device inputs the image to be processed into the depth estimation model, the method further includes: the terminal equipment receives special effect selection operation; the terminal equipment responds to the special effect selection operation to display icons of at least one virtual object; the terminal equipment receives triggering operation aiming at the icon of the virtual object to be added; the terminal equipment responds to triggering operation and marks a virtual object to be added on an interface of the terminal equipment; the terminal equipment receives position selection operation aiming at a target position in the image to be processed; the terminal device obtains depth information of an image to be processed by using a depth estimation model, and the method comprises the following steps: and the terminal equipment responds to the position selection operation and acquires depth information of the image to be processed by using the depth estimation model.
In another possible implementation manner, the method further includes: the terminal device fuses the image to be processed, the depth information of the image to be processed, the target position and the virtual object to be added to obtain a target image; the target image is the image obtained after the virtual object to be added is added at the target position of the image to be processed; and the terminal device displays the target image. Because the depth information of the image to be processed is predicted more accurately, the position of the virtual object in the fused image is more accurate, and the occlusion boundary between the virtual object and the real object is more accurate.
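Illustratively, the following NumPy sketch shows why accurate depth information matters for this fusion step: the virtual object is composited with a per-pixel depth test, so errors in the predicted depth directly shift the occlusion boundary. The array names and the per-pixel object depth are assumptions for illustration only.

```python
import numpy as np

def fuse_virtual_object(scene_rgb, scene_depth, obj_rgb, obj_depth, obj_mask):
    """Naive depth-tested compositing of a virtual object into an image of a real scene.

    scene_rgb:   (H, W, 3) image to be processed
    scene_depth: (H, W)    depth information predicted by the depth estimation model
    obj_rgb:     (H, W, 3) rendered virtual object, already placed at the target position
    obj_depth:   (H, W)    depth assigned to the virtual object at each pixel
    obj_mask:    (H, W)    True where the virtual object has coverage
    """
    # The virtual object is visible only where it lies in front of the real scene;
    # an inaccurate scene_depth therefore produces wrong occlusion boundaries.
    visible = obj_mask & (obj_depth < scene_depth)
    target = scene_rgb.copy()
    target[visible] = obj_rgb[visible]
    return target
```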
In a second aspect, an embodiment of the present application provides an image processing apparatus, including a storage module and a processing module. The storage module is used for storing a depth estimation model; the depth estimation model is used for making the depth information of a two-dimensional image converge to a plurality of target depth values respectively; the target depth values correspond one-to-one to intervals in the two-dimensional image; each interval is related to the depth range of the objects in the two-dimensional image, and different intervals have different depth ranges. The processing module is used for: acquiring a two-dimensional image by using an image acquisition device; inputting the two-dimensional image into the depth estimation model; and obtaining depth information of the image to be processed by using the depth estimation model.
Optionally, the depth estimation model is obtained by iteratively training a neural network model according to a plurality of sample data and a loss function until convergence; the sample data includes a sample image, labeled depth information of the sample image and a labeled interval number of the sample image; the input of the neural network model includes the sample image; the output of the neural network model includes predicted depth information of the sample image and predicted depth characteristic values of the intervals in the sample image; the loss function is used for representing the similarity between the labeled depth information and the predicted depth information, and the similarity between the predicted depth characteristic values and the labeled depth characteristic values; the labeled depth characteristic values include a plurality of first depth values; the first depth values correspond one-to-one to first intervals in the sample image; the first intervals are obtained by dividing the sample image according to the labeled interval number.
Optionally, the neural network model includes a feature extraction model and an interval division network; the input of the feature extraction model includes the sample image; the output of the feature extraction model includes a feature map of the sample image; the feature map is used for representing pixel-level features of the sample image; the input of the interval division network is the feature map; the output of the interval division network includes an attention feature map and an interval division number; the attention feature map includes depth-related features in the sample image; the predicted depth information is obtained by inputting the interval division number and the attention feature map into other layers of the neural network model; the other layers are the neural network layers other than the feature extraction model and the interval division network in the neural network model; the predicted depth characteristic values include a plurality of second depth values; the second depth values correspond one-to-one to second intervals in the sample image; the second intervals are obtained by dividing the sample image according to the interval division number.
Optionally, the feature extraction model includes an encoding structure and a decoding structure; the input of the coding structure is a sample image; the output of the coding structure is a feature vector used for representing semantic information of the sample image; the input of the decoding structure is a feature vector; the output of the decoding structure is a feature map; the encoding structure and the decoding structure are connected using a jump connection.
Optionally, the interval-dividing network is a network model based on an attention mechanism.
Optionally, the processing module is further used for: acquiring a plurality of sample data; inputting the sample image into the neural network model to obtain predicted depth information of the sample image; obtaining a predicted depth map of the sample image according to the predicted depth information, and obtaining a labeled depth map of the sample image according to the labeled depth information; determining a first loss between a first normal vector in the predicted depth map and a corresponding second normal vector in the labeled depth map; determining a second loss according to the labeled depth information and the predicted depth information of the sample image; determining a third loss according to the plurality of first depth values in the sample image and the plurality of second depth values in the sample image; performing weighted summation on the first loss, the second loss and the third loss to obtain a total loss; and iteratively updating the weight parameters of the neural network model according to the total loss to obtain the depth estimation model.
Optionally, the first depth value is the intermediate value of a first depth range, the first depth range being the depth range corresponding to the first interval; the second depth value is the intermediate value of a second depth range, the second depth range being the depth range corresponding to the second interval.
Optionally, before the terminal device inputs the image to be processed into the depth estimation model, the processing module is further configured to: receiving a special effect selection operation; displaying an icon of at least one virtual object in response to the special effect selection operation; receiving a triggering operation aiming at an icon of a virtual object to be added; marking a virtual object to be added on an interface of the terminal equipment in response to triggering operation; receiving a position selection operation for a target position in an image to be processed; the processing module is specifically used for: depth information of the image to be processed is acquired using a depth estimation model in response to the position selection operation.
The processing module is also used for: according to the image to be processed, the depth information of the image to be processed, the target position and the virtual object to be added, fusing to obtain a target image; the target image is an image after the virtual object to be added is added to the target position of the image to be processed; the image processing apparatus further includes a display module for displaying the target image.
In a third aspect, embodiments of the present application provide an electronic device, including a memory for storing a computer program and a processor for executing the computer program to perform the image processing method described in the first aspect or any one of the possible implementations of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored therein a computer program or instructions which, when run on a computer, cause the computer to perform the image processing method described in the first aspect or any one of the possible implementations of the first aspect.
In a fifth aspect, embodiments of the present application provide a computer program product comprising a computer program which, when run on a computer, causes the computer to perform the image processing method described in the first aspect or any one of the possible implementations of the first aspect.
In a sixth aspect, the present application provides a chip or chip system comprising at least one processor and a communication interface, the communication interface and the at least one processor being interconnected by wires, the at least one processor being adapted to execute a computer program or instructions to perform the image processing method described in the first aspect or any one of the possible implementations of the first aspect. The communication interface in the chip can be an input/output interface, a pin, a circuit or the like.
In one possible implementation, the chip or chip system described above in the present application further includes at least one memory, where the at least one memory has instructions stored therein. The memory may be a memory unit within the chip, such as a register, a cache, etc., or may be a memory unit of the chip (e.g., a read-only memory, a random access memory, etc.).
It should be understood that, the second aspect to the sixth aspect of the present application correspond to the technical solutions of the first aspect of the present application, and the beneficial effects obtained by each aspect and the corresponding possible embodiments are similar, and are not repeated.
Drawings
Fig. 1 is a schematic structural diagram of a terminal device to which an embodiment of the present application is applicable;
Fig. 2 is a software architecture block diagram of a terminal device to which an embodiment of the present application is applicable;
Fig. 3 is a schematic diagram of the learning flow of a depth estimation model in an image processing method according to an embodiment of the present application;
Fig. 4 is a schematic flow chart of an image processing method according to an embodiment of the present application;
Fig. 5 is a schematic diagram of a display interface of a terminal device to which an embodiment of the present application is applicable;
Fig. 6 is a schematic diagram of a shooting preview interface of a terminal device to which an embodiment of the present application is applicable;
Fig. 7 is a schematic diagram of an image preview interface of a terminal device to which an embodiment of the present application is applicable;
Fig. 8 is a schematic diagram of an operation of selecting a virtual object according to an embodiment of the present application;
Fig. 9 is a schematic diagram of an operation of selecting an addition location for a virtual object to be added according to an embodiment of the present application;
Fig. 10 is a schematic diagram of an AR effect adding operation according to an embodiment of the present application;
Fig. 11 is a flowchart of another image processing method according to an embodiment of the present application;
Fig. 12 is a flowchart of another image processing method according to an embodiment of the present application;
Fig. 13 is a schematic diagram of an image to be processed to which an embodiment of the present application is applicable;
Fig. 14 is a schematic diagram of a feature map of an image to be processed according to an embodiment of the present application;
Fig. 15 is a schematic diagram of a depth map obtained by an image processing method according to an embodiment of the present application;
Fig. 16 is a schematic diagram of an image processing apparatus according to an embodiment of the present application;
Fig. 17 is a schematic structural diagram of a chip according to an embodiment of the present application.
Detailed Description
In order to facilitate the clear description of the technical solutions of the embodiments of the present application, the following simply describes some terms and techniques related to the embodiments of the present application:
1) Neural network
A neural network may be composed of neural units. It can be understood in particular as a neural network having an input layer, a hidden layer and an output layer; in general, the first layer is the input layer, the last layer is the output layer, and the middle layers are all hidden layers. A neural network with many hidden layers is called a deep neural network (deep neural network, DNN). The operation of each layer in the neural network can be described by a mathematical expression; from a physical point of view, the operation of each layer can be understood as converting an input space (a set of input vectors) into an output space (i.e., from the row space to the column space of a matrix) through five operations: 1. dimension raising/reduction; 2. zooming in/out; 3. rotation; 4. translation; 5. bending. The purpose of training the neural network is ultimately to obtain the weight matrices of all layers of the trained neural network. Thus, the training process of a neural network is essentially about learning how to control the spatial transformation, and more specifically about learning the weight matrices.
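Illustratively, the per-layer operation described above can be written as y = a(Wx + b); the tiny NumPy example below (arbitrary sizes, tanh as the nonlinearity) makes the five operations concrete.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 3))   # weight matrix learned during training (raises dimension 3 -> 4)
b = rng.standard_normal(4)        # bias: translation of the input space
x = rng.standard_normal(3)        # input vector

y = np.tanh(W @ x + b)            # scale/rotate via W, translate via b, then "bend" via tanh
print(y.shape)                    # (4,)
```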
It should be noted that, in the embodiments of the present application, the neural network is essentially based on a learning model (which may also be referred to as a learner, a model, etc.) employed by a task of machine learning (e.g., active learning, supervised learning, unsupervised learning, semi-supervised learning, etc.).
2) Loss function (loss function)
In the process of training the neural network, because the output of the neural network is expected to be as close as possible to the value that is actually desired, the weight matrix of each layer of the neural network can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the neural network); for example, if the predicted value of the network is too high, the weight matrices are adjusted so that the prediction becomes lower, and the adjustment is performed continuously until the neural network can predict the actually desired target value. Therefore, it is necessary to define in advance how to compare the difference between the predicted value and the target value; this is the purpose of the loss function (loss function) or objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the neural network becomes a process of reducing this loss as much as possible.
In the training process of the neural network, an error back propagation (back propagation, BP) algorithm can be adopted to correct the parameters in the initial neural network model, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a backward propagation process dominated by the error loss, and aims to obtain the parameters of the optimal neural network model, such as the weight matrices.
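Illustratively, the loss-driven weight update and error back propagation described above can be sketched in PyTorch as follows; the model, data and hyper-parameters are placeholders, not part of the embodiments.

```python
import torch
from torch import nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 1))
loss_fn = nn.MSELoss()                         # measures the difference between prediction and target
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)

x, target = torch.randn(32, 8), torch.randn(32, 1)
for _ in range(100):
    pred = model(x)                            # forward propagation of the input signal
    loss = loss_fn(pred, target)               # higher loss means larger difference
    optimizer.zero_grad()
    loss.backward()                            # back-propagate the error loss
    optimizer.step()                           # update the weight matrices to reduce the loss
```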
3) Features
Features refer to input variables, i.e., the x variables in simple linear regression. A simple machine learning task may use a single feature, while a more complex machine learning task may use millions of features.
4) Other terms
In the embodiments of the present application, the words "first," "second," and the like are used to distinguish between identical or similar items that have substantially the same function and effect. For example, the first chip and the second chip are merely for distinguishing different chips, and the order of the different chips is not limited. It will be appreciated by those of skill in the art that the words "first," "second," and the like do not limit the amount and order of execution, and that the words "first," "second," and the like do not necessarily differ.
It should be noted that, in the embodiments of the present application, words such as "exemplary" or "such as" are used to mean serving as an example, instance, or illustration. Any embodiment or design described herein as "exemplary" or "for example" should not be construed as preferred or advantageous over other embodiments or designs. Rather, the use of words such as "exemplary" or "such as" is intended to present related concepts in a concrete fashion.
In the embodiments of the present application, "at least one" means one or more, and "a plurality" means two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a alone, a and B together, and B alone, wherein a, B may be singular or plural. The character "/" generally indicates that the context-dependent object is an "or" relationship. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
When the terminal device fuses the depth information estimated by the monocular depth estimation method with an image of a real scene, the position of the virtual object may deviate from its preset position, or the occlusion boundary between the virtual object and a real object in the real scene may be inaccurate. For example, during AR photographing with the terminal device, the user selects a dinosaur as the virtual object to be added to the photographed image, and the occlusion boundary between the dinosaur and a tree of the real scene beside it in the photographed image is inaccurate.
This is because the regression-based algorithms currently used in monocular depth estimation tend to converge the depth information to an average value. As a result, in the depth information estimated by the monocular depth estimation method, the depth information of pixels at far positions and of pixels at near positions is estimated inaccurately, which leads to problems such as positional deviation of the virtual object and an inaccurate occlusion boundary between the virtual object and the real object when the virtual object is fused with the image of the real scene.
In view of this, an embodiment of the present application provides an image processing method that uses a trained depth estimation model containing an interval division network. The depth estimation model obtains the depth information of the pixels in an image according to the probability that each pixel belongs to each interval. Because candidate intervals are divided for the depth information of the pixels in the image, and the depth information of each pixel is obtained from the probabilities of the pixel belonging to different intervals, the depth information of the pixels in the image does not converge to an average value. The obtained depth information of pixels that are farther away in the image and of pixels that are closer approaches the true value, which helps to improve the accuracy of the terminal device when it estimates depth information in an image using a monocular depth estimation method.
The image processing method provided by the embodiments of the present application can be applied to a terminal device. In the embodiments of the present application, the terminal device may also be referred to as a terminal, a user equipment (UE), a mobile station (MS), a mobile terminal (MT), or the like. The terminal device may be a mobile phone, a smart television, a wearable device, a tablet (Pad), a desktop computer, a virtual reality (VR) terminal device, an augmented reality (AR) terminal device, a wireless terminal in industrial control, a vehicle-mounted terminal in self-driving, a wireless terminal in remote medical surgery, a wireless terminal in a smart grid, a wireless terminal in transportation safety, a wireless terminal in a smart city, a wireless terminal in a smart home, etc. The embodiments of the present application do not limit the specific technology or the specific device form adopted by the terminal device.
In order to better understand the embodiments of the present application, the structure of the terminal device to which the embodiments of the present application are applied is described below. As shown in fig. 1, which is a schematic structural diagram of a terminal device according to an embodiment of the present application, the terminal device 10 shown in fig. 1 may include a processor 110, a memory 120, a universal serial bus (universal serial bus, USB) interface 130, a power supply 140, a communication module 150, an audio module 170, a sensor module 180, a key 190, a camera 191, a display screen 160, and the like. The sensor module 180 may include a pressure sensor 180A, a fingerprint sensor 180B, a touch sensor 180C, and the like.
It should be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the terminal device 10. In other embodiments of the present application, the terminal device 10 may include more or fewer components than illustrated, some components may be combined, some components may be split, or the components may be arranged differently. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: processor 110 may include an application processor (application processor, AP), modem processor, graphics processor (graphics processing unit, GPU), image signal processor (image signal processor, ISP), controller, digital signal processor (digital signal processor, DSP), baseband processor, etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bidirectional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180C, a charger, a flash, the camera 191, etc., respectively, through different I2C bus interfaces. For example, the processor 110 may be coupled to the touch sensor 180C through an I2C interface, so that the processor 110 and the touch sensor 180C communicate through the I2C bus interface to implement the touch function of the terminal device 10.
The I2S interface may be used for audio communication. In some embodiments, the processor 110 may contain multiple sets of I2S buses. The processor 110 may be coupled to the audio module 170 via an I2S bus to enable communication between the processor 110 and the audio module 170. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through the I2S interface, to implement a function of answering a call through the bluetooth headset.
PCM interfaces may also be used for audio communication to sample, quantize and encode analog signals. In some embodiments, the audio module 170 and the wireless communication module 160 may be coupled through a PCM bus interface. In some embodiments, the audio module 170 may also transmit audio signals to the wireless communication module 160 through the PCM interface to implement a function of answering a call through the bluetooth headset. Both the I2S interface and the PCM interface may be used for audio communication.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as the display screen 160, the camera 191, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 191 communicate via a CSI interface to implement the photographing function of terminal device 10. The processor 110 and the display screen 160 communicate via a DSI interface to implement the display function of the terminal device 10.
The USB interface 130 is an interface conforming to the USB standard specification, and may specifically be a Mini USB interface, a Micro USB interface, a USB Type C interface, or the like. The USB interface 130 may be used to connect a charger to charge the terminal device 10, or may be used to transfer data between the terminal device 10 and a peripheral device. And can also be used for connecting with a headset, and playing audio through the headset. The interface may also be used to connect other electronic devices, such as AR devices, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiment of the present invention is only illustrative, and does not constitute a structural limitation of the terminal device 10. In other embodiments of the present application, the terminal device 10 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The power supply 140 supplies power to the terminal device 10.
The communication module 150 may use any transceiver-like device for communicating with other devices or communication networks, such as a wide area network (wide area network, WAN), local area network (local area networks, LAN), etc.
The terminal device 10 implements display functions through a GPU, a display screen 160, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 160 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 160 is used to display images, videos, and the like. The display screen 160 includes a display panel. The display panel may employ a liquid crystal display (liquid crystal display, LCD), an organic light-emitting diode (organic light-emitting diode, OLED), an active-matrix organic light-emitting diode (active-matrix organic light emitting diode, AMOLED), a flexible light-emitting diode (flex light-emitting diode, FLED), a Mini LED, a Micro LED, a Micro-OLED, quantum dot light-emitting diodes (quantum dot light emitting diodes, QLED), or the like. In some embodiments, the terminal device 10 may include 1 or N display screens 160, N being a positive integer greater than 1.
The terminal device 10 may implement a photographing function through an ISP, a camera 191, a video codec, a GPU, a display screen 160, an application processor, and the like.
The ISP is used to process the data fed back by the camera 191. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 191.
The camera 191 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, the terminal device 10 may include 1 or N cameras 191, N being a positive integer greater than 1. Illustratively, the camera 191 may be used to perform the acquisition of the image to be processed shown in S400 shown in fig. 4.
The digital signal processor is used for processing digital signals, and can process other digital signals in addition to digital image signals. For example, when the terminal device 10 selects a frequency bin, the digital signal processor is used to perform a Fourier transform on the frequency bin energy, or the like.
Memory 120 may be used to store one or more computer programs, including instructions. The processor 110 may cause the terminal device 10 to execute various functional applications, data processing, and the like by executing the above-described instructions stored in the memory 120. The memory 120 may include a stored program area and a stored data area. The storage program area can store an operating system; the storage area may also store one or more applications (e.g., gallery, contacts, etc.), and so forth. For example, the memory 120 may store a program area that may store a depth estimation model trained by the methods provided by embodiments of the present application.
The storage data area may store data created during use of the terminal device 10 (e.g., photos, etc.), and the like. In addition, the memory 120 may include a high-speed random access memory, and may also include a nonvolatile memory, such as at least one magnetic disk storage device, a flash memory device, a universal flash memory (universal flash storage, UFS), and the like. In some embodiments, processor 110 may cause terminal device 10 to perform various functional applications and data processing by executing instructions stored in memory 120, and/or instructions stored in memory provided in processor 110.
The terminal device 10 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The pressure sensor 180A is used to sense a pressure signal, and may convert the pressure signal into an electrical signal. In some embodiments, pressure sensor 180A may be disposed on display screen 160. The pressure sensor 180A is of various types, such as a resistive pressure sensor, an inductive pressure sensor, a capacitive pressure sensor, and the like. The capacitive pressure sensor may be a capacitive pressure sensor comprising at least two parallel plates with conductive material. The capacitance between the electrodes changes when a force is applied to the pressure sensor 180A. The terminal device 10 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 160, the terminal device 10 detects the intensity of the touch operation according to the pressure sensor 180A. The terminal device 10 may also calculate the position of the touch from the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location, but at different touch operation strengths, may correspond to different operation instructions. For example: and executing an instruction for checking the short message when the touch operation with the touch operation intensity smaller than the first pressure threshold acts on the short message application icon. And executing an instruction for newly creating the short message when the touch operation with the touch operation intensity being greater than or equal to the first pressure threshold acts on the short message application icon.
The fingerprint sensor 180B is used to collect a fingerprint. The terminal device 10 can utilize the collected fingerprint characteristics to realize fingerprint unlocking, access application locks, fingerprint photographing, fingerprint incoming call answering and the like.
The touch sensor 180C, also referred to as a "touch device". The touch sensor 180C may be disposed on the display screen 160, and the touch sensor 180C and the display screen 160 form a touch screen, which is also referred to as a "touch screen". The touch sensor 180C is used to detect a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display screen 160. In other embodiments, the touch sensor 180C may also be disposed on a surface of the terminal device 10 at a different location than the display 160.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The terminal device 10 may receive key inputs, generating key signal inputs related to user settings and function controls of the terminal device 10.
The software system of the terminal device 10 may employ a layered architecture, an event-driven architecture, a micro-kernel architecture, a micro-service architecture, or a cloud architecture. In the embodiments of the present application, an Android system with a layered architecture is taken as an example to illustrate the software structure of the terminal device 10. Fig. 2 is a software architecture block diagram of a terminal device applicable to the embodiments of the present application. The layered architecture divides the software system of the terminal device 10 into several layers, each layer having a distinct role and division of labor. The layers communicate with each other through software interfaces. In some embodiments, the Android system may be divided into five layers: an application layer (applications), an application framework layer (application framework), the Android runtime (Android runtime) and system libraries, a hardware abstraction layer (hardware abstraction layer, HAL), and a kernel layer (kernel).
The application layer may include a series of application packages that run applications by calling an application program interface (application programming interface, API) provided by the application framework layer. As shown in fig. 5, the application package may include camera, gallery, calendar, phone, map, navigation, WLAN, bluetooth, music, video, game, etc. applications.
The application framework layer provides APIs and programming frameworks for application programs of the application layer. The application framework layer includes a number of predefined functions. As shown in FIG. 2, the application framework layer may include a window manager, a content provider, a view system, a telephony manager, a resource manager, a notification manager, a database, and so forth.
The window manager is used for managing window programs. The window manager can acquire the size of the display screen, judge whether a status bar exists, lock the screen, intercept the screen and the like. The content provider is used to store and retrieve data and make such data accessible to applications. The data may include video, images, audio, calls made and received, browsing history and bookmarks, phonebooks, etc. The view system includes visual controls, such as controls to display text, controls to display pictures, and the like. The view system may be used to build applications. The display interface may be composed of one or more views. For example, a display interface including a text message notification icon may include a view displaying text and a view displaying a picture. The telephony manager is used to provide the communication functions of the terminal device 10. Such as the management of call status (including on, hung-up, etc.). The resource manager provides various resources for the application program, such as localization strings, icons, pictures, layout files, video files, and the like. The notification manager allows the application to display notification information in a status bar, can be used to communicate notification type messages, can automatically disappear after a short dwell, and does not require user interaction. Such as notification manager is used to inform that the download is complete, message alerts, etc. The notification manager may also be a notification in the form of a chart or scroll bar text that appears on the system top status bar, such as a notification of a background running application, or a notification that appears on the screen in the form of a dialog window. For example, a text message is presented in a status bar, a presentation sound is emitted, the terminal device 10 vibrates, and an indicator light blinks. The database may be used to organize, store, and manage data according to a data structure.
The Android runtime includes a core library and a virtual machine. The Android runtime is responsible for scheduling and managing the Android system. The core library consists of two parts: one part is the functions that the java language needs to call, and the other part is the core library of Android. The application layer and the application framework layer run in the virtual machine. The virtual machine executes the java files of the application layer and the application framework layer as binary files. The virtual machine is used to perform functions such as object lifecycle management, stack management, thread management, security and exception management, and garbage collection.
The system library may include a plurality of functional modules. For example: surface manager (surface manager), media library (media library), three-dimensional graphics processing library, such as: open graphics library (open graphics library, openGL), 2D graphics engine (e.g., SGL), etc., and XX model provided by embodiments of the present application.
The surface manager is used to manage the display subsystem and provides a fusion of 2D and 3D layers for multiple applications. Media libraries support a variety of commonly used audio, video format playback and recording, still image files, and the like. The media library may support a variety of audio video encoding formats, such as: MPEG4, h.264, MP3, AAC, AMR, JPG, PNG, etc. The three-dimensional graphic processing library is used for realizing three-dimensional graphic drawing, graphic rendering, synthesis, layer processing and the like. The 2D graphics engine is a drawing engine for 2D drawing.
The hardware abstraction layer may contain a plurality of library modules, which may include, for example, sensor library modules, etc. The Android system can load a corresponding library module for the equipment hardware, so that the purpose of accessing the equipment hardware by an application program framework layer is achieved.
The kernel layer is a layer between hardware and software. The kernel layer is used for driving the hardware so that the hardware works. The kernel layer at least includes display driver, which is not limited in this embodiment.
The image processing method provided by the embodiment of the application comprises a learning flow of a depth estimation model and a flow of estimating image depth according to the depth estimation model.
Fig. 3 is a schematic diagram of a learning flow of a depth estimation model in an image processing method according to an embodiment of the present application, where the learning flow of the depth estimation model shown in fig. 3 may include the following steps:
s300: the terminal device acquires a plurality of sample data.
In the embodiments of the present application, the sample data includes a sample image, a depth map of the sample image, and an interval division number of the sample image. The interval division number is the number of sub-ranges obtained after dividing the depth range of the image into a plurality of sub-ranges.
In a possible implementation manner, a plurality of sample images including depth information are obtained from the public data set, the sample images and the depth maps of the sample images are obtained, and the number of interval divisions of the sample images is marked.
Exemplarily, the sample data acquired by the terminal device includes 25k sample data sampled from public indoor scene datasets and 25k sample data sampled from public outdoor scene datasets. The indoor scene datasets may include the indoor datasets in the New York University depth dataset (NYU depth dataset v2, NYUDv2) and the Digital Image Media Lab dataset (digital image media lab datasets, DIML); the outdoor scene datasets include the KITTI dataset (Karlsruhe Institute of Technology and Toyota Technological Institute datasets, KITTI) and the outdoor dataset in DIML. The interval division number of each sample image is labeled to obtain 50k sample data. In this way, obtaining sample data from different public datasets enriches the scenes included in the sample images.
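Illustratively, sample tuples (sample image, depth map, labeled interval number) drawn from such datasets might be pooled as sketched below; the loader structure and the heuristic used to label the interval number are assumptions for illustration, not part of the embodiments.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Sample:
    image_path: str       # RGB sample image
    depth_path: str       # corresponding depth map
    num_intervals: int    # labeled interval division number

def build_sample_pool(indoor_items: List[Tuple[str, str, float]],
                      outdoor_items: List[Tuple[str, str, float]]) -> List[Sample]:
    """indoor_items / outdoor_items: (image_path, depth_path, max_depth_m) tuples
    drawn from datasets such as NYUDv2/DIML (indoor) and KITTI/DIML (outdoor)."""
    pool = []
    for image_path, depth_path, max_depth in indoor_items + outdoor_items:
        # Assumed heuristic: deeper scenes are labeled with more intervals.
        num_intervals = max(16, int(max_depth))
        pool.append(Sample(image_path, depth_path, num_intervals))
    return pool
```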
S301: the terminal device processes the sample data into a preset size.
In this embodiment of the present application, the preset size of the image processing in the sample data may be determined according to the resolution of the terminal device to which the depth estimation model is applied.
For example, if the resolution of the terminal device to which the depth estimation model is applied is 640×480 pixels, the terminal device processes the images in the sample data into 640×480 pixels, obtaining a plurality of sample data with a size of 640×480 pixels.
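For illustration, the following is a minimal Python sketch of this resizing step, assuming PIL is available; the 640×480 preset size comes from the example above, while the file paths and the choice of interpolation modes are assumptions.

```python
from PIL import Image
import numpy as np

PRESET_SIZE = (640, 480)  # (width, height), from the example above

def resize_sample(image_path, depth_path, size=PRESET_SIZE):
    """Resize one sample (RGB image + depth map) to the preset size."""
    image = Image.open(image_path).convert("RGB")
    depth = Image.open(depth_path)
    # Bilinear interpolation for the color image; nearest-neighbour for the
    # depth map so that resized pixels keep valid metric depth values.
    image = image.resize(size, Image.BILINEAR)
    depth = depth.resize(size, Image.NEAREST)
    return np.asarray(image), np.asarray(depth, dtype=np.float32)
```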
S302: and the terminal equipment performs data enhancement on the processed plurality of sample data.
In embodiments of the present application, the manner of data enhancement includes at least one of rotation, flipping, scaling, cropping, translation, color change, contrast transformation, scale transformation, or noise perturbation.
In a possible implementation manner, the terminal device acquires a plurality of new sample data by applying at least one data enhancement mode of rotation, flipping, scaling, cropping, translation, color change, contrast transformation, scale transformation, or noise perturbation to the plurality of processed sample data. The terminal device may also interpolate the obtained new sample data to obtain enhanced sample data. A sketch of such paired image/depth augmentation is shown after this step.
It can be understood that the number of the sample data is increased after the terminal device performs the data enhancement processing on the plurality of processed sample data, and the sample data is distributed more uniformly.
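The following is a minimal sketch of such data enhancement using torchvision; the probabilities, rotation range, and jitter factors are illustrative assumptions, and geometric transforms are applied jointly to the image and its depth map so the labels stay aligned.

```python
import random
import torchvision.transforms.functional as TF

def augment_sample(image, depth):
    """Randomly apply some of the enhancement modes listed above to an RGB
    tensor (3, H, W) and its depth tensor (1, H, W)."""
    # Horizontal flip: geometric transforms must hit both image and depth.
    if random.random() < 0.5:
        image, depth = TF.hflip(image), TF.hflip(depth)
    # Small random rotation, applied identically to both tensors.
    if random.random() < 0.5:
        angle = random.uniform(-5.0, 5.0)
        image, depth = TF.rotate(image, angle), TF.rotate(depth, angle)
    # Color and contrast changes affect only the RGB image, not the labels.
    if random.random() < 0.5:
        image = TF.adjust_brightness(image, random.uniform(0.8, 1.2))
        image = TF.adjust_contrast(image, random.uniform(0.8, 1.2))
    return image, depth
```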
S303: the terminal equipment trains a depth estimation model according to the plurality of sample data after data enhancement until convergence.
In the embodiment of the application, the depth estimation model comprises a feature extraction model and an interval division network. The feature extraction model includes an encoding structure and a decoding structure. The input of the encoding structure is an image, and the output is a feature vector used for representing semantic information of the input image. The input of the decoding structure is the feature vector output by the encoding structure, and the output of the decoding structure is a feature map with the same size as the input image. The feature map is used to characterize the pixel-level features of the input image. Illustratively, the output of the decoding structure is a feature matrix of 640 × 480 × 100.
In this embodiment of the present application, the encoding structure may be constructed according to ResNet, ResNeXt, ResNeSt, or UNet. For example, the terminal device may remove the last two layers of ResNet101 to obtain the encoding structure.
In the embodiment of the application, a jump connection is used between the encoding structure and the decoding structure. Illustratively, the jump connection may be a full-scale jump connection. Full-scale jump connections may combine pixel-level feature maps with high-level semantics.
In the embodiment of the present application, the input of the interval division network is a feature map output by the decoding structure, and the output of the interval division network is an attention feature map and an interval division number. The attention profile includes features in the image that are related to depth information.
In a possible implementation, the interval-dividing network is a network model based on an attention mechanism. The attention mechanism-based network model is a network model in which an attention model is added to a neural network structure of the network model. The added attention model may be used to capture the correlation of hidden features in the encoded structure with hidden features in the decoded structure, equivalent to providing a mechanism for the decoded structure to access specific correlated locations in the input feature vector.
The attention mechanism-based network model can obtain image features strongly related to image depth information, so that the accuracy of depth estimation can be improved by using the attention mechanism.
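The following PyTorch sketch illustrates the overall shape of such a model: a ResNet101 encoder with its last two layers removed, a lightweight decoder back to input resolution, and an attention-based interval-division head that produces interval centres and per-pixel interval probabilities. The layer sizes, num_bins, max_depth, and the use of nn.MultiheadAttention are assumptions for illustration only, not the network claimed in this application.

```python
import torch
import torch.nn as nn
import torchvision

class DepthEstimationModel(nn.Module):
    """Illustrative sketch of the described architecture (not the claimed one)."""

    def __init__(self, num_bins=100, max_depth=10.0):
        super().__init__()
        backbone = torchvision.models.resnet101(weights=None)  # torchvision >= 0.13
        self.encoder = nn.Sequential(*list(backbone.children())[:-2])  # (B, 2048, H/32, W/32)
        self.reduce = nn.Conv2d(2048, 128, kernel_size=1)
        # Attention over low-resolution feature tokens ("attention feature map").
        self.attention = nn.MultiheadAttention(embed_dim=128, num_heads=4, batch_first=True)
        self.bin_head = nn.Linear(128, num_bins)
        self.decoder = nn.Sequential(
            nn.Conv2d(128, num_bins, kernel_size=3, padding=1),
            nn.Upsample(scale_factor=32, mode="bilinear", align_corners=False),
        )
        self.max_depth = max_depth

    def forward(self, x):
        feat = self.reduce(self.encoder(x))                    # (B, 128, h, w)
        b, c, h, w = feat.shape
        tokens = feat.flatten(2).transpose(1, 2)               # (B, h*w, 128)
        attn, _ = self.attention(tokens, tokens, tokens)
        attn_map = attn.transpose(1, 2).reshape(b, c, h, w)    # attention feature map
        # Interval division: softmax widths -> cumulative edges -> interval centres.
        widths = torch.softmax(self.bin_head(attn.mean(dim=1)), dim=1)   # (B, num_bins)
        edges = torch.cumsum(widths, dim=1) * self.max_depth
        centres = edges - 0.5 * widths * self.max_depth                  # (B, num_bins)
        probs = torch.softmax(self.decoder(attn_map), dim=1)   # (B, num_bins, H, W)
        return centres, probs
```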
In a possible implementation, the terminal device trains the depth estimation model from the plurality of sample data by the following steps.
Step one: and the terminal equipment predicts the depth information of the sample data according to the current depth estimation model to obtain a depth map comprising the predicted depth information.
Step two: and the terminal equipment determines the image loss according to the depth map of the sample image in the sample data and the depth map predicted by the model.
In the embodiment of the application, the image loss satisfies the formula: Loss = VNL + α × SSI + β × L_BIN. Here, VNL is the normal vector loss value, and VNL satisfies the following formula:
VNL = (1/N) × Σ_{i=1}^{N} || n_i − n_i* ||_1
where n_i is a predicted normal vector, that is, a normal vector obtained by constructing a plane from a first point set in the predicted depth map of the sample image; n_i* is a labeled normal vector, that is, a normal vector obtained by constructing a plane from a second point set in the depth map in the sample data. The first point set and the second point set are both sets of three-dimensional points and correspond to the same pixel points in the sample image. N may be a preset number of normal vectors.
It should be noted that three points are randomly selected from the first point set such that the distance between any two of the three points is greater than or equal to a preset threshold and the included angle between any two edges formed by the selected three points is within a preset range. The three points form a plane, and the plane corresponds to a predicted normal vector. The three points in the second point set corresponding to the pixel positions of the three points in the first point set form another plane, and that plane corresponds to a labeled normal vector.
For example, the three points randomly acquired from the first point set are PA, PB, and PC, which may be used to construct a first plane, resulting in a normal vector n_i of the first plane. PA1, PB1, and PC1 are the three points in the second point set corresponding to PA, PB, and PC, respectively: PA1 and PA correspond to the same pixel point in the sample image, PB1 and PB correspond to the same pixel point in the sample image, and PC1 and PC correspond to the same pixel point in the sample image. PA1, PB1, and PC1 may be used to construct a second plane, resulting in a normal vector n_i* of the second plane.
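A minimal sketch of this normal-vector loss is given below, assuming the predicted and labeled depth maps have already been back-projected into two aligned sets of 3D points; the triplet count and distance threshold are illustrative, and the angle check described above is omitted for brevity.

```python
import torch

def virtual_normal_loss(pred_points, gt_points, num_triplets=100,
                        min_dist=0.1, eps=1e-8):
    """Sample point triplets, build a plane per triplet, and penalise the
    difference between predicted and labeled plane normals.
    pred_points / gt_points: (N, 3) aligned back-projected 3D points."""
    n = pred_points.shape[0]
    idx = torch.randint(0, n, (num_triplets, 3))
    pa, pb, pc = pred_points[idx[:, 0]], pred_points[idx[:, 1]], pred_points[idx[:, 2]]
    ga, gb, gc = gt_points[idx[:, 0]], gt_points[idx[:, 1]], gt_points[idx[:, 2]]

    # Discard degenerate triplets whose points are too close together
    # (a stand-in for the distance/angle checks described in the text).
    valid = ((pa - pb).norm(dim=1) > min_dist) & ((pa - pc).norm(dim=1) > min_dist)

    def plane_normal(a, b, c):
        normal = torch.cross(b - a, c - a, dim=1)
        return normal / (normal.norm(dim=1, keepdim=True) + eps)

    n_pred = plane_normal(pa, pb, pc)[valid]
    n_gt = plane_normal(ga, gb, gc)[valid]
    return (n_pred - n_gt).abs().sum(dim=1).mean()
```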
In the formula Loss = VNL + α × SSI + β × L_BIN, both α and β are empirically set values. SSI is the depth loss. SSI satisfies the following formula:
SSI = (1/(2N)) × Σ_{i=1}^{N} | h(d_i) − h(d_i*) |
where N is the number of pixels, d_i is the depth information of a pixel of the sample image predicted by the current depth estimation model, and d_i* is the depth information of the corresponding pixel in the labeled depth map. h satisfies the following formula:
h(d_i) = (d_i − t(d)) / s(d)
where t(d) and s(d) are the shift and scale statistics of the depth information:

t(d) = median(d_i),  s(d) = (1/N) × Σ_{i=1}^{N} | d_i − t(d) |

Here d_i is the depth information of a pixel of the sample image predicted by the current depth estimation model, and d_i* is the depth information of the corresponding pixel in the labeled depth map.
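For illustration, the sketch below implements a standard scale-and-shift-invariant depth loss of this form; it is an assumption that the SSI term here matches this common formulation, and the exact expression in this application may differ.

```python
import torch

def ssi_loss(pred, target, eps=1e-8):
    """Scale-and-shift-invariant depth loss sketch.
    pred, target: 1-D tensors of depth values for the same pixels."""
    def normalise(d):
        t = d.median()                      # shift statistic t(d)
        s = (d - t).abs().mean() + eps      # scale statistic s(d)
        return (d - t) / s
    return (normalise(pred) - normalise(target)).abs().mean()
```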
L_BIN is the aggregation loss of the intermediate values of the depth information in the intervals. L_BIN satisfies the following formula:
L_BIN = chamfer(X, c(b)) + chamfer(c(b), X)
where X is the set of intermediate depth values of the intervals of the sample image predicted by the depth estimation model, and c(b) is the set of intermediate depth values of the labeled intervals of the sample image in the sample data.
For example, the intervals labeled for the sample image are [0-1], [1-2], [2-3], [3-4], and [4-5]. The intermediate depth value of [0-1] is 0.5, the intermediate depth value of [1-2] is 1.5, and so on, so the set of intermediate depth values of the labeled intervals is (0.5, 1.5, 2.5, 3.5, 4.5). The predicted intervals of the sample image are [0-0.5], [0.5-1], [1-1.5], [1.5-2], [2-2.5], [2.5-3], [3-3.5], [3.5-4], [4-4.5], and [4.5-5], so the set of intermediate depth values of the predicted intervals is (0.25, 0.75, 1.25, 1.75, 2.25, 2.75, 3.25, 3.75, 4.25, 4.75).
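A sketch of this bin-centre loss as a symmetric Chamfer distance is shown below, reproducing the interval example above; using the mean (rather than the sum) of nearest-neighbour distances is an assumption.

```python
import torch

def bin_chamfer_loss(pred_centres, gt_centres):
    """L_BIN sketch: chamfer(X, c(b)) + chamfer(c(b), X) over interval centres.
    pred_centres: (P, 1) predicted centres; gt_centres: (G, 1) labeled centres."""
    dist = torch.cdist(pred_centres, gt_centres)            # (P, G) pairwise distances
    return dist.min(dim=1).values.mean() + dist.min(dim=0).values.mean()

# Interval example from the text: labeled centres 0.5 ... 4.5 versus a finer
# prediction with centres 0.25, 0.75, ..., 4.75.
gt = torch.tensor([[0.5], [1.5], [2.5], [3.5], [4.5]])
pred = torch.arange(0.25, 5.0, 0.5).unsqueeze(1)
print(bin_chamfer_loss(pred, gt))
```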
Step three: the terminal device reverses the determined image loss in the current depth estimation model to update the parameters in the model.
It can be understood that the terminal device repeatedly executes the steps one to three according to a plurality of different sample data until the depth estimation model converges.
It should be noted that, in the embodiment of the present application, the model convergence condition is preset. For example, the model convergence condition may be that the image loss is smaller than a loss threshold.
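Putting steps one to three together, a rough training-loop sketch could look as follows. It reuses the loss sketches above; predict_depth and back_project are hypothetical helpers (the first is sketched later in this description, the second would convert a depth map into 3D points using camera intrinsics), and the α, β, learning rate, and loss threshold are placeholders for the empirically set values mentioned in the text.

```python
import torch

def train_depth_model(model, dataloader, alpha=1.0, beta=0.1,
                      loss_threshold=1e-3, max_epochs=50, lr=1e-4):
    """Sketch: predict, compute Loss = VNL + alpha*SSI + beta*L_BIN,
    back-propagate, repeat until the preset convergence condition is met."""
    optimiser = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(max_epochs):
        for images, gt_depth, gt_centres in dataloader:
            bin_centres, probs = model(images)
            pred_depth = predict_depth(bin_centres, probs)        # hypothetical helper
            loss = (virtual_normal_loss(back_project(pred_depth),  # hypothetical helper
                                        back_project(gt_depth))
                    + alpha * ssi_loss(pred_depth.flatten(), gt_depth.flatten())
                    + beta * bin_chamfer_loss(bin_centres.view(-1, 1),   # batch flattened
                                              gt_centres.view(-1, 1)))   # for simplicity
            optimiser.zero_grad()
            loss.backward()                                       # step three
            optimiser.step()
        if loss.item() < loss_threshold:                          # convergence condition
            break
    return model
```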
After the terminal equipment trains the depth estimation model, the depth estimation of the monocular image can be carried out according to the depth estimation model.
The terminal device for training the depth estimation model may be a computer, for example, a notebook computer or a desktop computer. The depth estimation model may also be trained by using a server in the embodiments of the present application, which is not limited in this embodiment.
The trained depth estimation model can be transplanted to a mobile phone to carry out depth estimation of a monocular image, and can also be transplanted to a vehicle-mounted terminal to carry out depth estimation of the monocular image. This is not limiting in the embodiments of the present application.
Fig. 4 is a schematic flow chart of an image processing method according to an embodiment of the present application, where the image processing method shown in fig. 4 includes the following steps:
S400: the terminal equipment acquires an image to be processed.
In this embodiment of the present application, the image to be processed may be any two-dimensional image acquired by an image acquisition device of the terminal device.
In a possible implementation manner, the terminal device detects a touch operation of a finger (or a touch pen or the like) of a user on an icon for acquiring the image to be processed, and in response to the touch operation, the terminal device opens a shooting preview interface corresponding to the icon, and the acquired image to be processed is displayed in the preview interface of the terminal device.
For example, after detecting that the user's finger touches the camera icon 501 as shown in fig. 5, the terminal device opens the camera application in response to the user's finger touching the camera icon 501, and enters the photographing preview interface. The preview interface displayed by the terminal device may be specifically, for example, the preview interface 502 shown in fig. 6. The preview interface 502 includes an image 502A to be processed in a preview box. Also included in preview interface 502 is AR special effects control 502B, and capture button 502C. The AR special effects control 502B is used to add AR special effects to the image to be processed, and the shooting button 502C is used to trigger the terminal device to shoot a picture in the current preview frame, or to trigger the terminal device to start or stop video shooting.
The workflow of the software and hardware of the terminal device 10 is illustrated herein in connection with a photographing scene. When the touch sensor 180C receives a touch operation, a corresponding hardware interrupt is issued to the kernel layer. The kernel layer processes the touch operation into an original input operation (including information such as the touch coordinates and the time stamp of the touch operation). The original input operation is stored at the kernel layer. The application framework layer acquires the original input operation from the kernel layer and identifies the control corresponding to the original input operation. Taking the touch operation being a click operation and the control corresponding to the click operation being a camera application icon as an example, the camera application calls an interface of the application framework layer to start the camera application, then starts the camera driver by calling the kernel layer, and captures a still image or video through the camera 191.
In another possible implementation manner, the terminal device receives a trigger operation of an edit control of the image to be processed by a user, and displays the image to be processed in a preview frame of the interface in response to the trigger operation.
Illustratively, the terminal device receives a trigger operation of an edit control for an image to be processed by a user, and displays an image to be processed 503A in a preview box of a preview interface 503 as shown in fig. 7 in response to the trigger operation, where the image to be processed 503A includes a person 503A1, a tree 503A2 in the background, and the like. Also included in preview interface 503 is AR special effects control 503B, and save button 503D. The AR special effect control 503B is used for adding an AR special effect to the image to be processed, and the storage button 503D is used for triggering the terminal equipment to store the picture in the current preview frame.
S401: and the terminal equipment receives the selection operation of the virtual object, and determines the virtual object to be added in response to the selection operation.
In the embodiment of the application, the virtual object may be a cartoon image, a virtual animal, a virtual character, or the like.
In a possible implementation manner, the terminal device receives a special effect selection operation, displays at least one virtual object in the interface in response to the special effect selection operation, receives a selection triggering operation of a user on a virtual object icon, and determines a virtual object to be added corresponding to the selection triggering operation in response to the selection triggering operation.
Illustratively, the terminal device receives a user selection operation for the AR special effects control 503B in the preview interface 503 as shown in FIG. 8, and the virtual object icons displayed in the preview interface in response to the special effects selection operation include a first virtual object icon 503B1 and a second virtual object icon 503B2. The terminal device receives a selection triggering operation of the user on the first virtual object icon 503B1, and determines that the virtual object to be added corresponding to the selection triggering operation is dinosaur in response to the selection triggering operation.
S402: the terminal equipment determines the position of the virtual object to be added in the image to be processed.
In this embodiment of the present application, the position of the virtual object to be added in the image to be processed may be the coordinates of the virtual object to be added in the image coordinate system of the image to be processed.
In a possible implementation manner, the terminal device receives a trigger operation of a user on an adding position of the virtual object to be added, and determines the position of the virtual object to be added in the image to be processed in response to the trigger operation.
For example, the terminal device receives a touch operation by the user on the position L1 of the image 503A to be processed in the preview frame of the preview interface 503 shown in fig. 9, and determines the coordinates of the position L1 in the image coordinate system in response to the touch operation.
S403: and the terminal equipment estimates the depth information of the image to be processed according to the depth estimation model.
In a possible implementation manner, the depth estimation model acquires depth information of an image to be processed by:
step one: and inputting the image to be processed into a depth estimation model, and then obtaining a feature vector representing semantic information of the image to be processed through an encoder of a feature extraction model.
For example, a 640 × 480 image to be processed is input into the encoder of the feature extraction model to obtain a 7 × 1024 feature vector.
Step two: and the feature vector representing the semantic information of the image to be processed passes through a decoder of the feature extraction model to obtain a feature map with the same size as the image to be processed.
Step three: After the feature map output in step two passes through the interval division network, the attention feature map and the intermediate value of the depth information in each interval of the image to be processed are output.
Step four: the other layers of the depth estimation model acquire intermediate values of depth information in each interval and probability of each pixel in the image to be processed belonging to each interval according to the interval division number of the image to be processed and the attention feature map, and then acquire the depth information of the image to be processed according to the intermediate values of the depth information in each interval and the probability of each pixel in the image to be processed belonging to each interval.
For example, if the number of section divisions of the acquired image to be processed is 5, the intermediate values of the depth information in each section of the image to be processed are 0.5,1.5,2.5,3.5 and 4.5. The probability that the pixel points of the image to be processed belong to the interval corresponding to the intermediate value 0.5 is 0.8, and the probability that the pixel points belong to the intervals corresponding to the rest intermediate values are 0.05. Then, the depth estimation model predicts that the depth information of the pixel point of the image to be processed is: 0.5×0.8+1.5×0.05+2.5×0.05+3.5×0.05+4.5×0.05=1m. The depth information of the rest pixel points of the image to be processed can also be obtained in this way, and the description is omitted.
Note that, the probability that each pixel belongs to each section may be different.
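A minimal sketch of this weighted-sum computation is shown below; it reproduces the 1 m example above, and the tensor layout (interval centres per batch, probabilities per pixel) is an assumption.

```python
import torch

def predict_depth(bin_centres, probs):
    """Per-pixel depth = sum over intervals of (interval centre x probability).
    bin_centres: (B, K); probs: (B, K, H, W)."""
    return torch.einsum("bk,bkhw->bhw", bin_centres, probs)

# The example above: 5 intervals with centres 0.5 ... 4.5 and probabilities
# 0.8, 0.05, 0.05, 0.05, 0.05 for a single pixel.
centres = torch.tensor([[0.5, 1.5, 2.5, 3.5, 4.5]])
probs = torch.tensor([0.8, 0.05, 0.05, 0.05, 0.05]).view(1, 5, 1, 1)
print(predict_depth(centres, probs))   # tensor([[[1.0000]]])  i.e. 1 m
```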
In the embodiment of the application, during training of the depth estimation model used for depth prediction of the image to be processed, the loss between the normal vectors of the predicted depth map and the depth map of the sample image is considered, the loss between the predicted depth information and the depth information of the labeled depth map is considered, and the loss between the sets of intermediate depth values of the intervals obtained after the interval division network divides the depth information is also considered. Therefore, when the depth estimation model trained by the method provided in this embodiment performs depth prediction on the image to be processed, the depth information of the image to be processed can be divided into a plurality of intervals according to the feature map of the image to be processed, and the depth information of each pixel point, estimated comprehensively according to the probability that the pixel point belongs to each interval, is closer to the real depth information.
S404: and the terminal equipment fuses the image to be processed, the depth information of the image to be processed, the position of the virtual object to be added in the image to be processed and the virtual object to be added to obtain a target image.
In a possible implementation manner, the terminal device determines an occlusion relationship between an object in the image to be processed and the virtual object to be added according to depth information of the image to be processed, the virtual object to be added and a position of the virtual object to be added in the image to be processed, and fuses the image to be processed and the virtual object to be added into a target image.
Following the example of the virtual object to be added selected in S401 and the position selected in S402 in the image to be processed, as shown in fig. 10, the terminal device receives the selection trigger operation of the user on the first virtual object icon 503B1 and determines, in response to the selection trigger operation, that the virtual object to be added is a dinosaur; the terminal device receives the touch operation of the user on the position L1 of the image to be processed 503A in the preview frame of the preview interface 503 and determines the coordinates of the position L1 in the image coordinate system in response to the touch operation; the terminal device then displays the target image obtained by fusing the image to be processed 503A with the virtual object corresponding to the first virtual object icon 503B1.
In the embodiment of the present application, the depth information of the image to be processed estimated by the terminal device according to the depth estimation model in S403 is closer to the real depth information. Therefore, the occlusion boundary between the virtual object and the real object in the target image obtained by fusion by the terminal device is more accurate.
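As an illustration of this occlusion-aware fusion, the sketch below compares the estimated scene depth with the depth of a rendered virtual object and keeps only the object pixels that are closer to the camera; the array layouts, the alpha channel, and the absence of bounds checking are assumptions.

```python
import numpy as np

def fuse_virtual_object(image, scene_depth, obj_rgba, obj_depth, top_left):
    """Sketch of S404: paste a rendered virtual object into the image, hiding
    object pixels that lie behind the estimated scene depth.
    image: (H, W, 3) uint8; scene_depth: (H, W) metres from the depth model;
    obj_rgba: (h, w, 4) rendered object with alpha; obj_depth: (h, w) metres;
    top_left: (row, col) position chosen by the user (e.g. L1)."""
    out = image.copy()
    r0, c0 = top_left
    h, w = obj_depth.shape
    region_depth = scene_depth[r0:r0 + h, c0:c0 + w]
    alpha = obj_rgba[..., 3:] / 255.0
    # A pixel of the virtual object is visible only where the object is
    # closer to the camera than the real scene at that pixel.
    visible = (obj_depth < region_depth)[..., None] * alpha
    region = out[r0:r0 + h, c0:c0 + w].astype(np.float32)
    out[r0:r0 + h, c0:c0 + w] = (visible * obj_rgba[..., :3] +
                                 (1 - visible) * region).astype(np.uint8)
    return out
```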
As shown in fig. 11, which is a flowchart of another image processing method provided in the embodiment of the present application, the image processing method shown in fig. 11 may be applied to a terminal device, and the image processing method shown in fig. 11 includes the following steps:
S110: A dataset is constructed.
Possible implementation manners and examples refer to descriptions in S300-S302, and are not repeated.
S111: A feature extraction model based on a convolutional neural network is constructed.
In this embodiment of the present application, the feature extraction model refers to the related description of the feature extraction model in S303, which is not described in detail.
S112: An interval division network is constructed.
In this embodiment of the present application, the interval-dividing network refers to the description related to the interval-dividing network in S303, which is not described in detail.
S113: the entire network model is trained using the data set.
In the embodiment of the present application, the data set is the data set constructed in S110. The whole network model is a depth estimation model comprising a feature extraction model and an interval division network.
S114: Depth estimation is performed on the input image by using the trained model to obtain a depth map.
In the embodiment of the present application, the trained model is a depth estimation model obtained by training the depth estimation model by using a data set until convergence.
The description of the possible implementation is referred to in S403, and will not be repeated.
In the embodiment of the application, during training of the depth estimation model used for depth estimation of the input image, the loss between the normal vectors of the predicted depth map and the depth map of the sample image is considered, the loss between the predicted depth information and the depth information of the labeled depth map is considered, and the loss between the sets of intermediate depth values of the intervals obtained after the interval division network divides the depth information of the input image is also considered. Therefore, when the depth estimation model trained by the method provided in this embodiment performs depth prediction on the input image, the depth information of the input image can be divided into a plurality of intervals according to the feature map of the input image, and the depth information of each pixel point, estimated comprehensively according to the probability that the pixel point belongs to each interval, is closer to the real depth information.
As shown in fig. 12, which is a flowchart of another image processing method provided in the embodiment of the present application, the image processing method shown in fig. 12 may be applied to a terminal device, and the image processing method shown in fig. 12 includes the following steps:
S120: An image to be processed is acquired.
In one possible implementation manner, the terminal device acquires an image to be processed of an environment where the terminal device is located, the image being acquired by the image acquisition device.
In another possible implementation, the terminal device reads the image to be processed from the gallery.
An exemplary image to be processed is shown in fig. 13.
S121: The image to be processed is input into a trained feature extraction model to obtain a feature map.
In the embodiment of the present application, the trained feature extraction model is a feature extraction model in a depth estimation model trained to be converged. The feature map is the output of the feature extraction model.
Based on the example of the image to be processed in S120, the obtained feature map is shown in fig. 14.
S122: The interval division number and the probability that each pixel in the image to be processed belongs to each interval are acquired according to the trained interval division network.
In the embodiment of the present application, the trained interval-division network is an interval-division network in a depth estimation model trained to be converged.
In a possible implementation manner, the feature map is input into the trained interval division network, which can output the attention feature map and the interval division number, and the probability that each pixel in the image to be processed belongs to each interval is predicted according to the attention feature map and the interval division number.
S123: The depth information of the image to be processed is obtained according to the interval division number and the probability that each pixel in the image to be processed belongs to each interval.
For example, if the number of division of the intervals of the image to be processed is 5, the intermediate values of the depth information in the respective intervals of the image to be processed are 0.5,1.5,2.5,3.5 and 4.5. The probability that the pixels of the image to be processed belong to the interval corresponding to the intermediate value 0.5 is 0.8, and the probability that the pixels belong to the intervals corresponding to the rest intermediate values are 0.05. Then, the depth information of the pixel is obtained as: 0.5×0.8+1.5×0.05+2.5×0.05+3.5×0.05+4.5×0.05=1m. The method for obtaining the depth information of the rest pixels of the image to be processed is similar to the method, and is not repeated.
S124: A depth map of the image to be processed is output according to the depth information of the image to be processed.
Illustratively, a depth map of the image to be processed is shown in fig. 15.
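A rough end-to-end sketch of S120-S124 is given below; it reuses the DepthEstimationModel and predict_depth sketches above (both illustrative assumptions rather than the claimed implementation) and saves a grayscale visualisation as the output depth map.

```python
import numpy as np
import torch
from PIL import Image

def estimate_depth_map(image_path, model, preset_size=(640, 480)):
    """S120-S124 sketch: load the image to be processed, run the trained
    depth estimation model, and write out a depth map."""
    rgb = Image.open(image_path).convert("RGB").resize(preset_size, Image.BILINEAR)
    x = torch.from_numpy(np.array(rgb)).permute(2, 0, 1).float().unsqueeze(0) / 255.0
    model.eval()
    with torch.no_grad():
        bin_centres, probs = model(x)                    # S121-S122
        depth = predict_depth(bin_centres, probs)[0]     # S123
    d = depth.numpy()
    # S124: normalise to 0-255 for a simple grayscale depth-map visualisation.
    vis = ((d - d.min()) / (d.max() - d.min() + 1e-8) * 255).astype(np.uint8)
    Image.fromarray(vis).save("depth_map.png")
    return d
```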
In this embodiment of the application, the depth information of the image to be processed can be divided into a plurality of intervals according to the feature map that affects the depth information of the image to be processed. Both the feature extraction model that obtains this feature map and the interval division network that divides the depth information into a plurality of intervals are parts of the depth estimation model trained to convergence. Therefore, the obtained interval division number and the probabilities of the pixels of the image to be processed in each interval are closer to the real situation, so that the depth information of the pixels obtained according to these probabilities is closer to the real depth information, and the obtained depth map is more accurate.
The foregoing description of the solution provided in the embodiments of the present application has been mainly presented in terms of a method. To achieve the above functions, it includes corresponding hardware structures and/or software modules that perform the respective functions. Those of skill in the art will readily appreciate that the various illustrative method steps described in connection with the embodiments disclosed herein may be implemented as hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The embodiment of the application may divide the functional modules of the apparatus implementing the image processing method according to the above method example, for example, each functional module may be divided corresponding to each function, or two or more functions may be integrated into one processing module. Illustratively, the functions of the target application, the drawing interface, and the display engine are integrated in the display control unit. The integrated modules may be implemented in hardware or in software functional modules. It should be noted that, in the embodiment of the present application, the division of the modules is schematic, which is merely a logic function division, and other division manners may be implemented in actual implementation.
As shown in fig. 16, which is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application, a terminal device 1600 shown in fig. 16 includes a storage module 16001 and a processing module 16002, where the storage module 16001 is used for storing a depth estimation model; the depth estimation model is used for enabling the depth information of the two-dimensional image to be respectively converged on a plurality of target depth values; the target depth value corresponds to the areas in the two-dimensional image one by one; the interval is related to the depth range of the object in the two-dimensional image; the depth ranges of different intervals are different; the processing module 16002 is configured to: acquiring a two-dimensional image by using an image acquisition device; inputting the two-dimensional image into a depth estimation model; and obtaining depth information of the image to be processed by using the depth estimation model. For example, in connection with fig. 3, the storage module 16001 may be used to store a depth estimation model. The processing module 16002 may be used to perform S300-S303. In connection with fig. 4, a processing module 16002 may be used to perform S400-S404. In connection with fig. 12, a processing module 16002 may be used to perform S120-S124.
Optionally, the depth estimation model is obtained by performing iterative training on the neural network model according to a plurality of sample data and a loss function until convergence; the sample data comprises a sample image, the labeling depth information labeled by the sample image and the number of labeling intervals labeled by the sample image; the input of the neural network model includes a sample image; the output of the neural network model comprises the predicted depth information of the sample image and the predicted depth characteristic values of all intervals in the sample image; the loss function is used for representing the similarity between the marked depth information and the predicted depth information and representing the similarity between the predicted depth characteristic value and the marked depth characteristic value; the labeling depth characteristic value comprises a plurality of first depth values; the first depth value corresponds to a first interval in the sample image one by one; the first section is obtained by dividing the sample image according to the number of the marked sections.
Optionally, the neural network model includes a feature extraction model and a section division network; the input of the feature extraction model includes a sample image; the output of the feature extraction model comprises a feature map of the sample image; the feature map is used for representing the features of the pixel level of the sample image; the input of the interval division network is a feature map; the output of the interval division network comprises an attention characteristic graph and an interval division number; the attention profile includes depth-related features in the sample image; the predicted depth information is obtained by inputting the interval division number and the attention characteristic diagram into other layers of the neural network model; the other layers are neural network layers except for the feature extraction model and the interval division network in the neural network model; the predicted depth characteristic value comprises a plurality of second depth values; the second depth value corresponds to a second interval in the sample image one by one; the second section is obtained by dividing the sample image according to the number of section divisions.
Optionally, the feature extraction model includes an encoding structure and a decoding structure; the input of the coding structure is a sample image; the output of the coding structure is a feature vector used for representing semantic information of the sample image; the input of the decoding structure is a feature vector; the output of the decoding structure is a feature map; the encoding structure and the decoding structure are connected using a jump connection.
Optionally, the interval-dividing network is a network model based on an attention mechanism.
Optionally: the processing module 16002 is further configured to obtain a plurality of sample data; inputting the sample image into a neural network model to obtain predicted depth information of the sample image; obtaining a predicted depth map of the sample image according to the predicted depth information, and obtaining a marked depth map of the sample image according to the marked depth information; determining a first normal vector in the predicted depth map and a first loss of a corresponding second normal vector in the marked depth map; determining a second loss according to the labeling depth information and the predicted depth information of the sample image; determining a third loss from the plurality of first depth values in the sample image and the plurality of second depth values in the sample image; weighting and summarizing the first loss, the second loss and the third loss to obtain total loss; and iteratively updating the weight parameters of the neural network model according to the total loss to obtain a depth estimation model.
Optionally, the first depth value is an intermediate value of the first depth range; the first depth range is a depth range corresponding to the first interval; the second depth value is an intermediate value of the second depth range; the second depth range is a depth range corresponding to the second section.
Optionally, the processing module 16002 is further configured to: receiving a special effect selection operation; displaying an icon of at least one virtual object in response to the special effect selection operation; receiving a triggering operation aiming at an icon of a virtual object to be added; marking a virtual object to be added on an interface of the terminal equipment in response to triggering operation; receiving a position selection operation for a target position in an image to be processed; the processing module 16002 specifically is configured to: depth information of the image to be processed is acquired using a depth estimation model in response to the position selection operation.
The processing module 16002 is further configured to fuse the image to be processed, the depth information of the image to be processed, the target position, and the virtual object to be added to obtain a target image; the target image is an image in which the virtual object to be added has been added at the target position of the image to be processed. The image processing apparatus 1600 further includes a display module 16003, configured to display the target image.
In one example, in connection with fig. 1, the functions of the storage module 16001 described above may be implemented by the memory 120 shown in fig. 1, and the functions of the processing module 16002 may be implemented by the processor 110 shown in fig. 1 invoking a computer program stored in the memory 120. The functionality of display module 16003 may be implemented by display screen 160.
Fig. 17 is a schematic structural diagram of a chip according to an embodiment of the present application. Chip 1700 includes one or more (including two) processors 17001, communication lines 17002, and communication interfaces 17003, and optionally, chip 1700 also includes memory 17004.
In some implementations, the memory 17004 stores the following elements: executable modules or data structures, or a subset thereof, or an extended set thereof.
The methods described in the embodiments of the present application may be applied to the processor 17001 or implemented by the processor 17001. The processor 17001 may be an integrated circuit chip with signal processing capabilities. In an implementation, the steps of the methods described above may be completed by an integrated logic circuit in hardware or by instructions in the form of software in the processor 17001. The processor 17001 may be a general purpose processor (for example, a microprocessor or a conventional processor), a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and the processor 17001 may implement or perform the methods, steps, and logic blocks disclosed in the embodiments herein.
The steps of a method disclosed in connection with the embodiments of the present application may be embodied directly in hardware in a decoding processor, or in a combination of hardware and software modules in a decoding processor. The software modules may be located in a mature storage medium in the art, such as a random access memory, a read-only memory, a programmable read-only memory, or an electrically erasable programmable read-only memory (EEPROM). The storage medium is located in the memory 17004, and the processor 17001 reads the information in the memory 17004 and, in combination with its hardware, performs the steps of the above method.
Communication between the processor 17001, the memory 17004, and the communication interface 17003 may be via a communication line 17002.
In the above embodiments, the instructions stored by the memory for execution by the processor may be implemented in the form of a computer program product. The computer program product may be written in the memory in advance, or may be downloaded in the form of software and installed in the memory.
Embodiments of the present application also provide a computer program product comprising one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the processes or functions in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired manner (for example, coaxial cable, optical fiber, or digital subscriber line (DSL)) or a wireless manner (for example, infrared, radio, or microwave). The computer-readable storage medium may be any usable medium accessible to the computer, for example, a magnetic medium, an optical medium, or a semiconductor medium (for example, a solid state disk (SSD)).
An embodiment of the present application provides an electronic device, including a processor and a memory, where the memory is configured to store a computer program, and the processor is configured to execute the computer program to perform any one of the image processing methods described above.
Embodiments of the present application also provide a computer-readable storage medium. The methods described in the above embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. Computer readable media can include computer storage media and communication media and can include any medium that can transfer a computer program from one place to another. The storage media may be any target media that is accessible by a computer.
As one possible design, the computer-readable medium may include compact disk read-only memory (CD-ROM), RAM, ROM, EEPROM, or other optical disk memory; the computer readable medium may include disk storage or other disk storage devices. Moreover, any connection is properly termed a computer-readable medium. For example, if the software is transmitted from a website, server, or other remote source using a coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave, then the coaxial cable, fiber optic cable, twisted pair, DSL, or wireless technologies such as infrared, radio, and microwave are included in the definition of medium. Disk and disc, as used herein, includes Compact Disc (CD), laser disc, optical disc, digital versatile disc (digital versatile disc, DVD), floppy disk and blu-ray disc where disks usually reproduce data magnetically, while discs reproduce data optically with lasers.
Combinations of the above should also be included within the scope of computer-readable media. The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (12)

1. An image processing method, the method comprising:
the terminal equipment acquires an image to be processed by using an image acquisition equipment; the image to be processed is a two-dimensional image;
the terminal equipment inputs the image to be processed into a depth estimation model, wherein the depth estimation model is used for enabling the depth information of the image to be processed to be respectively converged on a plurality of target depth values; the target depth value corresponds to the areas in the image to be processed one by one; the interval is related to the depth range of the object in the image to be processed; the depth ranges of different ones of the zones are different;
and the terminal equipment acquires the depth information of the image to be processed by using the depth estimation model.
2. The image processing method according to claim 1, wherein the depth estimation model is obtained by iteratively training a neural network model according to a plurality of sample data and a loss function until convergence;
the sample data comprises a sample image, labeling depth information labeled by the sample image and the number of labeling intervals labeled by the sample image;
the input of the neural network model includes the sample image; the output of the neural network model comprises predicted depth information of the sample image and predicted depth characteristic values of all intervals in the sample image;
the loss function is used for representing the similarity of the marked depth information and the predicted depth information and representing the similarity of the predicted depth characteristic value and the marked depth characteristic value; the noted depth feature value comprises a plurality of first depth values; the first depth value corresponds to a first interval in the sample image one by one; the first interval is obtained by dividing the sample image according to the number of the marked intervals.
3. The image processing method according to claim 2, wherein the neural network model includes a feature extraction model and a section division network;
The input of the feature extraction model includes the sample image; the output of the feature extraction model includes a feature map of the sample image; the feature map is used for characterizing the pixel level features of the sample image;
the input of the interval division network is the characteristic diagram; the output of the interval division network comprises an attention characteristic graph and an interval division number; the attention profile comprises depth-related features in the sample image;
the predicted depth information is obtained by inputting the interval division number and the attention feature map into other layers of the neural network model; the other layers are neural network layers except the characteristic extraction model and the interval division network in the neural network model;
the predicted depth feature value comprises a plurality of second depth values; the second depth value corresponds to a second interval in the sample image one by one; the second section is obtained by dividing the sample image according to the section dividing number.
4. The image processing method according to claim 3, wherein the feature extraction model includes an encoding structure and a decoding structure; the input of the coding structure is the sample image; the output of the coding structure is a feature vector used for representing semantic information of the sample image;
The input of the decoding structure is the feature vector; the output of the decoding structure is the feature map;
the encoding structure and the decoding structure are connected by a jump connection.
5. The image processing method according to claim 3 or 4, wherein the section dividing network is a network model based on an attention mechanism.
6. The image processing method according to any one of claims 3 to 5, characterized in that the method further comprises:
the terminal equipment acquires the plurality of sample data;
the terminal equipment inputs the sample image into the neural network model to obtain predicted depth information of the sample image;
the terminal equipment obtains a predicted depth map of the sample image according to the predicted depth information, and obtains a marked depth map of the sample image according to the marked depth information;
the terminal equipment determines a first loss of a first normal vector in the predicted depth map and a corresponding second normal vector in the marked depth map;
the terminal equipment determines a second loss according to the labeling depth information and the predicted depth information of the sample image;
the terminal equipment determines a third loss according to a plurality of first depth values in the sample image and a plurality of second depth values in the sample image;
The terminal equipment carries out weighted summation on the first loss, the second loss and the third loss to obtain total loss;
and the terminal equipment iteratively updates the weight parameters of the neural network model according to the total loss to obtain the depth estimation model.
7. The image processing method according to claim 6, wherein the first depth value is an intermediate value of a first depth range; the first depth range is the depth range corresponding to the first interval;
the second depth value is the middle value of the second depth range; the second depth range is the depth range corresponding to the second interval.
8. The image processing method according to any one of claims 1 to 7, wherein before the terminal device inputs the image to be processed into the depth estimation model, the method further comprises:
the terminal equipment receives special effect selection operation;
the terminal equipment responds to the special effect selection operation to display icons of at least one virtual object;
the terminal equipment receives triggering operation aiming at an icon of a virtual object to be added;
the terminal equipment responds to the triggering operation to mark the virtual object to be added on an interface of the terminal equipment;
The terminal equipment receives a position selection operation aiming at a target position in the image to be processed;
the terminal equipment acquires depth information of the image to be processed by using the depth estimation model, and the method comprises the following steps:
and the terminal equipment responds to the position selection operation and acquires the depth information of the image to be processed by utilizing the depth estimation model.
9. The image processing method according to claim 8, characterized in that the method further comprises:
the terminal equipment fuses the image to be processed, the depth information of the image to be processed, the target position and the virtual object to be added to obtain a target image; the target image is an image after the virtual object to be added is added to the target position of the image to be processed;
and the terminal equipment displays the target image.
10. An image processing device is characterized by comprising a storage module and a processing module;
the storage module is used for storing the depth estimation model; the depth estimation model is used for enabling the depth information of the two-dimensional image to be respectively converged on a plurality of target depth values; the target depth value corresponds to the interval in the two-dimensional image one by one; the interval is related to a depth range of an object in the two-dimensional image; the depth ranges of different ones of the zones are different;
The processing module is used for: acquiring the two-dimensional image by using an image acquisition device; inputting the two-dimensional image into the depth estimation model; and acquiring the depth information of the image to be processed by using the depth estimation model.
11. An electronic device, comprising: a memory for storing a computer program and a processor for executing the computer program to perform the image processing method as claimed in any one of claims 1 to 9.
12. A computer-readable storage medium storing instructions that, when executed, cause a computer to perform the image processing method of any one of claims 1 to 9.