CN116109828A - Image processing method and electronic device

Info

Publication number: CN116109828A (granted publication: CN116109828B)
Authority: CN (China)
Prior art keywords: image, depth, mask, region, area
Legal status: Granted, Active
Application number: CN202310291200.6A
Other languages: Chinese (zh)
Other versions: CN116109828B
Inventor: Gao Xu (高旭)
Assignee (original and current): Honor Device Co Ltd
Application filed by Honor Device Co Ltd
Priority: CN202310291200.6A (granted as CN116109828B); related application CN202310983523.1A (published as CN117173405A)

Classifications

    • G06V10/26: Image or video recognition or understanding; image preprocessing; segmentation of patterns in the image field; cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; detection of occlusion
    • G06N3/08: Computing arrangements based on biological models; neural networks; learning methods
    • G06T7/50: Image analysis; depth or shape recovery
    • G06T2207/10024: Image acquisition modality; color image
    • G06T2207/10028: Range image; depth image; 3D point clouds
    • G06T2207/20081: Special algorithmic details; training; learning
    • G06T2207/20084: Special algorithmic details; artificial neural networks [ANN]
    • G06T2207/20212: Special algorithmic details; image combination
    • G06T2207/20216: Special algorithmic details; image averaging

Abstract

Embodiments of this application relate to the field of data processing and provide an image processing method and an electronic device.

Description

Image processing method and electronic device
Technical Field
The present application relates to the field of data processing, and more particularly, to an image processing method and an electronic apparatus.
Background
Augmented reality (AR) technology renders computer-generated virtual objects in a real scene so that the virtual objects and the real scene are fused. AR technology is therefore often used to make photography more engaging.
For example, a virtual image generated by AR technology can be fused with a real portrait image obtained by actual shooting to obtain an engaging AR image, or fused with a real portrait video obtained by actual shooting to obtain an engaging AR video. For ease of understanding, generation of an AR image is described below as an example. When the virtual image and the real portrait are fused, the depth information of the portrait usually needs to be determined first. To determine it, a portrait semantic mask image is acquired, and the depth of the portrait is determined from the masks in that semantic mask image. In some possible cases, however, the image contains multiple portraits that partially overlap one another. A single mask in the resulting portrait semantic mask image then covers several portraits, and the depth computed for that mask is the average depth of those portraits. In other words, the portrait depth obtained from the portrait semantic mask image is inaccurate. This causes the virtual image to be misplaced in the generated AR image, and the resulting AR image is of poor quality.
How to improve the quality of AR images in scenes where multiple people are composited with a virtual image therefore becomes an urgent problem.
Disclosure of Invention
This application provides an image processing method that can improve the quality of AR images in scenes where multiple people are composited with a virtual image.
In a first aspect, there is provided an image processing method including:
acquiring a first image, a first depth image and a first mask image, wherein the first image comprises an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same category, the first depth image is used for representing depth information of the first object and the second object, and the first mask image is used for identifying the image area of the first object and the image area of the second object in the first image;
determining a first pixel point in a first image by adopting a first preset template, wherein the first pixel point is a pixel point in an image area of a first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and electronic equipment;
performing super-pixel segmentation on the first image to obtain a plurality of areas, wherein the plurality of areas comprise a first area, and the first area comprises first pixel points;
determining a first region set of the plurality of regions based on first depth information, the first depth information being used to characterize a depth of the first region, a difference between a depth of each region in the first region set and the depth of the first region being less than a first threshold, the first region set comprising the first region;
obtaining a second mask image based on the first region set, wherein the second mask image is used for identifying an image region of a first object in the first image;
and obtaining the augmented reality AR image based on the second mask image and the first image.
In the image processing method provided in this embodiment, a first image, a first depth image and a first mask image are obtained first. A first pixel point in the first image is then determined using a first preset template, and the first image is super-pixel segmented into a plurality of regions. A first region set among the plurality of regions is determined based on first depth information, a second mask image is obtained based on the first region set, and an augmented reality (AR) image is obtained based on the second mask image and the first image. Here, the first image includes an image region of a first object and an image region of a second object that partially overlap; the first object and the second object are objects to be photographed of the same category; the first depth image characterizes the depth information of the two objects; and the first mask image identifies their image regions in the first image. The first pixel point lies within the image region of the first object; the first preset template is related to a first distance, which is the distance between the first object and the electronic device. The plurality of regions include a first region containing the first pixel point; the first depth information characterizes the depth of the first region; the difference between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold, and the first region set includes the first region. The second mask image identifies the image region of the first object in the first image. Because the first pixel point of the first object's image region is located with the first preset template, the first region set determined from the depth of the first region containing that pixel belongs to a single object and does not include image regions of other objects. The mask in the second mask image therefore indicates the image region of the first object alone, which avoids determining the depth of the first object from the average depth of at least two objects, improves the accuracy of the first object's depth, prevents misplacement of the virtual image in the generated AR image, and improves the quality of the generated AR image. An illustrative sketch of these steps follows.
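As an illustration of the core steps (super-pixel segmentation plus depth-guided grouping of regions around the first pixel point), a minimal Python sketch is given below. It assumes SLIC superpixels, the mean depth of a superpixel as the region depth, and a seed pixel already obtained by the template matching described later; all function names, parameters and thresholds are illustrative assumptions, not the patent's actual implementation.

```python
import numpy as np
from skimage.segmentation import slic

def second_mask_for_object(rgb, depth, first_mask, seed_yx, depth_thresh=0.3):
    """Build a mask covering only the object that contains the seed pixel.

    seed_yx: (row, col) of the first pixel point inside the first object's region.
    """
    # super-pixel segmentation of the first image into a plurality of regions
    labels = slic(rgb, n_segments=600, compactness=10, start_label=1)

    first_region = labels[seed_yx]                       # region containing the first pixel point
    first_depth = depth[labels == first_region].mean()   # depth of the first region

    second_mask = np.zeros(depth.shape, dtype=np.uint8)
    for lab in np.unique(labels):
        region = labels == lab
        # keep regions that lie inside the original semantic mask and whose
        # depth differs from the first region's depth by less than the threshold
        if first_mask[region].mean() > 0.5 and \
           abs(depth[region].mean() - first_depth) < depth_thresh:
            second_mask[region] = 1
    return second_mask
```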
With reference to the first aspect, in an embodiment of the first aspect, if the first object and the second object are human, the first preset template is a head contour template, and the first distance is inversely proportional to an area of the first preset template.
With reference to the first aspect, in an embodiment of the first aspect, the method further includes:
acquiring pose information of electronic equipment;
the shape of the head contour template is adjusted based on the pose information.
In this embodiment, the first preset template is adjusted according to the pose information of the electronic device, so that the orientation of the adjusted template is close to the shooting angle of the first image. The head contour in the adjusted template then matches the head contour of the portrait in the first image more closely, the first pixel point can be located more accurately, and the accuracy of the second mask image obtained from the first pixel point, as well as that of the AR image obtained from the second mask image and the first image, is improved. A sketch of this adjustment is shown below.
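The following minimal sketch assumes the device roll angle (in degrees) has already been derived from the gyroscope or acceleration sensor; the patent does not prescribe this exact transformation, so the rotation-only adjustment is an assumption.

```python
import cv2

def adjust_head_template(template, roll_deg):
    """Rotate the binary head-contour template so that its orientation
    approximates the shooting angle of the first image."""
    h, w = template.shape[:2]
    rot = cv2.getRotationMatrix2D((w / 2.0, h / 2.0), roll_deg, 1.0)
    # nearest-neighbour keeps the template binary after rotation
    return cv2.warpAffine(template, rot, (w, h), flags=cv2.INTER_NEAREST)
```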
With reference to the first aspect, in an embodiment of the first aspect, the method further includes:
dividing each depth value in the first depth image into a plurality of depth intervals according to a preset interval range;
determining a target depth interval with the largest number of pixels in the plurality of depth intervals;
and determining a first preset template according to the depth range of the target depth interval.
In this embodiment, the first preset template is related to the distance between the first object and the electronic device, i.e. the first distance: the larger the first distance, the smaller the area of the head contour template, and the smaller the first distance, the larger its area. Selecting the corresponding first preset template directly according to the first distance therefore makes the size of the template better match the size of the first object, which improves the efficiency of determining the first pixel point from the first preset template and, in turn, of determining the second mask image from the first pixel point. A sketch of deriving the first distance from the depth histogram is given below.
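A minimal sketch of deriving the first distance, and hence the template scale, from the first depth image follows; the 0.5 m bin width and the inverse scale mapping are assumptions made for illustration.

```python
import numpy as np

def pick_template_scale(depth, bin_width=0.5, base_scale_at_1m=1.0):
    """Histogram the depth image, take the interval with the most pixels as the
    target depth interval, and map its centre (the first distance) to a scale."""
    valid = depth[depth > 0]                          # ignore invalid/zero depth
    bins = np.arange(valid.min(), valid.max() + bin_width, bin_width)
    hist, edges = np.histogram(valid, bins=bins)
    k = int(np.argmax(hist))                          # target depth interval
    first_distance = 0.5 * (edges[k] + edges[k + 1])  # centre of that interval
    # farther subject -> smaller head template (area inversely related to distance)
    return base_scale_at_1m / first_distance
```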
With reference to the first aspect, in an embodiment of the first aspect, determining, using a first preset template, a first pixel point in the first image includes:
acquiring an edge map corresponding to the first mask image, wherein the edge map is a binary map formed by the outline of the mask in the first mask image;
and carrying out convolution operation on the first preset template and the edge map to obtain a first pixel point.
The edge map is a binary map formed by outlines of masks in the first mask image. The first pixel point is a pixel point in an image area of the first object.
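The following sketch illustrates this step, assuming the edge map is produced with a Canny operator on the first mask image and that the "convolution" is implemented as a correlation filter; the location of the strongest response is taken as the first pixel point. These operator choices are assumptions.

```python
import cv2
import numpy as np

def find_first_pixel(first_mask, head_template):
    """Correlate the head contour template with the mask's edge map and return
    the pixel with the strongest response (a pixel of the first object)."""
    # edge map: a binary image formed by the contours of the masks
    edges = cv2.Canny((first_mask.astype(np.uint8) * 255), 100, 200)
    response = cv2.filter2D(edges.astype(np.float32), -1,
                            head_template.astype(np.float32))
    y, x = np.unravel_index(int(np.argmax(response)), response.shape)
    return y, x
```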
With reference to the first aspect, in an embodiment of the first aspect, the method further includes:
obtaining a depth difference value of the same mask in the first mask image;
and if the depth difference value is larger than a first preset threshold value, determining that the first image comprises the first object and the second object.
In the image processing method provided in this embodiment, the depth difference within the same mask of the first mask image is obtained, and if the depth difference is larger than a first preset threshold, the first image is determined to contain both the first object and the second object. Whether the first image contains two objects at the same time is thus decided from the depth difference within a single mask, which makes this detection more intelligent. A sketch of the check is given below.
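A minimal sketch of this check follows; using the 5th/95th depth percentiles and the concrete threshold value are assumptions added for robustness to depth noise.

```python
import numpy as np

def mask_covers_two_objects(depth, mask, first_preset_threshold=0.8):
    """Return True if the depth spread inside one mask suggests that the mask
    covers at least two overlapping subjects."""
    vals = depth[mask > 0]
    depth_difference = np.percentile(vals, 95) - np.percentile(vals, 5)
    return depth_difference > first_preset_threshold
```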
With reference to the first aspect, in an embodiment of the first aspect, the method further includes:
determining a second pixel point in the first image by adopting a second preset template, wherein the second pixel point is a pixel point in an image area of a second object, the second preset template is related to a second distance, and the second distance is the distance between the second object and the electronic equipment;
the plurality of regions further comprises a second region, and the second region comprises second pixel points;
determining a second set of regions of the plurality of regions based on second depth information, the second depth information being used to characterize a depth of the second region, a difference between a depth of each region of the second set of regions and a depth of the second region being less than a second threshold, the second set of regions comprising the second region;
obtaining a second mask image based on the first set of regions, comprising:
and obtaining a second mask image based on the first region set and the second region set.
In the image processing method provided in this embodiment, after the first preset template is determined from the depth range of the target depth interval, a second preset template for the second object is further determined. In addition to convolving the edge map with the first preset template, the edge map can also be convolved with the second preset template to obtain a second pixel point in the image region of the second object. After the first image is super-pixel segmented, the resulting regions include both the first region and the second region; the mask of the first object is obtained from the first region set and the mask of the second object from the second region set, so the second mask image identifies the image regions of the first object and the second object at the same time. The AR image can then be generated based on the image regions of both objects, which further improves the display quality of the generated AR image. A sketch of assembling such a second mask image is shown below.
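A minimal sketch of assembling the second mask image from the two recovered objects follows; the label values (1 and 2) are illustrative, and each input is assumed to be a boolean or 0/1 mask produced by the depth-guided grouping shown earlier.

```python
import numpy as np

def build_second_mask_image(first_object_mask, second_object_mask):
    """Combine the per-object masks into one mask image that identifies the
    image regions of the first and second objects separately."""
    second_mask_image = np.zeros(first_object_mask.shape, dtype=np.uint8)
    second_mask_image[first_object_mask > 0] = 1   # image region of the first object
    second_mask_image[second_object_mask > 0] = 2  # image region of the second object
    return second_mask_image
```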
With reference to the first aspect, in an embodiment of the first aspect, determining a second pixel point in the first image using a second preset template includes:
acquiring an edge map corresponding to the first mask image, wherein the edge map is a binary map formed by the outline of the object in the first mask image;
and carrying out convolution operation on the second preset template and the edge map to obtain a second pixel point.
With reference to the first aspect, in an embodiment of the first aspect, the first depth image is acquired by an image capturing device in the electronic device.
In the image processing method provided in this embodiment, the first depth image is acquired by a camera device, for example a time-of-flight (TOF) camera, in the electronic device. This makes acquiring the first depth image more convenient and further improves the efficiency of obtaining the AR image from the first depth image, the first image and the first mask image.
With reference to the first aspect, in an embodiment of the first aspect, the first depth image is obtained by inputting the first image into a preset depth estimation model, and the depth estimation model is a neural network model.
In the image processing method provided in this embodiment, the first depth image is obtained by inputting the first image into a preset depth estimation model, which is a neural network model. An electronic device that is not equipped with a TOF camera can therefore still obtain the corresponding first depth image from the first image and the preset depth estimation model, which improves the convenience of acquiring the first depth image and reduces the cost of the electronic device. An illustrative sketch is given below.
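The sketch below uses the publicly available MiDaS monocular depth model purely as a stand-in; the patent does not specify its preset depth estimation model, and the image path is illustrative.

```python
import cv2
import torch

# load a small monocular depth model and its matching preprocessing (assumed stand-in)
midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")

img = cv2.cvtColor(cv2.imread("first_image.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    pred = midas(transforms.small_transform(img))
    # resize the prediction back to the resolution of the first image
    first_depth = torch.nn.functional.interpolate(
        pred.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False).squeeze().numpy()
```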
With reference to the first aspect, in one embodiment of the first aspect, the first mask image is obtained by inputting the first image into a preset semantic segmentation model, and the semantic segmentation model is a neural network model.
In the image processing method provided in this embodiment, the first mask image is obtained by inputting the first image into a preset semantic segmentation model, which is a neural network model. The electronic device can therefore obtain the corresponding first mask image from the first image and the preset semantic segmentation model, which improves both the convenience and the accuracy of acquiring the first mask image. An illustrative sketch is given below.
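The sketch below uses torchvision's DeepLabV3 (Pascal VOC classes, where "person" is class 15) purely as a stand-in; the patent does not specify its preset semantic segmentation model, and the image path is illustrative.

```python
import torch
from torchvision import transforms
from torchvision.models.segmentation import deeplabv3_resnet50
from PIL import Image

model = deeplabv3_resnet50(weights="DEFAULT").eval()
preprocess = transforms.Compose([
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

img = Image.open("first_image.jpg").convert("RGB")
with torch.no_grad():
    out = model(preprocess(img).unsqueeze(0))["out"][0]
# 1 = portrait pixels, 0 = background; different portraits are not distinguished
first_mask = (out.argmax(0) == 15).numpy().astype("uint8")
```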
In a second aspect, there is provided an image processing apparatus comprising means for performing any of the methods of the first aspect. The device can be a server, terminal equipment or a chip in the terminal equipment. The apparatus may include an input unit and a processing unit.
When the apparatus is a terminal device, the processing unit may be a processor, and the input unit may be a communication interface; the terminal device may further comprise a memory for storing computer program code which, when executed by the processor, causes the terminal device to perform any of the methods of the first aspect.
When the device is a chip in the terminal device, the processing unit may be a processing unit inside the chip, and the input unit may be an output interface, a pin, a circuit, or the like; the chip may also include memory, which may be memory within the chip (e.g., registers, caches, etc.), or memory external to the chip (e.g., read-only memory, random access memory, etc.); the memory is for storing computer program code which, when executed by the processor, causes the chip to perform any of the methods of the first aspect.
In one possible implementation, the memory is used to store computer program code; a processor executing the computer program code stored in the memory, the processor, when executed, configured to perform: acquiring a first image, a first depth image and a first mask image, wherein the first image comprises an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same category, the first depth image is used for representing depth information of the first object and the second object, and the first mask image is used for identifying the image area of the first object and the image area of the second object in the first image; determining a first pixel point in a first image by adopting a first preset template, wherein the first pixel point is a pixel point in an image area of a first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and electronic equipment; performing super-pixel segmentation on the first image to obtain a plurality of areas, wherein the plurality of areas comprise a first area, and the first area comprises first pixel points; determining a first region set of the plurality of regions based on first depth information, the first depth information being used to characterize a depth of the first region, a difference between a depth of each region in the first region set and the depth of the first region being less than a first threshold, the first region set comprising the first region; obtaining a second mask image based on the first region set, wherein the second mask image is used for identifying an image region of a first object in the first image; and obtaining the augmented reality AR image based on the second mask image and the first image.
In a third aspect, there is provided a computer-readable storage medium storing computer program code which, when executed by an image processing apparatus, causes the image processing apparatus to perform any one of the image processing methods of the first aspect.
In a fourth aspect, there is provided a computer program product comprising: computer program code which, when run by an image processing apparatus, causes the image processing apparatus to perform any one of the image processing methods of the first aspect.
In summary, the image processing method and the electronic device provided in the embodiments of this application first obtain a first image, a first depth image and a first mask image; determine a first pixel point in the first image with a first preset template; super-pixel segment the first image into a plurality of regions; determine a first region set among those regions based on first depth information; obtain a second mask image based on the first region set; and obtain an augmented reality AR image based on the second mask image and the first image. Because the first pixel point of the first object's image region is located with the first preset template, the first region set determined from the depth of the first region containing that pixel belongs to a single object and contains no image regions of other objects. The mask in the second mask image therefore indicates the image region of the first object alone, which avoids determining the depth of the first object from the average depth of at least two objects, improves the accuracy of that depth, prevents misplacement of the virtual image in the generated AR image, and improves the quality of the generated AR image.
Drawings
FIG. 1 is a schematic illustration of an RGB image and mask map;
FIG. 2 is a schematic diagram of an AR image generated from a multi-person image by the current approach;
FIG. 3 is a schematic diagram of a hardware system suitable for use with the electronic device of the present application;
FIG. 4 is a schematic diagram of a software system suitable for use with the electronic device of the present application;
fig. 5 is a schematic diagram of an application scenario provided in an embodiment of the present application;
fig. 6 is a schematic flow chart of an image processing method according to an embodiment of the present application;
fig. 7 is a schematic flow chart of acquiring a first depth image and a first mask image according to an embodiment of the present application;
FIG. 8 is a schematic diagram of a first mask image according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a head profile template provided in an embodiment of the present application;
FIG. 10 is a schematic diagram of a first mask image and an edge map provided in an embodiment of the present application;
FIG. 11 is a schematic diagram of convolving an edge map with a first preset template according to an embodiment of the present application;
FIG. 12 is a schematic diagram of a convolution of an edge map with a first preset template of different sizes according to an embodiment of the present application;
FIG. 13 is a schematic illustration of a first image for superpixel segmentation provided in an embodiment of the present application;
FIG. 14 is a flowchart of another image processing method according to an embodiment of the present application;
fig. 15 is a flowchart of another image processing method according to an embodiment of the present application;
FIG. 16 is a schematic view of an image processing apparatus provided herein;
fig. 17 is a schematic diagram of an electronic device for image processing provided herein.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application. In the description of the embodiments of this application, unless otherwise indicated, "/" means "or"; for example, A/B may represent A or B. "And/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist; for example, "A and/or B" may mean: only A exists, both A and B exist, or only B exists. In addition, in the description of the embodiments of this application, "a plurality of" means two or more.
The terms "first," "second," "third," and the like, are used below for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first", "a second", or a third "may explicitly or implicitly include one or more such feature.
For ease of understanding, some concepts related to the embodiments of this application are first described by way of example.
1. Depth map
A depth map is an image in which the pixel value of each point is the distance from the image capture device to that point in the scene.
2. Mask (mask)
In the field of image processing, the area where a target object is located is assigned a value that differs from the values assigned to the other areas of the image, so that the target object can be distinguished from the rest of the image; this facilitates subsequent rendering of the image. Illustratively, the area where the target object is located is assigned 1 and the other areas of the image are assigned 0. The target object may be a person, an animal, a plant, a building, and so on.
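A tiny illustration of the mask concept, with an assumed rectangular object region:

```python
import numpy as np

image = np.zeros((6, 8, 3), dtype=np.uint8)      # dummy RGB image
mask = np.zeros(image.shape[:2], dtype=np.uint8)
mask[2:5, 3:7] = 1                               # area occupied by the target object

object_pixels = image[mask == 1]                 # pixels that belong to the object
background_pixels = image[mask == 0]             # everything else
```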
3. Semantic segmentation
Semantic segmentation refers to a method of separating a target object from other regions in an image. Unlike instance segmentation, semantic segmentation merely distinguishes between a target object and other regions, and in the case of multiple target objects, it is not possible to distinguish between different objects.
For example, semantic segmentation can separate a person from the background regions of an image. As shown in (a) of fig. 1, the RGB image contains two people, portrait 1 and portrait 2. Semantic segmentation can only distinguish the portraits from the background, producing the mask map shown in (b) of fig. 1, in which there is no clear boundary between portrait 1 and portrait 2, so the two portraits cannot be told apart.
It should be appreciated that generally the computational effort required for semantic segmentation is lower and the computational effort required for instance segmentation is higher. Therefore, when image processing is performed in a terminal device with limited computing power such as a mobile phone and a tablet computer, a semantic segmentation mode is generally adopted.
4. Super pixel segmentation
In the field of computer vision, superpixel segmentation refers to the process of subdividing a digital image into a plurality of image sub-regions (a collection of pixel points, also called superpixels). Super-pixels are small areas composed of a series of pixel points that are adjacent in position and similar in color, brightness, texture, etc. These small areas mostly retain the effective information for further image segmentation and do not generally destroy the boundary information of the objects in the image.
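A short illustration of super-pixel segmentation is given below, using the SLIC algorithm from scikit-image; the patent does not prescribe a particular super-pixel method, so SLIC and its parameters are assumptions.

```python
from skimage import data
from skimage.segmentation import slic, mark_boundaries

rgb = data.astronaut()                                  # any RGB test image
labels = slic(rgb, n_segments=400, compactness=10, start_label=1)
overlay = mark_boundaries(rgb, labels)                  # superpixel boundaries drawn on the image
print("number of superpixels:", labels.max())
```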
At present, when a virtual image generated by AR technology is fused with a real image, the depth of the virtual image is usually determined according to the depth of the portraits in the image, and the virtual image is then placed at a suitable depth so that it blends well with the portraits in the AR image. In one possible case, an image contains several people with overlapping areas between them, as illustrated in (a) of fig. 2: the image contains person 1 and person 2, and an overlapping area exists between them. Person 1 is closer to the camera and person 2 is farther away, i.e. the depth difference between them is large. To fuse avatar 3 into the image, the depth maps of person 1 and person 2 are usually obtained first, and then the average of their depths, which is what placing avatar 3 in the AR image has to refer to. In one possible case, the depth map obtained by semantic segmentation does not distinguish person 1 from person 2; the corresponding depth map is shown in (b) of fig. 2, and the two people stick together. The average depth of person 1 and person 2 then differs considerably from the depth of either person, so avatar 3 is easily misplaced and the finally generated AR image is of poor quality. For example, the average depth of person 1 and person 2 is 2.55 m, where the true depth of person 1 is 2 m and the true depth of person 2 is 3.1 m; when the avatar is placed at a depth of 2.6 m, it appears misplaced, as shown in (c) of fig. 2. In another possible case, the acquired depth map is simply partitioned along the edges of the people, as shown in (d) of fig. 2. In that case, when the avatar is placed in the image, part of a person's torso may be blocked by the avatar, again resulting in a poor AR image, as illustrated in (e) of fig. 2.
In view of this, an embodiment of this application provides an image processing method. A first image, a first depth image and a first mask image are obtained first; a first pixel point in the first image is determined with a first preset template; the first image is super-pixel segmented into a plurality of regions; a first region set among the plurality of regions is determined based on first depth information; a second mask image is obtained based on the first region set; and an augmented reality AR image is obtained based on the second mask image and the first image. The first image includes an image region of a first object and an image region of a second object that partially overlap, the first object and the second object being objects to be photographed of the same category. The first depth image characterizes the depth information of the first object and the second object, and the first mask image identifies their image regions in the first image. The first pixel point lies within the image region of the first object; the first preset template is related to a first distance, which is the distance between the first object and the electronic device. The plurality of regions include a first region that contains the first pixel point; the first depth information characterizes the depth of the first region; the difference between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold, and the first region set includes the first region. The second mask image identifies the image region of the first object in the first image. Because the first pixel point of the first object's image region is located with the first preset template, the first region set determined from the depth of the first region containing that pixel belongs to a single object and contains no image regions of other objects. The mask in the second mask image therefore indicates the image region of the first object alone, which avoids determining the depth of the first object from the average depth of at least two objects, improves the accuracy of that depth, prevents misplacement of the virtual image in the generated AR image, and improves the quality of the generated AR image.
The image processing method provided by the embodiment of the application can be applied to electronic equipment. Optionally, the electronic device includes a terminal device, which may also be referred to as a terminal (terminal), a User Equipment (UE), a Mobile Station (MS), a Mobile Terminal (MT), and so on. The terminal device may be a mobile phone, a smart television, a wearable device, a tablet (Pad), a computer with wireless transceiving function, a Virtual Reality (VR) terminal device, an augmented reality (augmented reality, AR) terminal device, a wireless terminal in industrial control (industrial control), a wireless terminal in unmanned driving (self-driving), a wireless terminal in teleoperation (remote medical surgery), a wireless terminal in smart grid (smart grid), a wireless terminal in transportation safety (transportation safety), a wireless terminal in smart city (smart city), a wireless terminal in smart home (smart home), or the like. The embodiment of the application does not limit the specific technology and the specific equipment form adopted by the terminal equipment.
By way of example, fig. 3 shows a schematic structural diagram of the electronic device 100. The electronic device 100 may include a processor 110, an external memory interface 120, an internal memory 121, a universal serial bus (universal serial bus, USB) interface 130, a charge management module 140, a power management module 141, a battery 142, an antenna 1, an antenna 2, a mobile communication module 150, a wireless communication module 160, an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, a sensor module 180, keys 190, a motor 191, an indicator 192, a camera 193, a display 194, and a subscriber identity module (subscriber identification module, SIM) card interface 195, etc. The sensor module 180 may include a pressure sensor 180A, a gyro sensor 180B, an air pressure sensor 180C, a magnetic sensor 180D, an acceleration sensor 180E, a distance sensor 180F, a proximity sensor 180G, a fingerprint sensor 180H, a temperature sensor 180J, a touch sensor 180K, an ambient light sensor 180L, a bone conduction sensor 180M, and the like.
It is to be understood that the structure illustrated in the embodiments of the present application does not constitute a specific limitation on the electronic device 100. In other embodiments of the present application, electronic device 100 may include more or fewer components than shown, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 110 may include one or more processing units, such as: the processor 110 may include an application processor (application processor, AP), a modem processor, a graphics processor (graphics processing unit, GPU), an image signal processor (image signal processor, ISP), a controller, a memory, a video codec, a digital signal processor (digital signal processor, DSP), a baseband processor, and/or a neural network processor (neural-network processing unit, NPU), etc. Wherein the different processing units may be separate devices or may be integrated in one or more processors.
The controller may be a neural hub and a command center of the electronic device 100, among others. The controller can generate operation control signals according to the instruction operation codes and the time sequence signals to finish the control of instruction fetching and instruction execution.
A memory may also be provided in the processor 110 for storing instructions and data. In some embodiments, the memory in the processor 110 is a cache memory. The memory may hold instructions or data that the processor 110 has just used or recycled. If the processor 110 needs to reuse the instruction or data, it can be called directly from the memory. Repeated accesses are avoided and the latency of the processor 110 is reduced, thereby improving the efficiency of the system.
In some embodiments, the processor 110 may include one or more interfaces. The interfaces may include an integrated circuit (inter-integrated circuit, I2C) interface, an integrated circuit built-in audio (inter-integrated circuit sound, I2S) interface, a pulse code modulation (pulse code modulation, PCM) interface, a universal asynchronous receiver transmitter (universal asynchronous receiver/transmitter, UART) interface, a mobile industry processor interface (mobile industry processor interface, MIPI), a general-purpose input/output (GPIO) interface, a subscriber identity module (subscriber identity module, SIM) interface, and/or a universal serial bus (universal serial bus, USB) interface, among others.
The I2C interface is a bi-directional synchronous serial bus comprising a serial data line (SDA) and a serial clock line (derail clock line, SCL). In some embodiments, the processor 110 may contain multiple sets of I2C buses. The processor 110 may be coupled to the touch sensor 180K, charger, flash, camera 193, etc., respectively, through different I2C bus interfaces. For example: the processor 110 may be coupled to the touch sensor 180K through an I2C interface, such that the processor 110 communicates with the touch sensor 180K through an I2C bus interface to implement a touch function of the electronic device 100.
The UART interface is a universal serial data bus for asynchronous communications. The bus may be a bi-directional communication bus. It converts the data to be transmitted between serial communication and parallel communication. In some embodiments, a UART interface is typically used to connect the processor 110 with the wireless communication module 160. For example: the processor 110 communicates with a bluetooth module in the wireless communication module 160 through a UART interface to implement a bluetooth function. In some embodiments, the audio module 170 may transmit an audio signal to the wireless communication module 160 through a UART interface, to implement a function of playing music through a bluetooth headset.
The MIPI interface may be used to connect the processor 110 to peripheral devices such as a display 194, a camera 193, and the like. The MIPI interfaces include camera serial interfaces (camera serial interface, CSI), display serial interfaces (display serial interface, DSI), and the like. In some embodiments, processor 110 and camera 193 communicate through a CSI interface to implement the photographing functions of electronic device 100. The processor 110 and the display 194 communicate via a DSI interface to implement the display functionality of the electronic device 100.
The GPIO interface may be configured by software. The GPIO interface may be configured as a control signal or as a data signal. In some embodiments, a GPIO interface may be used to connect the processor 110 with the camera 193, the display 194, the wireless communication module 160, the audio module 170, the sensor module 180, and the like. The GPIO interface may also be configured as an I2C interface, an I2S interface, a UART interface, an MIPI interface, etc.
It should be understood that the interfacing relationship between the modules illustrated in the embodiments of the present application is only illustrative, and does not limit the structure of the electronic device 100. In other embodiments of the present application, the electronic device 100 may also use different interfacing manners, or a combination of multiple interfacing manners in the foregoing embodiments.
The electronic device 100 implements display functions through a GPU, a display screen 194, an application processor, and the like. The GPU is a microprocessor for image processing, and is connected to the display 194 and the application processor. The GPU is used to perform mathematical and geometric calculations for graphics rendering. Processor 110 may include one or more GPUs that execute program instructions to generate or change display information.
The display screen 194 is used to display images, videos, and the like. The display 194 includes a display panel. The display panel may employ a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a Micro-OLED, a quantum dot light-emitting diode (QLED), or the like. In some embodiments, the electronic device 100 may include 1 or N display screens 194, N being a positive integer greater than 1.
The electronic device 100 may implement photographing functions through an ISP, a camera 193, a video codec, a GPU, a display screen 194, an application processor, and the like.
The ISP is used to process data fed back by the camera 193. For example, when photographing, the shutter is opened, light is transmitted to the camera photosensitive element through the lens, the optical signal is converted into an electric signal, and the camera photosensitive element transmits the electric signal to the ISP for processing and is converted into an image visible to naked eyes. ISP can also optimize the noise, brightness and skin color of the image. The ISP can also optimize parameters such as exposure, color temperature and the like of a shooting scene. In some embodiments, the ISP may be provided in the camera 193.
The camera 193 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element may be a charge coupled device (charge coupled device, CCD) or a Complementary Metal Oxide Semiconductor (CMOS) phototransistor. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to the ISP to be converted into a digital image signal. The ISP outputs the digital image signal to the DSP for processing. The DSP converts the digital image signal into an image signal in a standard RGB, YUV, or the like format. In some embodiments, electronic device 100 may include 1 or N cameras 193, N being a positive integer greater than 1.
The digital signal processor is used for processing digital signals, and can process other digital signals besides digital image signals. For example, when the electronic device 100 selects a frequency bin, the digital signal processor is used to fourier transform the frequency bin energy, or the like.
Video codecs are used to compress or decompress digital video. The electronic device 100 may support one or more video codecs. In this way, the electronic device 100 may play or record video in a variety of encoding formats, such as moving picture experts group (MPEG)-1, MPEG-2, MPEG-3, MPEG-4, and so on.
The NPU is a neural-network (NN) computing processor, and can rapidly process input information by referencing a biological neural network structure, for example, referencing a transmission mode between human brain neurons, and can also continuously perform self-learning. Applications such as intelligent awareness of the electronic device 100 may be implemented through the NPU, for example: image recognition, face recognition, speech recognition, text understanding, etc.
The electronic device 100 may implement audio functions through an audio module 170, a speaker 170A, a receiver 170B, a microphone 170C, an earphone interface 170D, an application processor, and the like. Such as music playing, recording, etc.
The audio module 170 is used to convert digital audio information into an analog audio signal output and also to convert an analog audio input into a digital audio signal. The audio module 170 may also be used to encode and decode audio signals. In some embodiments, the audio module 170 may be disposed in the processor 110, or a portion of the functional modules of the audio module 170 may be disposed in the processor 110.
The pressure sensor 180A is used to sense a pressure signal and may convert the pressure signal into an electrical signal. In some embodiments, the pressure sensor 180A may be disposed on the display screen 194. There are various types of pressure sensors 180A, such as resistive pressure sensors, inductive pressure sensors, capacitive pressure sensors, and the like. A capacitive pressure sensor may comprise at least two parallel plates made of conductive material; the capacitance between the electrodes changes when a force is applied to the pressure sensor 180A, and the electronic device 100 determines the strength of the pressure from the change in capacitance. When a touch operation is applied to the display screen 194, the electronic device 100 detects the intensity of the touch operation according to the pressure sensor 180A. The electronic device 100 may also calculate the location of the touch based on the detection signal of the pressure sensor 180A. In some embodiments, touch operations that act on the same touch location but with different touch operation intensities may correspond to different operation instructions. For example, when a touch operation whose intensity is smaller than a first pressure threshold acts on the Messages application icon, an instruction for viewing a short message is executed; when a touch operation whose intensity is greater than or equal to the first pressure threshold acts on the Messages application icon, an instruction for creating a new short message is executed.
The gyro sensor 180B may be used to determine the motion posture of the electronic device 100. In some embodiments, the angular velocity of the electronic device 100 about three axes (i.e., the x, y, and z axes) may be determined by the gyro sensor 180B. The gyro sensor 180B may be used for anti-shake during photographing. For example, when the shutter is pressed, the gyro sensor 180B detects the shake angle of the electronic device 100, calculates the distance that the lens module needs to compensate according to the angle, and lets the lens counteract the shake of the electronic device 100 through reverse motion, thereby realizing anti-shake. The gyro sensor 180B may also be used in navigation and somatosensory (motion-sensing) gaming scenarios. In one possible scenario, the angular velocity of the electronic device acquired by the gyro sensor 180B may be used to adjust the angle of the object in the image. For example, the vertical direction of the object in the image is adjusted to the gravitational direction based on the angular velocity of the electronic device acquired by the gyro sensor 180B.
The air pressure sensor 180C is used to measure air pressure. In some embodiments, electronic device 100 calculates altitude from barometric pressure values measured by barometric pressure sensor 180C, aiding in positioning and navigation.
The magnetic sensor 180D includes a Hall sensor. The electronic device 100 may detect the opening and closing of a flip cover using the magnetic sensor 180D. In some embodiments, when the electronic device 100 is a flip phone, the electronic device 100 may detect the opening and closing of the flip according to the magnetic sensor 180D. Features such as automatic unlocking upon flip opening can then be set according to the detected open or closed state of the cover or of the flip.
The acceleration sensor 180E may detect the magnitude of acceleration of the electronic device 100 in various directions (typically along three axes). The magnitude and direction of gravity may be detected when the electronic device 100 is stationary. It can also be used to recognize the posture of the electronic device and is applied in landscape/portrait switching, pedometer and other applications. In one possible scenario, the pose information of the electronic device acquired by the acceleration sensor 180E may be used to adjust the angle of the object in the image. For example, the vertical direction of the object in the image is adjusted to the gravitational direction based on the pose information of the electronic device acquired by the acceleration sensor 180E.
A distance sensor 180F for measuring a distance. The electronic device 100 may measure the distance by infrared or laser. In some embodiments, the electronic device 100 may range using the distance sensor 180F to achieve quick focus.
The proximity light sensor 180G may include, for example, a Light Emitting Diode (LED) and a light detector, such as a photodiode. The light emitting diode may be an infrared light emitting diode. The electronic device 100 emits infrared light outward through the light emitting diode. The electronic device 100 detects infrared reflected light from nearby objects using a photodiode. When sufficient reflected light is detected, it may be determined that there is an object in the vicinity of the electronic device 100. When insufficient reflected light is detected, the electronic device 100 may determine that there is no object in the vicinity of the electronic device 100. The electronic device 100 can detect that the user holds the electronic device 100 close to the ear by using the proximity light sensor 180G, so as to automatically extinguish the screen for the purpose of saving power. The proximity light sensor 180G may also be used in holster mode, pocket mode to automatically unlock and lock the screen.
The ambient light sensor 180L is used to sense ambient light level. The electronic device 100 may adaptively adjust the brightness of the display 194 based on the perceived ambient light level. The ambient light sensor 180L may also be used to automatically adjust white balance when taking a photograph. Ambient light sensor 180L may also cooperate with proximity light sensor 180G to detect whether electronic device 100 is in a pocket to prevent false touches.
The fingerprint sensor 180H is used to collect a fingerprint. The electronic device 100 may utilize the collected fingerprint feature to unlock the fingerprint, access the application lock, photograph the fingerprint, answer the incoming call, etc.
The touch sensor 180K, also referred to as a "touch panel". The touch sensor 180K may be disposed on the display screen 194, and the touch sensor 180K and the display screen 194 form a touch screen, which is also called a "touch screen". The touch sensor 180K is for detecting a touch operation acting thereon or thereabout. The touch sensor may communicate the detected touch operation to the application processor to determine the touch event type. Visual output related to touch operations may be provided through the display 194. In other embodiments, the touch sensor 180K may also be disposed on the surface of the electronic device 100 at a different location than the display 194.
The keys 190 include a power-on key, a volume key, etc. The keys 190 may be mechanical keys. Or may be a touch key. The electronic device 100 may receive key inputs, generating key signal inputs related to user settings and function controls of the electronic device 100.
The motor 191 may generate a vibration cue. The motor 191 may be used for incoming call vibration alerting as well as for touch vibration feedback. For example, touch operations acting on different applications (e.g., photographing, audio playing, etc.) may correspond to different vibration feedback effects. The motor 191 may also correspond to different vibration feedback effects by touching different areas of the display screen 194. Different application scenarios (such as time reminding, receiving information, alarm clock, game, etc.) can also correspond to different vibration feedback effects. The touch vibration feedback effect may also support customization.
It should be noted that any of the electronic devices mentioned in the embodiments of the present application may include more or fewer modules in the electronic device 100.
The software system of the electronic device 100 may employ a layered architecture, an event driven architecture, a microkernel architecture, a microservice architecture, or a cloud architecture. In this embodiment, taking an Android system with a layered architecture as an example, a software structure of the electronic device 100 is illustrated.
Fig. 4 is a software configuration block diagram of the electronic device 100 of the embodiment of the present application.
The layered architecture of the electronic device 100 divides the software into several layers, each with a distinct role and division of labor. The layers communicate with each other through a software interface. In some embodiments, the software system may be divided into five layers, from top to bottom, an application layer 210, an application framework layer 220, a hardware abstraction layer 230, a driver layer 240, and a hardware layer 250, respectively.
The application layer 210 may include a camera, gallery application, and may also include calendar, phone call, map, navigation, WLAN, bluetooth, music, video, short message, virtual reality AR, etc. applications.
The application framework layer 220 provides an application access interface and programming framework for the applications of the application layer 210.
For example, the application framework layer 220 includes a camera access interface for providing a photographing service of a camera through camera management and a camera device.
Camera management in the application framework layer 220 is used to manage cameras. The camera management may obtain parameters of the camera, for example, determine an operating state of the camera, and the like.
The camera devices in the application framework layer 220 are used to provide a data access interface between the different camera devices and camera management.
The hardware abstraction layer 230 is used to abstract the hardware. For example, the hardware abstraction layer 230 may include a camera hardware abstraction layer and other hardware device abstraction layers; camera device 1, camera device 2, AR device, etc. may be included in the camera hardware abstraction layer. It is understood that the camera device 1, the camera device 2, the AR device, and the like may be functional modules implemented by software or a combination of software and hardware. When the electronic device is started up, or in the running process of the electronic device, a corresponding process is created or a corresponding flow is started, so that the functions corresponding to the camera device 1, the camera device 2 and the AR device are realized. Specifically, the AR device is used for combining the avatar and the actual image to generate an AR image, or generating an AR video.
The camera hardware abstraction layer may be coupled to an algorithm library, and the camera hardware abstraction layer may invoke algorithms in the algorithm library. In the embodiment of the application, the camera algorithm library comprises an image processing algorithm. The image processing algorithm may include a depth estimation model, a semantic segmentation model, and the like.
The depth estimation model may process the RGB image to obtain a depth map corresponding to the RGB image.
The semantic segmentation model can process the RGB image to obtain a mask image corresponding to the RGB image. However, the semantic segmentation model can only distinguish objects from the background in the RGB image; it cannot distinguish between different objects in the RGB image. For example, in a mask map obtained by the semantic segmentation model, portrait pixels take the value 1 and background pixels take the value 0, so different portraits are all assigned 1. Therefore, the mask map obtained by the semantic segmentation model is prone to adhesion, in which different objects merge into one mask, as shown in (b) of fig. 1.
It will be appreciated that, whether in photo preview, video preview, or video recording, the electronic device 100 continuously captures and displays a video stream that includes multiple frames of images. Therefore, in the embodiments of the present application, each image in the video stream is referred to as a frame image, and the nth image in the video stream is referred to as the nth frame image.
Of course, the camera algorithm library may also include other preset algorithms, such as a purple-fringing removal algorithm, a saturation adjustment algorithm, and the like, which is not limited in any way in the embodiments of the present application.
The driver layer 240 is used to provide drivers for different hardware devices. For example, the driver layer may include a camera driver, a digital signal processor driver, and a graphics processor driver.
The hardware layer 250 may include sensors, image signal processors, digital signal processors, graphics processors, and other hardware devices. The sensors may include a sensor 1, a sensor 2, etc., and may also include a depth sensor (TOF) and a multispectral sensor.
The depth map can be directly acquired by TOF.
For example, after the user presses the photographing control, an RGB image is acquired by the sensor 1 at the same time, and a depth map corresponding to the RGB image is acquired by the TOF.
The workflow of the software system of the electronic device 100 is exemplified below in connection with the generation process of an AR image in a photographed scene.
When a user performs a click operation on the control of the AR APP on the touch sensor 180K, the AR APP is woken up by the click operation and enters a photographing preview mode. The AR APP calls each camera device of the camera hardware abstraction layer through the camera access interface and loads options for different AR effects. Illustratively, the camera hardware abstraction layer determines that the current zoom factor is less than 0.6, and therefore may issue an instruction to the camera device driver to invoke the wide-angle camera. At the same time, the camera algorithm library starts to load its algorithms.
After a sensor of the hardware layer is invoked, for example, after the sensor 1 in the wide-angle camera is invoked to acquire an original image, the original image undergoes preliminary processing such as registration. The processed image is returned to the camera hardware abstraction layer through the camera device driver and is then processed with other preset algorithms in the loaded camera algorithm library to obtain an RGB image, a depth map, and a mask map. The RGB image, the depth map, and the mask map are then processed to obtain an AR image.
Alternatively, after a sensor of the hardware layer is invoked, for example, after the sensor 1 in the wide-angle camera is invoked to acquire an original image, the original image undergoes preliminary processing such as registration, and the processed image is returned to the camera hardware abstraction layer through the camera device driver and processed with other preset algorithms in the loaded algorithm library to obtain an RGB image and a mask map. Meanwhile, after the TOF sensor acquires the depth map, the depth map is returned to the camera hardware abstraction layer through the camera device driver. The RGB image, the depth map, and the mask map are then processed with other preset algorithms in the loaded algorithm library to obtain an AR image.
And the camera hardware abstraction layer sends the obtained AR image back to the AR APP for preview display through the camera access interface.
An AR video may be generated by repeating the above AR image generation process.
The application scenario provided by the embodiment of the application is described below with reference to the accompanying drawings.
Fig. 5 is a schematic diagram of an application scenario provided in an embodiment of the present application. As shown in fig. 5 (a), a control 101 of the "AR" App is displayed on the display interface of the mobile phone. The user clicks the control 101 of the "AR" App, as shown in fig. 5 (b). In response to the user's click operation, a first interface as shown in fig. 5 (c) is displayed. The first interface includes an AR type option 102, a photographing control 103, and a displayed AR image 104. The user may select a different AR type from the AR type options 102 and then click the photographing control 103 to generate an AR image 104. Note that, when the photographing control 103 is clicked, the preview image displayed in the first interface is also an AR image. The acquired image includes at least two objects; illustratively, as shown in fig. 5 (c), it includes two persons, namely person 1 and person 2. The distance between person 1 and the camera differs significantly from the distance between person 2 and the camera, and there is an overlapping area between person 1 and person 2. It should be understood that the objects in the acquired image may be portraits, or may be objects such as animals, buildings, or plants, which is not limited in the embodiments of the present application.
The image processing method provided in the embodiments of the present application can also be applied to processing a video to generate an AR video, where the video includes two objects. Continuing with two persons as an example, one frame of the acquired video includes two persons, namely person 1 and person 2. The distance between person 1 and the camera differs significantly from the distance between person 2 and the camera, and there is an overlapping area between person 1 and person 2.
It should be understood that the foregoing is illustrative of an application scenario, and is not intended to limit the application scenario of the present application in any way.
The image processing method provided in the embodiment of the present application is described in detail below with reference to fig. 6 to 15.
Fig. 6 is a flowchart of an image processing method according to an embodiment of the present application, as shown in fig. 6, where the method includes:
s101, acquiring a first image, and a first depth image and a first mask image corresponding to the first image.
The first image may be an RGB image obtained by performing preliminary processing such as registration on an original image captured by the sensor 1, returning the processed image to the camera hardware abstraction layer through the camera device driver, and processing it with other preset algorithms in the loaded algorithm library.
The first depth image corresponding to the first image may be a depth map acquired by the TOF sensor at the same time the camera acquires the original image; alternatively, the first image may be input into a depth estimation model to obtain the first depth image. The embodiments of the present application are not limited in this regard.
Illustratively, as shown in (a) of fig. 7, after the sensor acquires the original image, the original image is returned to the hardware abstraction layer of the camera, and then the algorithm in the algorithm library is loaded for processing, so as to obtain the first image. And then inputting the first image into a depth estimation model to obtain a first depth image, and inputting the first image into a semantic segmentation model to obtain a first mask image.
According to the image processing method provided in the embodiment of the application, the first depth image may be obtained by inputting the first image into a preset depth estimation model, where the preset depth estimation model is a neural network model. In this way, the electronic device can obtain the corresponding first depth image based on the first image and the preset depth estimation model even when no TOF sensor is configured, which improves the convenience of acquiring the first depth image and reduces the cost of the electronic device.
Illustratively, as shown in (b) of fig. 7, after the sensor acquires the original image, the original image is returned to the hardware abstraction layer of the camera, and then the algorithm in the algorithm library is loaded for processing, so as to obtain the first image. Meanwhile, the TOF acquires a first depth image at the same time as the original image. And inputting the first image into a semantic segmentation model to obtain a first mask image.
According to the image processing method provided in the embodiment of the application, the first depth image may instead be acquired by the depth sensor (TOF) in the electronic device, which makes the acquisition of the first depth image more convenient and further improves the efficiency of obtaining the AR image from the first depth image, the first image, and the first mask image.
It should be understood that the first image, the first mask image, and the first depth image are captured for the same scene, and thus there is a correspondence between the first image, the first mask image, and the first depth image. Illustratively, in the region where the person image in the first image is located, there is a corresponding region in the first depth image, and there is a corresponding region in the first mask image as well. For convenience of description, the depth of the object referred to hereinafter refers to a depth value of a corresponding region of the object in the first depth image, the mask of the object refers to a mask of a corresponding region of the object in the first mask image, and the depth of the mask referred to refers to a depth value of a corresponding region of the mask in the first depth image.
S102, obtaining a depth difference value of the same mask in the first mask image.
If the depth difference is greater than the preset threshold 1, S103 is performed.
The depth difference value of the same mask in the first mask image may refer to the depth range in the mask, or may refer to the depth standard deviation in the mask, which is not limited in the embodiment of the present application.
Illustratively, a plurality of masks may be included in the mask image. For example, as shown in fig. 8 (a), the mask image includes the mask 10 and the mask 20, where the mask 10 contains two persons that are adhered together and the mask 20 contains one person. Alternatively, as shown in fig. 8 (b), the mask image includes the mask 30 and the mask 40, where each of the mask 30 and the mask 40 contains one person and no adhesion occurs in either mask.
Alternatively, only one mask may be included in the mask image. As shown in fig. 8 (c), the mask image includes the mask 50, which contains two persons that are adhered together. Or, as shown in fig. 8 (d), the mask image includes the mask 60, which contains a single person with no adhesion.
Wherein the preset threshold 1 corresponds to a first preset threshold.
The preset threshold 1 is, for example, 0.5m.
It should be appreciated that the portrait mask output by the semantic segmentation model can only distinguish people from the background and cannot distinguish different people. Moreover, because a human body is a single whole, one part of the body is generally not much closer to the camera than another part. That is, when there is only one person in a mask, the depth values within the mask do not differ much; when there are at least two persons in the mask, the depth values within the mask may differ greatly.
Illustratively, as shown in fig. 8 (a), the two persons in the mask 10 are at different distances from the camera; therefore, the depth values in the mask 10 differ greatly. The mask 20 contains only one person, whose body is at a roughly uniform distance from the camera, so the depth values in the mask 20 differ only slightly. Therefore, when the depth difference in the mask 10 is greater than the preset threshold 1, it indicates that the mask 10 includes at least two persons; when the depth difference in the mask 20 is smaller than the preset threshold 1, it indicates that the mask 20 contains only one person.
For example, as shown in fig. 8 (b), the mask 30 contains only one person, whose body is at a roughly uniform distance from the camera, so the depth values in the mask 30 differ only slightly. Likewise, the mask 40 contains only one person, so the depth values in the mask 40 also differ only slightly. Therefore, if the depth difference in the mask 30 is smaller than the preset threshold 1, it indicates that the mask 30 contains only one person; similarly, if the depth difference in the mask 40 is smaller than the preset threshold 1, it indicates that the mask 40 contains only one person.
Illustratively, as shown in (c) of fig. 8, the distances between the two persons in the mask 50 and the camera are different, and thus, the depth values in the mask 50 are different and the differences are large. Therefore, when the depth difference in mask 50 is greater than the preset threshold 1, it is indicated that at least two persons are included in mask 50.
Illustratively, as shown in fig. 8 (d), the mask 60 contains only one person, whose body is at a roughly uniform distance from the camera, so the depth values in the mask 60 differ only slightly. Therefore, if the depth difference in the mask 60 is smaller than the preset threshold 1, it indicates that the mask 60 contains only one person.
In summary, in the case that the depth difference value in the same mask is greater than the preset threshold 1, the mask includes at least two objects.
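To make this check concrete, the following minimal Python sketch computes the depth spread within a single mask and compares it with preset threshold 1. It assumes the depth map is a NumPy array in meters and the mask is a boolean array for one connected mask region; the function name and the choice of the range (max minus min) as the depth difference are illustrative, since the standard deviation could be used instead.

```python
import numpy as np

def mask_has_multiple_objects(depth_map, mask, threshold_m=0.5):
    """Return True if the depth spread inside one mask exceeds the
    preset threshold 1, suggesting the mask contains at least two objects."""
    depths = depth_map[mask]
    if depths.size == 0:
        return False
    # Depth difference taken as the range; np.std(depths) is an alternative.
    depth_difference = depths.max() - depths.min()
    return depth_difference > threshold_m
```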
According to the image processing method provided in the embodiment of the application, the depth difference of the same mask in the first mask image is obtained; if the depth difference is greater than the first preset threshold, it is determined that the first image includes both the first object and the second object. In other words, whether the first image includes the first object and the second object at the same time is determined from the depth difference of the same mask in the first mask image, which improves the intelligence of determining that the first object and the second object exist in the first image.
For convenience of description, description will be made below taking an object in an image as a portrait. It should be understood that the objects in the image may also refer to animals, plants, buildings, etc., to which embodiments of the present application are not limited.
Among the masks described in fig. 8, the mask 10 in fig. 8 (a) and the mask 50 in fig. 8 (c) each include two objects. Therefore, step S103 may be performed on the first depth map corresponding to the mask shown in fig. 8 (a) and on the first depth map corresponding to the mask shown in fig. 8 (c).
S103, dividing each depth value in the first depth image into a plurality of depth intervals according to a preset interval range.
A depth histogram of the region corresponding to the mask in the first depth image may be calculated according to a preset interval range. For example, the preset interval range may be divided according to the distance between the target object and the camera; illustratively, the preset interval range may include ranges of 0.5 m to 1.5 m, 1.5 m to 2.5 m, and 2.5 m to 3.5 m between the target object and the camera. The resulting depth histogram may indicate the number of pixels within a distance of 0.5 m to 1.5 m from the camera, the number of pixels within a distance of 1.5 m to 2.5 m from the camera, and the number of pixels within a distance of 2.5 m to 3.5 m from the camera.
After each depth value in the first depth image is divided into a plurality of depth intervals according to the preset interval range, the interval where the main depth of the first depth image lies is further obtained, for example, the interval containing more than 30% of the pixels. For another example, the interval with the largest number of pixels may be directly determined as the interval where the main depth of the first depth image lies.
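As a sketch of this histogram step and the selection of the main depth interval, the following Python snippet builds a depth histogram over the preset interval ranges and returns the interval containing the most pixels. The interval boundaries and the function name are illustrative assumptions.

```python
import numpy as np

def main_depth_interval(depth_map, mask, edges=(0.5, 1.5, 2.5, 3.5)):
    """Divide the depth values inside a mask into preset intervals
    (0.5-1.5 m, 1.5-2.5 m, 2.5-3.5 m) and return the interval with
    the largest number of pixels (the target depth interval)."""
    depths = depth_map[mask]
    counts, _ = np.histogram(depths, bins=np.asarray(edges))
    i = int(np.argmax(counts))
    return (edges[i], edges[i + 1]), counts
```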
S104, determining a target depth interval with the largest number of pixels in the depth intervals.
S105, determining a first preset template according to the depth range of the target depth interval.
The first preset template is related to a first distance, and the first distance is the distance between the first object and the electronic equipment.
Optionally, the first preset template is a head profile template, and an area of the first preset template is inversely proportional to the first distance.
Since the head contour of a human body is clearly distinguishable from other body parts, the head contour can be used as a template to determine the number of people in a mask of the first mask image. The smaller the distance between the human body and the electronic device, the larger the head contour of the person in the image captured by the electronic device; the larger the distance, the smaller the head contour. Thus, the area of the head contour template is typically inversely proportional to the first distance.
On the basis of S104, after the target depth interval with the largest number of pixels in the first depth image is determined, the region occupying the largest area in the first depth image is determined, that is, the region where the main person in the image is located. In this case, the depth range of the target depth interval corresponds to the depth range of that portrait; therefore, the head contour template corresponding to this depth range may be used as the first preset template.
Illustratively, the target depth interval with the largest number of pixels among the plurality of depth intervals is 0.5 m to 1.5 m, and the preset head contour templates include 3 templates, as shown in fig. 9: a head contour template for a first distance of 1 m, a head contour template for a first distance of 2 m, and a head contour template for a first distance of 3 m. Based on the target depth interval of 0.5 m to 1.5 m, the head contour template for a first distance of 1 m is taken as the first preset template. It can be seen that the smaller the first distance, the larger the area of the first preset template, and the larger the first distance, the smaller the area of the first preset template.
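The selection of the first preset template can be sketched as a lookup from the target depth interval to a head contour template, as below. The template arrays here are placeholders (real templates would be binary head-contour images whose area shrinks with distance), and the sizes and function name are illustrative assumptions.

```python
import numpy as np

# Placeholder head-contour templates keyed by nominal first distance (m).
head_templates = {
    1.0: np.ones((120, 100), dtype=np.float32),  # largest: subject about 1 m away
    2.0: np.ones((60, 50), dtype=np.float32),
    3.0: np.ones((40, 34), dtype=np.float32),    # smallest: subject about 3 m away
}

def select_first_template(target_interval):
    """Pick the template whose nominal distance lies in the target depth
    interval, e.g. (0.5, 1.5) selects the 1 m template."""
    low, high = target_interval
    for distance in sorted(head_templates):
        if low <= distance < high:
            return head_templates[distance]
    return head_templates[max(head_templates)]   # fallback: farthest template
```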
In the embodiment of the application, the first preset template is a distance between the first object and the electronic device, namely, the first distance is related to the first distance, the larger the first distance is, the smaller the area of the head outline template is, and the larger the area of the head outline template is, therefore, the corresponding first preset template is directly selected according to the first distance, the size of the first preset template is more matched with the size of the first object, the efficiency of determining the first pixel point according to the first preset template is improved, and the efficiency of determining the second mask image according to the first pixel point is further improved.
Optionally, the candidate template may be rotated according to a shooting angle of the first image, to obtain a first preset template.
S1051, acquiring pose information of the electronic equipment.
The pose information may refer to an included angle between the electronic device and a gravity direction. It will be appreciated that a change in the pose of the electronic device may result in a change in the shape of the captured image. The pose information of the electronic device refers to pose information of the electronic device when the electronic device acquires the first image.
The electronic device may read angular velocities of the electronic device in the gyro sensor around three axes (i.e., x, y, and z axes) to determine pose information of the electronic device, and may also read accelerations in various directions in the acceleration sensor to determine pose information.
S1052, adjusting the shape of the first preset template according to the pose information to obtain an adjusted first preset template.
It will be appreciated that the electronic device is typically not aligned strictly with the direction of gravity when acquiring an image; therefore, the shooting angle introduces distortion into the acquired image. The pose information indicates the position and posture of the electronic device when it acquired the image, so the first preset template can be corrected based on the pose information: its shape is adjusted so that the angle corresponding to the first preset template approaches the shooting angle of the first image, which makes the first preset template better match the head contour of the portrait in the first image.
In the embodiment of the application, the first preset template is adjusted according to the pose information of the electronic device to obtain an adjusted first preset template whose angle is close to the shooting angle of the first image. The adjusted first preset template therefore better matches the head contour of the portrait in the first image, so the first pixel point can be obtained more accurately based on the adjusted first preset template, which improves the accuracy of the second mask image obtained based on the first pixel point and, in turn, the accuracy of the AR image obtained based on the second mask image and the first image.
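One possible shape adjustment, sketched below, is to rotate the template by the device's roll angle derived from the gyroscope or accelerometer readings; treating the adjustment as a pure in-plane rotation is an assumption made for illustration.

```python
from scipy.ndimage import rotate

def adjust_template(template, roll_degrees):
    """Rotate the head-contour template by the device roll angle so that
    the template orientation approaches the shooting angle of the first image."""
    # order=0 keeps a binary template binary; reshape=False keeps its size.
    return rotate(template, angle=roll_degrees, reshape=False, order=0)
```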
S106, acquiring an edge map corresponding to the first mask image.
The edge map is a binary map formed by the outlines of the masks in the first mask image.
Illustratively, the mask 10 is included in the first mask image, as shown in fig. 10 (a), and the edge map is a binary map formed for the outline of the mask 10 in the first mask image, as shown in fig. 10 (b).
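A simple way to obtain such an edge map is a morphological gradient of the binary mask image, which keeps only the outline of each mask. The OpenCV-based sketch below is one possible implementation, not necessarily the one used in this embodiment.

```python
import cv2
import numpy as np

def mask_edge_map(mask_image):
    """Build a binary edge map from the outlines of the masks in the
    first mask image (mask pixels > 0, background pixels == 0)."""
    mask_u8 = (mask_image > 0).astype(np.uint8)
    kernel = np.ones((3, 3), np.uint8)
    # Morphological gradient = dilation - erosion: only contours remain.
    edge = cv2.morphologyEx(mask_u8, cv2.MORPH_GRADIENT, kernel)
    return (edge > 0).astype(np.uint8)
```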
S107, performing convolution operation on the first preset template and the edge map to obtain a first pixel point.
The first pixel point is a pixel point in an image area of the first object. For example, the first pixel point may refer to a pixel point of a central region of the first object head. The first pixel point may be a pixel point with the largest convolution value obtained by performing convolution operation on the first preset template and the edge map.
The step of performing convolution operation on the edge map by using the first preset template may be that the first preset template is moved from one side of the edge map to the other side to perform convolution operation, so as to obtain a convolution value.
Illustratively, as shown in fig. 11, the first preset template is convolved from the leftmost side of the edge map to the right. For example, the convolution operation can be performed using the following formula (1).
C(x, y) = Σ_i Σ_j T(i, j) · E(x + i, y + j)    Formula (1);
where C(x, y) represents the convolution value, T(i, j) represents the pixel value of any point on the first preset template, and E(x + i, y + j) represents the pixel value of the corresponding point on the edge map.
It should be understood that, each pixel point on the edge map is convolved by using the above formula (1) to obtain a convolution value, where the pixel point with the largest convolution value is the first pixel point.
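The sliding convolution of formula (1) and the selection of the maximum response can be sketched with a 2-D cross-correlation; the SciPy call and the function name below are illustrative assumptions.

```python
import numpy as np
from scipy.signal import correlate2d

def locate_first_pixel(edge_map, template):
    """Slide the head-contour template over the edge map, compute the
    convolution value at every position (formula (1)), and return the
    position with the largest value as the first pixel point."""
    response = correlate2d(edge_map.astype(np.float32),
                           template.astype(np.float32), mode='same')
    y, x = np.unravel_index(np.argmax(response), response.shape)
    return (y, x), float(response[y, x])
```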
It should be understood that the first preset template determines the first pixel point of one object to be photographed; head contour templates of other sizes can likewise be convolved with the edge map to obtain the pixel points of the objects to be photographed at the corresponding distances.
For example, as shown in (a) of fig. 12, a convolution calculation is performed using a head outline template with a first distance of 1m and an edge map, resulting in a first pixel point in an image area of an object to be photographed with a distance of 1m from the electronic device. As shown in (b) of fig. 12, a convolution calculation is performed with the same edge map using a head contour template having a first distance of 2m, and the obtained first pixel is a first pixel in an image area of the object to be photographed having a distance of 2m from the electronic device.
S108, performing super-pixel segmentation on the first image to obtain a plurality of areas.
The plurality of areas include a first area, and the first area refers to an area including a first pixel point.
The super-pixel segmentation of the first image may refer to segmenting the first image according to brightness, color, etc., to obtain a plurality of regions, where the brightness and color of each region are close. The area including the first pixel point is a first area. It should be understood that super-pixel segmentation is generally to segment an image into regions formed by a number of pixels, that is, the first region includes a number of pixels.
For example, the super-pixel segmentation of the first image may be as shown in fig. 13, where the area where the first pixel point is located is a first area.
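The embodiment does not name a specific super-pixel algorithm; as one assumed choice, the SLIC implementation in scikit-image groups pixels of similar color and brightness, and the label of the first pixel point then identifies the first region.

```python
from skimage.segmentation import slic

def superpixel_regions(first_image_rgb, first_pixel, n_segments=400):
    """Super-pixel segment the first image and return the label map plus
    the label of the region containing the first pixel point (first region)."""
    labels = slic(first_image_rgb, n_segments=n_segments, compactness=10,
                  start_label=0)
    first_region = labels[first_pixel]       # first_pixel = (row, col)
    return labels, first_region
```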
S109, acquiring first depth information.
Wherein the first depth information is used to characterize the depth of the first region.
It should be understood that, after the first image is subjected to super-pixel segmentation, the first region is the region of the first image that contains the first pixel point. Because of the correspondence among the first image, the first depth image, and the first mask image, the depth of the first region is the depth of its corresponding region in the first depth image. The first region includes several pixel points, and the depth of the corresponding region in the first depth image refers to the average of the depth values of the pixel points in that region.
S110, determining a first region set in the plurality of regions according to the first depth information.
The difference value between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold value, and the first region set comprises the first region.
After super-pixel segmentation of the first image yields a plurality of regions, the depth information of each region, that is, the average depth value of the pixel points in each region, is obtained. Regions whose depth differs from that of the first region by less than the first threshold are taken as regions of the first region set, and the first region itself also belongs to the first region set. The regions in the first region set are close in depth to the first region, that is, their shooting distances are close. It should be appreciated that the shooting distances of different parts of the same person are usually close, i.e., the depths of the image regions of the same person are close. The first region set is therefore a set of image regions of the same person.
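Under those definitions, the first region set can be collected by comparing each region's mean depth with that of the first region, as in this minimal sketch; the first threshold value used here is an assumption.

```python
import numpy as np

def first_region_set(labels, depth_map, first_region, first_threshold=0.3):
    """Return the labels of all super-pixel regions whose mean depth is
    within first_threshold (meters, assumed) of the first region's depth."""
    region_depth = {r: float(depth_map[labels == r].mean())
                    for r in np.unique(labels)}
    ref = region_depth[first_region]
    return {r for r, d in region_depth.items()
            if abs(d - ref) < first_threshold} | {first_region}
```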
S111, generating a second mask image based on the first region set.
As is clear from the description in S110, the difference between the depth of each region in the first region set and the depth of the first region is smaller than the first threshold, and the first region set includes the first region. And merging the areas corresponding to the areas in the first area set in the first mask image to obtain a second mask image. The mask of the first object in the second mask image is an independent mask and is not a mask adhered to other objects to be shot.
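Merging the regions of the first region set then yields the second mask image, e.g. as in the sketch below (the function name is illustrative).

```python
import numpy as np

def build_second_mask(labels, region_set):
    """Merge the super-pixel regions in the first region set into an
    independent mask for the first object (the second mask image)."""
    second_mask = np.zeros(labels.shape, dtype=np.uint8)
    for r in region_set:
        second_mask[labels == r] = 1
    return second_mask
```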
And S112, generating an AR image based on the second mask image and the first image.
Since the mask in the second mask image is the mask independent of the first object, the second mask image and the first image can be subjected to enhanced image processing to obtain an AR image.
It should be appreciated that the video may be formed by continuously playing multiple frames of images, and similar to generating the AR image based on the second mask image and the first image, multiple frames of AR images may be generated by using multiple frames of the second mask image and multiple frames of the first image corresponding thereto, and then the AR video may be obtained based on the multiple frames of AR images. That is, AR video may also be generated based on the second mask image and the first image.
The image processing method provided in the embodiment of the application first obtains a first image, a first depth image, and a first mask image; it then determines a first pixel point in the first image using a first preset template, performs super-pixel segmentation on the first image to obtain a plurality of regions, determines a first region set among the plurality of regions based on first depth information, obtains a second mask image based on the first region set, and further obtains an augmented reality AR image based on the second mask image and the first image. The first image includes an image area of a first object and an image area of a second object, which partially overlap; the first object and the second object are objects to be photographed of the same category. The first depth image represents the depth information of the first object and the second object, and the first mask image identifies the image areas of the first object and the second object in the first image. The first pixel point is a pixel point in the image area of the first object. The first preset template is related to a first distance, which is the distance between the first object and the electronic device. The plurality of regions include a first region, and the first region includes the first pixel point. The first depth information represents the depth of the first region; the difference between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold, and the first region set includes the first region. The second mask image identifies the image area of the first object in the first image. Because the first pixel point of the image area of the first object is obtained through the first preset template, the first region set determined based on the depth of the first region containing the first pixel point consists of image regions of the same object and does not include image regions of other objects. Therefore, the mask in the second mask image obtained from the first region set indicates the image area of the first object independently. This avoids the situation in which the depth of the first object is taken as the average depth of two objects, improves the accuracy of the depth of the first object, avoids the problem of avatar misalignment in the generated AR image, and improves the effect of the generated AR image.
In a possible case, after the first pixel point is determined in the first mask image, a second pixel point, that is, a pixel point in the image area of the second object, may also be determined. A second region can then be obtained from the second pixel point, where the second region refers to the region, among the plurality of regions obtained by super-pixel segmentation of the first image, that contains the second pixel point. A second region set is then determined from the second region, where the difference between the depth of each region in the second region set and the depth of the second region is smaller than a second threshold, and the second region set includes the second region. This is equivalent to determining, in the first mask image, the set of regions whose depth values are close to that of the second pixel point, i.e., the regions of the second object. This is described in detail below with the embodiment shown in fig. 14.
Fig. 14 is a flowchart of an image processing method according to another embodiment of the present application, as shown in fig. 14, where the method includes:
s201, acquiring a first image, and a first depth image and a first mask image corresponding to the first image.
The implementation steps and principles of S201 are similar to those of S101, and are not described herein.
S202, acquiring a depth difference value of the same mask in the first mask image, and if the depth difference value is greater than a preset threshold value 1, executing S203.
The implementation steps and principles of S202 are similar to those of S102, and are not described herein.
S203, dividing each depth value in the first depth image into a plurality of depth intervals according to a preset interval range.
The implementation steps and principles of S203 are similar to those of S103, and are not described herein.
S204, determining a target depth interval with the largest number of pixels in the depth intervals.
The implementation steps and principles of S204 are similar to those of S104, and are not described herein.
S205, determining a first preset template according to the depth range of the target depth interval.
The implementation steps and principles of S205 are similar to those of S105, and are not described herein.
Optionally, the candidate template may be rotated according to a shooting angle of the first image, to obtain a first preset template.
S2051, acquiring pose information of the electronic equipment.
The implementation steps and principles of S2051 are similar to those of S1051, and are not described herein.
S2052, adjusting the shape of the first preset template according to the pose information to obtain an adjusted first preset template.
The implementation steps and principles of S2052 are similar to those of S1052, and are not described herein.
S206, determining a second preset template.
The second preset template is related to a second distance, and the second distance refers to a distance between the second object and the electronic device.
The first preset template is the template corresponding to the depth range where the main depth in the mask lies, and the second preset template may refer to a template corresponding to another depth range in the mask. On the basis of the depth difference of the same mask in the first mask image obtained in S202, the region corresponding to the target depth range is removed from that mask in the first depth image; the depth interval with the largest number of pixels among the remaining depth ranges (referred to here as the second depth interval for convenience of description) is then determined, and the template corresponding to the second depth interval is used as the second preset template.
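A sketch of this step: recompute the per-interval pixel counts for the mask, drop the target depth interval of the first object, and pick the most populated remaining interval as the second depth interval. Interval boundaries and the function name are illustrative assumptions.

```python
import numpy as np

def second_depth_interval(depth_map, mask, target_interval,
                          edges=(0.5, 1.5, 2.5, 3.5)):
    """Return the depth interval with the most pixels after excluding the
    target depth interval; its template is used as the second preset template."""
    depths = depth_map[mask]
    counts, _ = np.histogram(depths, bins=np.asarray(edges))
    intervals = list(zip(edges[:-1], edges[1:]))
    remaining = [(c, iv) for c, iv in zip(counts, intervals)
                 if iv != tuple(target_interval)]
    return max(remaining, key=lambda item: item[0])[1]
```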
According to the image processing method provided in the embodiment of the application, after the first preset template is determined according to the depth range of the target depth interval, the second preset template of the second object is further determined. On the basis of convolving the edge map with the first preset template, the edge map can also be convolved with the second preset template to obtain the second pixel point in the image area of the second object. After super-pixel segmentation of the first image, the obtained regions include the first region and the second region; the mask of the first object is then obtained based on the first region, and the mask of the second object is obtained based on the second region, so that the second mask image identifies the image areas of the first object and the second object at the same time. Therefore, in addition to generating the AR image based on the image area of the first object, the AR image can also be generated based on the image area of the second object at the same time, further improving the display effect of the generated AR image.
S207, acquiring an edge map corresponding to the first mask image, wherein the edge map is a binary map formed by the outline of the mask in the first mask image.
The implementation steps and principles of S207 are similar to those of S106, and are not described herein.
S208, performing convolution operation on the first preset template and the edge map to obtain a first pixel point.
The first pixel point is a pixel point of an image area of the first object. For example, the first pixel point may refer to a pixel point of a central region of the first object head. The first pixel point may be a pixel point with the largest convolution value obtained by performing convolution operation on the first preset template and the edge map.
The step of performing convolution operation on the edge map by using the first preset template may be that the first preset template is moved from one side of the edge map to the other side to perform convolution operation, so as to obtain a convolution value.
S209, performing convolution operation on the second preset template and the edge map to obtain a second pixel point.
The second pixel is a pixel in the image area of the second object, for example, the second pixel may refer to a pixel in the central area of the head of the second object. The second pixel point may be a pixel point with the largest convolution value obtained by performing convolution operation on the second preset template and the edge map.
The specific process of obtaining the second pixel point by adopting the second preset template and the edge map to perform convolution calculation can be similar to that by adopting the first preset template and the edge map, and will not be described herein again.
The template used for convolution with the edge map is a preset template, i.e., a template selected from a plurality of known templates. Therefore, in a possible case, the other templates besides the first preset template may be convolved with the edge map in turn, and the second pixel point of the second object is obtained from the convolution results. The second pixel point refers to a pixel point in the image area of the second object. It can be understood that the template whose convolution with the edge map yields the second pixel point is the second preset template.
S210, performing super-pixel segmentation on the first image to obtain a plurality of areas.
The plurality of areas comprise a first area and a second area, wherein the first area refers to an area comprising first pixel points, and the second area refers to an area comprising second pixel points.
The implementation steps and principles of S210 are similar to those of S108, and will not be described here again.
S211, acquiring first depth information and second depth information.
Wherein the first depth information is used to represent the depth of the first region. The second depth information is used to represent the depth of the second region.
It should be understood that, after the first image is subjected to super-pixel segmentation, the first region is the region of the first image that contains the first pixel point. Because of the correspondence among the first image, the first depth image, and the first mask image, the depth of the first region is the depth of its corresponding region in the first depth image. The first region includes several pixel points, and the depth of the corresponding region in the first depth image refers to the average of the depth values of the pixel points in that region.
Likewise, the second region refers to the region of the super-pixel-segmented first image that contains the second pixel point, and the depth of the second region refers to the depth of its corresponding region in the first depth image. The second region includes several pixel points, and the depth of the corresponding region in the first depth image is the average of the depth values of the pixel points in that region.
S212, determining a first region set in the plurality of regions according to the first depth information.
The difference value between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold value, and the first region set comprises the first region.
After super-pixel segmentation of the first image yields a plurality of regions, the depth information of each region, that is, the average depth value of the pixel points in each region, is obtained. Regions whose depth differs from that of the first region by less than the first threshold are taken as regions of the first region set, and the first region itself also belongs to the first region set. The regions in the first region set are close in depth to the first region, that is, their shooting distances are close. It should be appreciated that the shooting distances of different parts of the same person are usually close, i.e., the depths of the image regions of the same person are close. The first region set is therefore a set of image regions of the same person. In general, the closer an object to be photographed is to the camera, the larger the area it occupies in the first image. Because the first preset template is determined according to the target depth interval with the largest number of pixels among the depth intervals, the first pixel point determined by the first preset template is a pixel point of the object to be photographed (the first object) that occupies the larger area in the first image. That is, the first object may refer to the object to be photographed, within the same mask of the first mask image, that is closer to the camera.
S213, determining a second region set in the plurality of regions according to the second depth information.
The difference between the depth of each region in the second region set and the depth of the second region is smaller than a second threshold, and the second region set includes the second region.
After super-pixel segmentation of the first image yields a plurality of regions, the depth information of each region, that is, the average depth value of the pixel points in each region, is obtained. Regions whose depth differs from that of the second region by less than the second threshold are taken as regions of the second region set, and the second region itself also belongs to the second region set. The first threshold and the second threshold may be the same or different, which is not limited in the embodiments of the present application. The regions in the second region set are close in depth to the second region, that is, their shooting distances are close. It should be appreciated that the shooting distances of different parts of the same person are usually close, i.e., the depths of the image regions of the same person are close. The second region set is therefore a set of image regions of the same person, i.e., of the second object. Because the first object refers to the object to be photographed, within the same mask of the first mask image, that is closer to the camera, the second object refers to the object to be photographed, within the same mask, that is farther from the camera.
It should be appreciated that more objects to be photographed may be included in the same mask in the first mask image, for example, three objects to be photographed may be included in the same mask. In the case that the same mask includes a plurality of objects to be shot, a similar method may be used to determine a third pixel point of a third object in the first image, and determine a third region and a third region set based on the third pixel point. The third region refers to an image region including a third pixel point in the first image, a depth difference between a depth of each region in the third region set and the third region is smaller than a third threshold, and the third region set includes the third region. The first threshold, the second threshold, and the third threshold may be the same threshold, or may be different thresholds, which is not limited in this embodiment of the present application.
S214, generating a second mask image based on the first region set and the second region set.
As is clear from the description in S213 above, the difference between the depth of each region in the first region set and the depth of the first region is smaller than the first threshold, and the first region set includes the first region. The difference between the depth of each region in the second set of regions and the depth of the second region is less than a second threshold. And merging the areas corresponding to the areas in the first area set in the first mask image, and merging the areas in the second area set in the first mask image to obtain the second mask image. The mask of the first object in the second mask image is an independent mask, and the mask of the second object is also an independent mask and is not a mask adhered to other objects to be shot.
S215, generating an AR image based on the second mask image and the first image.
Since the second mask image contains an independent mask for the first object and an independent mask for the second object, the second mask image and the first image can be subjected to enhanced image processing to obtain an AR image.
According to the image processing method provided in the embodiment of the application, after the first preset template is determined according to the depth range of the target depth interval, the second preset template of the second object is further determined. On the basis of convolving the edge map with the first preset template, the edge map can also be convolved with the second preset template to obtain the second pixel point in the image area of the second object. After super-pixel segmentation of the first image, the obtained regions include the first region and the second region; the mask of the first object is then obtained based on the first region, and the mask of the second object is obtained based on the second region, so that the second mask image identifies the image areas of the first object and the second object at the same time. Therefore, in addition to generating the AR image based on the image area of the first object, the AR image can also be generated based on the image area of the second object at the same time, further improving the display effect of the generated AR image.
Fig. 15 is a flowchart of an image processing method according to another embodiment of the present application, as shown in fig. 15, where the method includes:
s301, acquiring a first image, a first depth image and a first mask image.
The first image comprises an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same category, the first depth image is used for representing depth information of the first object and the second object, and the first mask image is used for identifying the image area of the first object and the image area of the second object in the first image.
S302, determining a first pixel point in a first image by adopting a first preset template.
The first pixel point is a pixel point in an image area of the first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and the electronic device.
S303, performing super-pixel segmentation on the first image to obtain a plurality of areas.
The plurality of areas comprise a first area, and the first area comprises a first pixel point;
s304, determining a first region set in the plurality of regions based on the first depth information.
The first depth information is used for representing the depth of a first region, the difference value between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold value, and the first region set comprises the first region;
s305, obtaining a second mask image based on the first region set.
Wherein the second mask image is used to identify an image region of the first object in the first image;
The image processing method provided in the embodiment of the application first obtains a first image, a first depth image, and a first mask image; it then determines a first pixel point in the first image using a first preset template, performs super-pixel segmentation on the first image to obtain a plurality of regions, determines a first region set among the plurality of regions based on first depth information, obtains a second mask image based on the first region set, and further obtains an augmented reality AR image based on the second mask image and the first image. The first image includes an image area of a first object and an image area of a second object, which partially overlap; the first object and the second object are objects to be photographed of the same category. The first depth image represents the depth information of the first object and the second object, and the first mask image identifies the image areas of the first object and the second object in the first image. The first pixel point is a pixel point in the image area of the first object. The first preset template is related to a first distance, which is the distance between the first object and the electronic device. The plurality of regions include a first region, and the first region includes the first pixel point. The first depth information represents the depth of the first region; the difference between the depth of each region in the first region set and the depth of the first region is smaller than a first threshold, and the first region set includes the first region. The second mask image identifies the image area of the first object in the first image. Because the first pixel point of the image area of the first object is obtained through the first preset template, the first region set determined based on the depth of the first region containing the first pixel point consists of image regions of the same object and does not include image regions of other objects. Therefore, the mask in the second mask image obtained from the first region set indicates the image area of the first object independently. This avoids the situation in which the depth of the first object is determined from the average depth of at least two objects, improves the accuracy of the depth of the first object, avoids the problem of avatar misalignment in the generated AR image, and improves the effect of the generated AR image.
It should be understood that, although the steps in the flowcharts in the above embodiments are sequentially shown as indicated by arrows, these steps are not necessarily sequentially performed in the order indicated by the arrows. The steps are not strictly limited to the order of execution unless explicitly recited herein, and the steps may be executed in other orders. Moreover, at least a portion of the steps in the flowcharts may include a plurality of sub-steps or stages that are not necessarily performed at the same time, but may be performed at different times, the order in which the sub-steps or stages are performed is not necessarily sequential, and may be performed in turn or alternately with at least a portion of the sub-steps or stages of other steps or other steps.
It will be appreciated that in order to achieve the above-described functionality, the electronic device comprises corresponding hardware and/or software modules that perform the respective functionality. The steps of an algorithm for each example described in connection with the embodiments disclosed herein may be embodied in hardware or a combination of hardware and computer software. Whether a function is implemented as hardware or computer software driven hardware depends upon the particular application and design constraints imposed on the solution. Those skilled in the art may implement the described functionality using different approaches for each particular application in conjunction with the embodiments, but such implementation is not to be considered as outside the scope of this application.
The embodiments of the application may divide the electronic device into functional modules according to the above method examples; for example, each function may be assigned to a separate functional module, or two or more functions may be integrated into one module. It should be noted that the division of the modules in the embodiments of the present application is schematic and is merely a division by logical function; other division manners may be used in actual implementation. Likewise, the names of the modules are schematic and are not limited in actual implementation.
Fig. 16 is a schematic structural diagram of an image processing apparatus according to an embodiment of the present application.
It should be understood that the image processing apparatus 600 may perform the image processing methods shown in fig. 6 to 15; the image processing apparatus 600 includes: an acquisition unit 610 and a processing unit 620.
The obtaining unit 610 is configured to obtain a first image, a first depth image, and a first mask image, where the first image includes an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same class, the first depth image is used to represent depth information of the first object and the second object, and the first mask image identifies the image area of the first object and the image area of the second object in the first image;
The processing unit 620 is configured to determine a first pixel in the first image by using a first preset template, where the first pixel is a pixel in an image area of the first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and the electronic device;
the processing unit 620 is configured to perform super-pixel segmentation on the first image to obtain a plurality of regions, where the plurality of regions includes a first region, and the first region includes a first pixel point;
the processing unit 620 is configured to determine a first set of regions of the plurality of regions based on first depth information, the first depth information being used to characterize a depth of the first region, a difference between a depth of each region in the first set of regions and the depth of the first region being less than a first threshold, the first set of regions including the first region;
the processing unit 620 is configured to obtain, based on the first region set, a second mask image, where the second mask image is used to identify an image region of the first object in the first image;
the processing unit 620 is configured to obtain an augmented reality AR image based on the second mask image and the first image.
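For illustration only (this is not the implementation of the embodiment; every identifier, threshold, and the choice of the SLIC algorithm below are assumptions of the sketch), the cooperation of the obtaining unit 610 and the processing unit 620 described above can be sketched in Python as follows, using SLIC super-pixels and a per-region median depth:

```python
# Illustrative sketch only: one possible realization of the units described
# above. All names, thresholds, and the choice of SLIC are assumptions.
import numpy as np
from skimage.segmentation import slic

def build_second_mask(first_image, first_depth, seed_xy,
                      depth_threshold=0.15, n_segments=400):
    """first_image: HxWx3 RGB array; first_depth: HxW depth map aligned with
    it; seed_xy: (x, y) of the first pixel point found by the preset template.
    Returns a binary mask identifying the image area of the first object."""
    # Super-pixel segmentation of the first image into a plurality of regions.
    labels = slic(first_image, n_segments=n_segments, compactness=10,
                  start_label=0)

    # The first region is the super-pixel containing the first pixel point.
    x, y = seed_xy
    first_region = labels[y, x]

    # First depth information: characterize each region by its median depth.
    region_depth = {r: float(np.median(first_depth[labels == r]))
                    for r in np.unique(labels)}
    seed_depth = region_depth[first_region]

    # First region set: regions whose depth differs from the depth of the
    # first region by less than the first threshold (the seed is included).
    first_region_set = [r for r, d in region_depth.items()
                        if abs(d - seed_depth) < depth_threshold]

    # Second mask image: identifies only the image area of the first object.
    return np.isin(labels, first_region_set).astype(np.uint8) * 255
```

In a full implementation one would typically also require the selected regions to be spatially connected to the seed super-pixel, since a background region at a similar depth could otherwise be pulled into the first region set; the AR image is then obtained by compositing the virtual content against the first image using the depth associated with this mask.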
In one embodiment, if the first object and the second object are human, the first preset template is a head contour template, and the first distance is inversely proportional to an area of the first preset template.
In one embodiment, the obtaining unit 610 is further configured to obtain pose information of the electronic device;
the processing unit 620 is further configured to adjust the shape of the head contour template based on the pose information.
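Purely as a hedged illustration (the reference distance, the use of the roll angle alone, and all identifiers below are assumptions of this sketch, not details of the embodiment), the dependence of the head contour template on the first distance and on the pose information might be expressed as follows:

```python
# Hypothetical sketch: scale a head-contour template so that its area is
# inversely proportional to the subject distance, then rotate it with the
# device roll angle. The reference distance of 1 m is an assumed constant.
import cv2
import numpy as np

def make_head_template(base_template, distance_m, roll_deg,
                       ref_distance_m=1.0):
    """base_template: binary head-contour image defined at ref_distance_m."""
    # Area proportional to 1/distance, so the linear scale goes as 1/sqrt.
    scale = np.sqrt(ref_distance_m / max(distance_m, 1e-3))
    h, w = base_template.shape[:2]
    resized = cv2.resize(base_template,
                         (max(1, int(w * scale)), max(1, int(h * scale))))

    # Adjust the shape of the template according to the pose information
    # (here only an in-plane rotation by the roll angle is applied).
    ch, cw = resized.shape[:2]
    rot = cv2.getRotationMatrix2D((cw / 2, ch / 2), roll_deg, 1.0)
    return cv2.warpAffine(resized, rot, (cw, ch))
```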
In one embodiment, the processing unit 620 is further configured to: divide the depth values in the first depth image into a plurality of depth intervals according to a preset interval range; determine a target depth interval containing the largest number of pixels among the plurality of depth intervals; and determine the first preset template according to the depth range of the target depth interval.
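A minimal sketch of this histogram step (the 0.25 m interval width and the metric depth assumption are choices of the sketch, not values given by the embodiment) is shown below; the centre of the returned interval could then be used as the first distance when sizing the preset template, as in the earlier sketch.

```python
# Sketch only: bin the depth map into fixed-width intervals and pick the
# interval containing the most pixels (the target depth interval).
import numpy as np

def pick_target_depth_interval(first_depth, interval_m=0.25):
    """Return the (low, high) bounds of the most populated depth interval."""
    valid = first_depth[np.isfinite(first_depth) & (first_depth > 0)]
    lo, hi = float(valid.min()), float(valid.max())
    n_bins = max(1, int(np.ceil((hi - lo) / interval_m)))
    counts, edges = np.histogram(valid, bins=n_bins, range=(lo, hi + 1e-6))
    i = int(np.argmax(counts))          # index of the target depth interval
    return float(edges[i]), float(edges[i + 1])
```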
In one embodiment, the processing unit 620 is configured to: obtain an edge map corresponding to the first mask image, where the edge map is a binary map formed by the contours of the masks in the first mask image; and perform a convolution operation on the first preset template and the edge map to obtain the first pixel point.
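One way this step could be realized is sketched below; the Canny edge extraction and the use of cv2.filter2D (which computes correlation, used here as a stand-in for convolution with a roughly symmetric template) are assumptions of the sketch rather than requirements of the embodiment.

```python
# Illustrative sketch: correlate the preset head-contour template with the
# edge map of the first mask image and take the peak response location as
# the first pixel point.
import cv2
import numpy as np

def find_first_pixel(first_mask, template):
    """first_mask: uint8 mask image; template: uint8 head-contour template."""
    # Edge map: a binary map formed by the contours of the masks.
    edges = cv2.Canny(first_mask, 50, 150)

    # Slide the template over the edge map and accumulate the response.
    response = cv2.filter2D(edges.astype(np.float32), -1,
                            template.astype(np.float32))

    # The first pixel point is where the template fits the contour best.
    _, _, _, max_loc = cv2.minMaxLoc(response)
    return max_loc  # (x, y)
```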
In one embodiment, the processing unit 620 is configured to: obtain a depth difference value of the same mask in the first mask image; and if the depth difference value is larger than a first preset threshold, determine that the first image includes the first object and the second object.
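For illustration only, such a check might compare a robust depth spread inside a single mask against a preset threshold; the percentile choice and the 0.4 m threshold below are assumptions of the sketch.

```python
# Sketch with assumed thresholds: if the depth values inside one mask spread
# more than a preset threshold, treat the mask as covering two overlapping
# subjects rather than one.
import numpy as np

def mask_contains_two_objects(first_depth, mask, depth_gap_m=0.4):
    depths = first_depth[mask > 0]
    depths = depths[np.isfinite(depths) & (depths > 0)]
    if depths.size == 0:
        return False
    # Depth difference value of the same mask, computed with percentiles
    # rather than raw min/max to tolerate depth noise.
    diff = np.percentile(depths, 95) - np.percentile(depths, 5)
    return diff > depth_gap_m
```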
In one embodiment, the processing unit 620 is configured to: determine a second pixel point in the first image by using a second preset template, where the second pixel point is a pixel point in the image area of the second object, the second preset template is related to a second distance, and the second distance is the distance between the second object and the electronic device; the plurality of regions further include a second region, and the second region includes the second pixel point; determine a second region set in the plurality of regions based on second depth information, where the second depth information is used to characterize the depth of the second region, the difference between the depth of each region in the second region set and the depth of the second region is smaller than a second threshold, and the second region set includes the second region; and obtain the second mask image based on the first region set and the second region set.
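A minimal sketch of the final merge, assuming both region sets have already been grown around their respective seed pixels as sketched earlier (the label values and the overwrite rule for any overlapping super-pixels are choices of this sketch, not of the embodiment):

```python
# Sketch: combine the first and second region sets into a single mask image
# in which each subject is labelled separately.
import numpy as np

def merge_region_sets(labels, first_region_set, second_region_set):
    """labels: super-pixel label map; returns a mask with 1 for the first
    object, 2 for the second object, and 0 for the background."""
    second_mask = np.zeros(labels.shape, dtype=np.uint8)
    second_mask[np.isin(labels, list(first_region_set))] = 1
    second_mask[np.isin(labels, list(second_region_set))] = 2
    return second_mask
```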
In one embodiment, the processing unit 620 is configured to: obtain an edge map corresponding to the first mask image, where the edge map is a binary map formed by the contours of the objects in the first mask image; and perform a convolution operation on the second preset template and the edge map to obtain the second pixel point.
In one embodiment, the first depth image is acquired by a camera in the electronic device.
In one embodiment, the first depth image is obtained by inputting the first image into a preset depth estimation model, and the depth estimation model is a neural network model.
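As one concrete, publicly available example of such a depth estimation model (named purely for illustration; the embodiment does not specify a particular network), a MiDaS model loaded through torch.hub could be used. MiDaS predicts relative inverse depth, so its output would still need to be aligned to metric depth before standing in for the first depth image.

```python
# Example only: MiDaS as a stand-in for "a preset depth estimation model".
import cv2
import torch

def estimate_depth(first_image_bgr):
    model = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
    model.eval()
    transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
    img_rgb = cv2.cvtColor(first_image_bgr, cv2.COLOR_BGR2RGB)
    batch = transforms.small_transform(img_rgb)
    with torch.no_grad():
        pred = model(batch)
        pred = torch.nn.functional.interpolate(
            pred.unsqueeze(1), size=img_rgb.shape[:2],
            mode="bicubic", align_corners=False).squeeze()
    return pred.cpu().numpy()  # relative (inverse) depth map
```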
In one embodiment, the first mask image is obtained by inputting the first image into a preset semantic segmentation model, and the semantic segmentation model is a neural network model.
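Similarly, a pretrained torchvision DeepLabV3 network can serve as an illustrative stand-in for the semantic segmentation model (again an assumption of this sketch, not the model of the embodiment). Because its output is a class-level mask, two overlapping persons fall under the same "person" label, which is exactly the situation the method above is designed to resolve.

```python
# Example only: DeepLabV3 as a stand-in semantic segmentation model.
# Class index 15 corresponds to "person" in the label set of these weights.
import numpy as np
import torch
from torchvision.models.segmentation import deeplabv3_resnet50
from torchvision.transforms.functional import normalize

def estimate_person_mask(first_image_rgb):
    """first_image_rgb: HxWx3 uint8 RGB array; returns a uint8 person mask."""
    model = deeplabv3_resnet50(weights="DEFAULT")
    model.eval()
    x = torch.from_numpy(first_image_rgb).permute(2, 0, 1).float() / 255.0
    x = normalize(x, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
    with torch.no_grad():
        out = model(x.unsqueeze(0))["out"]     # 1 x C x H x W class scores
    classes = out.argmax(dim=1)[0]             # H x W predicted class index
    return (classes == 15).numpy().astype(np.uint8) * 255
```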
The image processing apparatus provided in this embodiment is configured to execute the image processing method in the foregoing embodiment, and the technical principles and technical effects are similar and are not described herein again.
The image processing apparatus 600 is embodied here in the form of functional units. The term "unit" herein may be implemented in the form of software and/or hardware, which is not specifically limited.
For example, a "unit" may be a software program, a hardware circuit or a combination of both that implements the functions described above. The hardware circuitry may include application specific integrated circuits (application specific integrated circuit, ASICs), electronic circuits, processors (e.g., shared, proprietary, or group processors, etc.) and memory for executing one or more software or firmware programs, merged logic circuits, and/or other suitable components that support the described functions.
Thus, the elements of the examples described in the embodiments of the present application can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
Fig. 17 shows a schematic structural diagram of an electronic device provided in the present application. The dashed line in fig. 17 indicates that the unit or the module is optional. The electronic device 700 may be used to implement the image processing method described in the method embodiments described above.
The electronic device 700 includes one or more processors 701, and the one or more processors 701 may support the electronic device 700 in implementing the image processing method in the method embodiments. The processor 701 may be a general-purpose processor or a special-purpose processor. For example, the processor 701 may be a central processing unit (central processing unit, CPU), a digital signal processor (digital signal processor, DSP), an application specific integrated circuit (application specific integrated circuit, ASIC), a field programmable gate array (field programmable gate array, FPGA), or another programmable logic device, discrete gate or transistor logic device, or discrete hardware component.
The processor 701 may be used to control the electronic device 700, execute a software program, and process data of the software program. The electronic device 700 may further comprise a communication unit 705 for enabling input (reception) and output (transmission) of signals.
For example, the electronic device 700 may be a chip, the communication unit 705 may be an input and/or output circuit of the chip, or the communication unit 705 may be a communication interface of the chip, which may be an integral part of a terminal device or other electronic device.
For another example, the electronic device 700 may be a terminal device, the communication unit 705 may be a transceiver of the terminal device, or the communication unit 705 may be a transceiver circuit of the terminal device.
The electronic device 700 may include one or more memories 702 having a program 704 stored thereon, and the program 704 may be executed by the processor 701 to generate instructions 703, such that the processor 701 performs the image processing method described in the above method embodiments according to the instructions 703.
Optionally, the memory 702 may also have data stored therein. Alternatively, processor 701 may also read data stored in memory 702, which may be stored at the same memory address as program 704, or which may be stored at a different memory address than program 704.
The processor 701 and the memory 702 may be provided separately or may be integrated together; for example, integrated on a System On Chip (SOC) of the terminal device.
Illustratively, the memory 702 may be used to store a related program 704 of the image processing method provided in the embodiment of the present application, and the processor 701 may be used to invoke the related program 704 of the image processing method stored in the memory 702 when performing image processing, to execute the image processing method of the embodiment of the present application; comprising the following steps: acquiring a first image, a first depth image and a first mask image, wherein the first image comprises an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same category, the first depth image is used for representing depth information of the first object and the second object, and the first mask image is used for identifying the image area of the first object and the image area of the second object in the first image; determining a first pixel point in a first image by adopting a first preset template, wherein the first pixel point is a pixel point in an image area of a first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and electronic equipment; performing super-pixel segmentation on the first image to obtain a plurality of areas, wherein the plurality of areas comprise a first area, and the first area comprises first pixel points; determining a first region set of the plurality of regions based on first depth information, the first depth information being used to characterize a depth of the first region, a difference between a depth of each region in the first region set and the depth of the first region being less than a first threshold, the first region set comprising the first region; obtaining a second mask image based on the first region set, wherein the second mask image is used for identifying an image region of a first object in the first image; and obtaining the augmented reality AR image based on the second mask image and the first image.
The present application also provides a computer program product which, when executed by the processor 701, implements the image processing method according to any of the method embodiments of the present application.
The computer program product may be stored in the memory 702, for example as the program 704, which is converted, through preprocessing, compiling, assembling, and linking, into an executable object file that can be executed by the processor 701.
The present application also provides a computer-readable storage medium having stored thereon a computer program which, when executed by a computer, implements the image processing method according to any of the method embodiments of the present application. The computer program may be a high-level language program or an executable object program.
The computer-readable storage medium may be, for example, the memory 702. The memory 702 may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (random access memory, RAM), which is used as an external cache. By way of example, and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the device embodiments described above are merely illustrative; for example, the division of the units is only one logic function division, and other division modes can be adopted in actual implementation; for example, multiple units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The foregoing is merely specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily think about changes or substitutions within the technical scope of the present application, and the changes and substitutions are intended to be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. An image processing method, comprising:
acquiring a first image, a first depth image and a first mask image, wherein the first image comprises an image area of a first object and an image area of a second object, the image area of the first object and the image area of the second object are partially overlapped, the first object and the second object are objects to be shot in the same category, the first depth image is used for representing depth information of the first object and the second object, and the first mask image is used for identifying the image area of the first object and the image area of the second object in the first image;
Determining a first pixel point in the first image by adopting a first preset template, wherein the first pixel point is a pixel point in an image area of the first object, the first preset template is related to a first distance, and the first distance is a distance between the first object and electronic equipment;
performing super-pixel segmentation on the first image to obtain a plurality of areas, wherein the plurality of areas comprise a first area, and the first area comprises the first pixel point;
determining a first set of regions of the plurality of regions based on first depth information, the first depth information being used to characterize a depth of the first region, a difference between a depth of each region of the first set of regions and the depth of the first region being less than a first threshold, the first set of regions comprising the first region;
and obtaining a second mask image based on the first region set, wherein the second mask image is used for identifying the image region of the first object in the first image.
2. The method of claim 1, wherein if the first object and the second object are human, the first preset template is a head contour template, and the first distance is inversely proportional to an area of the first preset template.
3. The method as recited in claim 2, further comprising:
acquiring pose information of the electronic equipment;
and adjusting the shape of the head profile template based on the pose information.
4. A method according to any one of claims 1 to 3, further comprising:
dividing each depth value in the first depth image into a plurality of depth intervals according to a preset interval range;
determining a target depth interval with the largest number of pixels in the depth intervals;
and determining the first preset template according to the depth range of the target depth interval.
5. A method according to any one of claims 1 to 3, wherein said determining a first pixel point in said first image using a first preset template comprises:
acquiring an edge map corresponding to the first mask image, wherein the edge map is a binary map formed by the outline of a mask in the first mask image;
and carrying out convolution operation on the first preset template and the edge map to obtain the first pixel point.
6. A method according to any one of claims 1 to 3, further comprising:
acquiring a depth difference value of the same mask in the first mask image;
And if the depth difference value is larger than a first preset threshold value, determining that the first image comprises the first object and the second object.
7. A method according to any one of claims 1 to 3, further comprising:
determining a second pixel point in the first image by adopting a second preset template, wherein the second pixel point is a pixel point in an image area of the second object, the second preset template is related to a second distance, and the second distance is a distance between the second object and the electronic equipment;
the plurality of regions further includes a second region including the second pixel point;
determining a second set of regions of the plurality of regions based on second depth information, the second depth information being used to characterize a depth of the second region, a difference between a depth of each region of the second set of regions and a depth of the second region being less than a second threshold, the second set of regions comprising the second region;
the obtaining a second mask image based on the first region set includes:
and obtaining the second mask image based on the first region set and the second region set.
8. The method of claim 7, wherein determining a second pixel in the first image using a second preset template comprises:
acquiring an edge map corresponding to the first mask image, wherein the edge map is a binary map formed by the outline of an object in the first mask image;
and carrying out convolution operation on the second preset template and the edge map to obtain the second pixel point.
9. A method according to any one of claims 1 to 3, wherein the first depth image is acquired by a camera in the electronic device.
10. A method according to any one of claims 1 to 3, wherein the first depth image is obtained by inputting the first image into a preset depth estimation model, and the depth estimation model is a neural network model.
11. A method according to any one of claims 1 to 3, wherein the first mask image is obtained by inputting the first image into a preset semantic segmentation model, and the semantic segmentation model is a neural network model.
12. A method according to any one of claims 1 to 3, further comprising:
And obtaining an Augmented Reality (AR) image or obtaining an AR video based on the second mask image and the first image.
13. An image processing apparatus, characterized in that the apparatus comprises a processor and a memory for storing a computer program, the processor being adapted to call and run the computer program from the memory, so that the apparatus performs the method of any one of claims 1 to 12.
14. A chip comprising a processor which, when executing instructions, performs the method of any of claims 1 to 12.
15. An electronic device comprising a processor for coupling with a memory and reading instructions in the memory and, in accordance with the instructions, causing the electronic device to perform the method of any one of claims 1 to 12.
16. A computer readable storage medium, characterized in that the computer readable storage medium stores a computer program, which when executed by a processor causes the processor to perform the method of any of claims 1 to 12.
CN202310291200.6A 2023-03-23 2023-03-23 Image processing method and electronic device Active CN116109828B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202310291200.6A CN116109828B (en) 2023-03-23 2023-03-23 Image processing method and electronic device
CN202310983523.1A CN117173405A (en) 2023-03-23 2023-03-23 Image processing method and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310291200.6A CN116109828B (en) 2023-03-23 2023-03-23 Image processing method and electronic device

Related Child Applications (1)

Application Number Title Priority Date Filing Date
CN202310983523.1A Division CN117173405A (en) 2023-03-23 2023-03-23 Image processing method and electronic device

Publications (2)

Publication Number Publication Date
CN116109828A true CN116109828A (en) 2023-05-12
CN116109828B CN116109828B (en) 2023-08-18

Family

ID=86265687

Family Applications (2)

Application Number Title Priority Date Filing Date
CN202310291200.6A Active CN116109828B (en) 2023-03-23 2023-03-23 Image processing method and electronic device
CN202310983523.1A Pending CN117173405A (en) 2023-03-23 2023-03-23 Image processing method and electronic device

Family Applications After (1)

Application Number Title Priority Date Filing Date
CN202310983523.1A Pending CN117173405A (en) 2023-03-23 2023-03-23 Image processing method and electronic device

Country Status (1)

Country Link
CN (2) CN116109828B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2014035597A (en) * 2012-08-07 2014-02-24 Sharp Corp Image processing apparatus, computer program, recording medium, and image processing method
CN104517317A (en) * 2015-01-08 2015-04-15 东华大学 Three-dimensional reconstruction method of vehicle-borne infrared images
CN107403436A (en) * 2017-06-26 2017-11-28 中山大学 A kind of character contour quick detection and tracking based on depth image
US20180357813A1 (en) * 2017-06-09 2018-12-13 Foundation Of Soongsil University-Industry Cooperation Hole filling method using estimated spatio-temporal background information, and recording medium and apparatus for performing the same
CN113395434A (en) * 2020-03-11 2021-09-14 武汉Tcl集团工业研究院有限公司 Preview image blurring method, storage medium and terminal equipment
CN113780291A (en) * 2021-08-25 2021-12-10 北京达佳互联信息技术有限公司 Image processing method and device, electronic equipment and storage medium
CN114245905A (en) * 2019-08-09 2022-03-25 谷歌有限责任公司 Depth aware photo editing
CN114359332A (en) * 2021-12-30 2022-04-15 云从科技集团股份有限公司 Target tracking method, device, equipment and medium based on depth image
WO2022134382A1 (en) * 2020-12-22 2022-06-30 深圳市慧鲤科技有限公司 Image segmentation method and apparatus, and electronic device, storage medium and computer program

Also Published As

Publication number Publication date
CN116109828B (en) 2023-08-18
CN117173405A (en) 2023-12-05

Similar Documents

Publication Publication Date Title
CN110647865B (en) Face gesture recognition method, device, equipment and storage medium
EP3869456A1 (en) Image segmentation method and apparatus, computer device and storage medium
CN109993115B (en) Image processing method and device and wearable device
CN110570460B (en) Target tracking method, device, computer equipment and computer readable storage medium
CN110807361A (en) Human body recognition method and device, computer equipment and storage medium
WO2021078001A1 (en) Image enhancement method and apparatus
CN114946169A (en) Image acquisition method and device
CN112036331A (en) Training method, device and equipment of living body detection model and storage medium
CN115526983B (en) Three-dimensional reconstruction method and related equipment
WO2022179604A1 (en) Method and apparatus for determining confidence of segmented image
CN111242090A (en) Human face recognition method, device, equipment and medium based on artificial intelligence
WO2024021742A1 (en) Fixation point estimation method and related device
CN113711123B (en) Focusing method and device and electronic equipment
CN115619858A (en) Object reconstruction method and related equipment
CN110956571B (en) SLAM-based virtual-real fusion method and electronic equipment
CN109981989B (en) Method and device for rendering image, electronic equipment and computer readable storage medium
WO2021218695A1 (en) Monocular camera-based liveness detection method, device, and readable storage medium
CN115482359A (en) Method for measuring size of object, electronic device and medium thereof
CN115049819A (en) Watching region identification method and device
CN115908120B (en) Image processing method and electronic device
CN111199171B (en) Wrinkle detection method and terminal equipment
CN113468929A (en) Motion state identification method and device, electronic equipment and storage medium
CN116109828B (en) Image processing method and electronic device
CN114827442B (en) Method for generating image and electronic equipment
CN113938606A (en) Method and device for determining ball machine erection parameters and computer storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant