WO2022179581A1

WO2022179581A1 - Image processing method and related device

Info

Publication number: WO2022179581A1
Application number: PCT/CN2022/077788
Authority: WO
Inventors: 刘毅; 罗达新; 万单盼; 许松岑
Original assignee: 华为技术有限公司
Priority date: 2021-02-26
Filing date: 2022-02-25
Publication date: 2022-09-01
Also published as: CN113066001A

Abstract

Disclosed are an image processing method and a related device, relating to computer vision technology in the field of artificial intelligence. The method can apply different degrees of blurring to the background area of the current image frame, so that the background area has a layered, and therefore more realistic, blur effect. The method in the present application comprises: acquiring depth information of a background area of the current image frame; dividing the background area into a plurality of sub-areas according to the depth information, wherein the distances from the captured objects corresponding to different sub-areas to a camera are different; acquiring a motion vector of each of the plurality of sub-areas, wherein the motion vector of each sub-area indicates the motion of the sub-area relative to a previous image frame; and performing blurring processing on each sub-area according to its motion vector.

Description

An image processing method and related equipment

This application claims priority to Chinese patent application No. 202110218462.0, entitled "An Image Processing Method and Related Equipment" and filed with the China Patent Office on February 26, 2021, the entire contents of which are incorporated herein by reference.

Technical field

The present application relates to the field of computer technology, and in particular, to an image processing method and related equipment.

Background

Panning refers to a shooting method of tracking a moving target object. The image obtained by this shooting method can present a clear foreground area (including the target object) and a blurred background area. When users pan with a terminal device, they usually need to control the shutter speed carefully. If the shutter speed is too fast, the background area of the image will not have an obvious blur effect. If the shutter speed is too slow, the foreground area of the image will not be clear enough.

In view of the difficulty and uncontrollability of panning, the user can capture a group of image frames at a high shutter speed through the terminal device (because the shutter speed is high, the background area of this group of image frames has no obvious blur effect) and then process them. Specifically, suppose that the group includes three image frames sorted by time (with any one of them as the current image frame). The terminal device can first align the three image frames based on the target object, and then perform frame interpolation between adjacent image frames to obtain more image frames. The terminal device then performs frame mixing on the original image frames and the interpolated image frames, so that the background area of the current image frame has a blur effect.

In the above process, due to the limitations of the frame mixing technology, if the number of image frames used for frame mixing is small, the blur effect of the background area of the current image frame is often unrealistic; for example, ghosting and blurring appear in the background area.

Summary of the invention

The embodiments of the present application provide an image processing method and related equipment, which can perform different degrees of blurring on the background area of the current image frame, so that the background area of the current image frame has a layered blur effect, that is, a more realistic blur effect.

A first aspect of the embodiments of the present application provides an image processing method, the method comprising:

When the user needs to pan a moving target object, a group of consecutive image frames can be captured by the camera of the terminal device at a high shutter speed. In this group of image frames, each image frame includes a foreground area and a background area, and both areas contain (present) subjects. The subject contained in the foreground area is generally the target object that the user pays attention to, and the subjects contained in the background area are non-target objects that the user does not pay attention to.

Since the background area of the group of image frames does not have an obvious blur effect, the terminal device needs to process it. From the group of image frames, the terminal device may select one image frame as the image frame to be processed, that is, the current image frame. Next, the terminal device can obtain the depth information of the background area of the current image frame. This depth information indicates the distance from each subject included in the background area to the camera, that is, the distance from the position of each subject in the actual environment (3D space) to the camera.

It is worth noting that different subjects are at different distances from the camera. For example, in the current image frame, the foreground area includes a moving vehicle, and the background area includes a tree behind the vehicle and a house behind the tree, so the distance from the tree to the camera differs from the distance from the house to the camera. Therefore, the terminal device can divide the background area of the current image frame into multiple sub-areas according to the depth information of the background area. Continuing the previous example, the background area of the current image frame can be divided into two sub-areas: one sub-area contains the tree behind the vehicle and the other contains the house behind the tree. In this way, the objects corresponding to (contained in) different sub-areas are at different distances from the camera.

Finally, the terminal device performs different degrees of blurring processing on different sub-regions to obtain the processed current image frame.

It can be seen from the above method that after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the distances from the objects corresponding to the different sub-regions to the camera are different, the motion conditions of the different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can perform blurring processing on different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blurring effect.

In a possible implementation manner, the depth information of the background area of the current image frame includes the depth value of each pixel in the background area of the current image frame, and dividing the background area into multiple sub-areas according to the depth information specifically includes: determining the depth change rate of each pixel in the background area of the current image frame according to the depth values of the pixels in the background area, where the depth change rate of each pixel is determined from the depth value of that pixel and the depth values of the surrounding pixels; and dividing the background area into multiple sub-areas according to the depth change rate of each pixel and a preset change rate threshold. In the foregoing implementation manner, for any pixel in the background area of the current image frame, the depth value of the pixel indicates the distance from the position corresponding to the pixel in the actual environment to the camera. Therefore, the terminal device can determine the depth change rate of each pixel according to the depth value of the pixel and the depth values of the surrounding pixels. The depth change rate of a pixel indicates the difference between the distance from the position corresponding to the pixel in the actual environment to the camera and the distances from the positions corresponding to the surrounding pixels to the camera. After obtaining the depth change rates of all pixels in the background area of the current image frame, the terminal device can accurately divide the background area into multiple sub-areas according to the depth change rates, such that the subjects corresponding to different sub-areas are at different distances from the camera.
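The division step described above can be illustrated with a minimal sketch. It assumes, purely for illustration, that the depth change rate between two neighbouring pixels is the absolute difference of their depth values and that neighbouring pixels whose change rate does not exceed the preset threshold belong to the same sub-area (region growing). The function name `split_by_depth` and the sample values are hypothetical and are not taken from the application.

```python
from collections import deque

def split_by_depth(depth, rate_threshold):
    """Label each pixel of a depth grid with a sub-area index.

    Assumption: two 4-connected neighbours are in the same sub-area when
    the absolute difference of their depth values (used here as the depth
    change rate) does not exceed rate_threshold.
    """
    height, width = len(depth), len(depth[0])
    labels = [[-1] * width for _ in range(height)]
    next_label = 0
    for start_y in range(height):
        for start_x in range(width):
            if labels[start_y][start_x] != -1:
                continue  # pixel already assigned to a sub-area
            labels[start_y][start_x] = next_label
            queue = deque([(start_y, start_x)])
            while queue:
                y, x = queue.popleft()
                for dy, dx in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                    ny, nx = y + dy, x + dx
                    if not (0 <= ny < height and 0 <= nx < width):
                        continue
                    if labels[ny][nx] != -1:
                        continue
                    if abs(depth[ny][nx] - depth[y][x]) <= rate_threshold:
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label

# Hypothetical background depths in metres: a tree (about 5 m) and a house (about 20 m).
depth_map = [
    [5.0, 5.1, 20.0, 20.2],
    [5.0, 5.2, 20.1, 20.3],
]
labels, sub_area_count = split_by_depth(depth_map, rate_threshold=1.0)
```

With these sample values, the large depth jump between the second and third columns separates the grid into two sub-areas, which matches the tree and house example.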

In a possible implementation manner, performing different degrees of blurring on different sub-regions to obtain the processed current image frame specifically includes: acquiring the motion vector of each of the multiple sub-regions, where the motion vector of each sub-region indicates the motion of the sub-region relative to the previous image frame; and blurring each sub-region according to its motion vector to obtain the processed current image frame. In the foregoing implementation manner, the camera generally rotates or translates when tracking the target object. When the camera is moving, objects at different distances move differently relative to the camera (which can also be understood as moving to different degrees). For example, closer objects move to a greater extent and farther objects move to a lesser extent, and this is reflected in the successive image frames captured by the camera. Specifically, in the multiple sub-regions of the background area of the current image frame, since the objects corresponding to different sub-regions are at different distances from the camera, the motions of these objects relative to the camera are different. Therefore, taking the previous image frame of the current image frame as a reference, different sub-regions of the background region of the current image frame have different motions relative to the previous image frame. For example, suppose that the background region of the current image frame contains two sub-regions A and B. Then, the motion of sub-region A from the previous image frame to the current image frame is different from the motion of sub-region B from the previous image frame to the current image frame.

In order to determine the motion of each sub-area in the background area of the current image frame, the terminal device may obtain the motion vector of each sub-area, where the motion vector of each sub-area indicates the motion of the sub-area relative to the previous image frame. Then, the terminal device performs blurring processing on each sub-region according to the motion vector of the sub-region. After all sub-regions are blurred, the background region of the current image frame can have a realistic blur effect.
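To make the statement that closer objects move more concrete, the following sketch uses the standard pinhole camera model, which is an assumption introduced here and not part of the application. It covers only a camera that translates sideways (parallel to the image plane): a static point at depth Z then shifts in the image by roughly f * t / Z pixels, where f is the focal length in pixels and t is the camera translation between two frames. The function name and the numeric values are hypothetical.

```python
def image_shift_pixels(focal_length_px, camera_shift_m, depth_m):
    """Approximate image displacement of a static point under sideways
    camera translation in a pinhole model (assumption): f * t / Z."""
    return focal_length_px * camera_shift_m / depth_m

# Hypothetical values: the camera moves 0.5 m between two frames.
tree_shift = image_shift_pixels(focal_length_px=1000.0, camera_shift_m=0.5, depth_m=5.0)
house_shift = image_shift_pixels(focal_length_px=1000.0, camera_shift_m=0.5, depth_m=20.0)
```

Under these assumptions the nearer tree shifts four times as far in the image as the house, which is why sub-areas at different depths call for different degrees of blurring.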

It can be seen from this implementation manner that, after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the distances from the objects corresponding to the different sub-regions to the camera are different, the motions of the different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can acquire the motion vector of each sub-region, where the motion vector of each sub-region indicates the motion of the sub-region relative to the previous image frame. Since the motions of different sub-areas are different, that is, the motion vectors of different sub-areas are different, the terminal device can perform blurring processing on each sub-area according to its motion vector, that is, blur different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blur effect.

In a possible implementation manner, the motion vector of each sub-area includes the movement speed of the sub-area and the movement direction of the sub-area. Acquiring the motion vector of each of the multiple sub-areas includes: for each of the multiple sub-areas, the terminal device can determine the movement speed of the sub-area according to the movement speed of at least one target pixel in the sub-area from the previous image frame to the current image frame; for example, the average of the movement speeds of the target pixels is taken as the movement speed of the sub-area. Further, the terminal device can determine the movement direction of the sub-area according to the movement direction of the at least one target pixel from the previous image frame to the current image frame; for example, the movement directions of these target pixels are usually the same, so the terminal device uses their movement direction as the movement direction of the sub-area. Through the foregoing implementation manner, the terminal device can more accurately estimate the movement speed and movement direction of each sub-area, that is, the motion of each sub-area relative to the previous image frame.
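As a minimal sketch of this aggregation, the function below averages the speeds of the target pixels, as the text describes. For the direction, the text only says that the target pixels usually share one direction; the sketch assumes a circular mean of the individual directions, which reduces to the common direction when they agree. The name `sub_area_motion` and the sample values are hypothetical.

```python
import math

def sub_area_motion(target_pixel_motions):
    """Combine per-pixel (speed, direction_in_radians) pairs into one
    motion vector for the sub-area.

    Speed: arithmetic mean, as described in the text.
    Direction: circular mean (an assumption made for this sketch).
    """
    speeds = [speed for speed, _ in target_pixel_motions]
    mean_speed = sum(speeds) / len(speeds)
    sum_cos = sum(math.cos(angle) for _, angle in target_pixel_motions)
    sum_sin = sum(math.sin(angle) for _, angle in target_pixel_motions)
    direction = math.atan2(sum_sin, sum_cos)
    return mean_speed, direction

# Hypothetical target pixels, all moving horizontally (direction 0 rad).
speed, direction = sub_area_motion([(4.0, 0.0), (6.0, 0.0)])
```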

In a possible implementation manner, performing the blurring process on each sub-region according to the motion vector of the sub-region specifically includes: for each sub-region, constructing the convolution kernel corresponding to the sub-region according to the movement speed and movement direction of the sub-region; and performing convolution processing on the sub-region with the convolution kernel corresponding to the sub-region. In the foregoing implementation manner, since the motion vectors of different sub-regions are different (generally, the motion speeds differ while the motion directions are the same), convolution kernels corresponding to different sub-regions can be constructed based on their motion vectors. Each kernel is then used to perform convolution processing on its sub-region, so that different sub-regions are blurred to different degrees and the background region of the current image frame has a more realistic blur effect.
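The application does not spell out how the kernel is built, so the following is only one plausible sketch: a normalized line-shaped kernel whose length grows with the movement speed and whose orientation follows the movement direction, applied only to the pixels of one sub-area. The names `motion_blur_kernel` and `blur_sub_area`, the rounding of the speed to an odd kernel size, and the edge clamping are all assumptions made for illustration.

```python
import math

def motion_blur_kernel(speed, direction):
    """Build a normalized line kernel (assumed form): longer for faster
    sub-areas, oriented along the movement direction in radians."""
    length = max(1, int(round(speed)))
    if length % 2 == 0:
        length += 1  # keep an odd size so the kernel has a centre cell
    centre = length // 2
    kernel = [[0.0] * length for _ in range(length)]
    for i in range(-centre, centre + 1):
        x = centre + int(round(i * math.cos(direction)))
        y = centre + int(round(i * math.sin(direction)))
        kernel[y][x] = 1.0
    total = sum(map(sum, kernel))
    return [[value / total for value in row] for row in kernel]

def blur_sub_area(image, mask, kernel):
    """Filter only the pixels where mask is True; others are copied.

    The line kernel is symmetric about its centre, so this sliding-window
    sum gives the same result as a true convolution. Samples outside the
    image are clamped to the nearest edge pixel.
    """
    height, width = len(image), len(image[0])
    centre = len(kernel) // 2
    output = [row[:] for row in image]
    for y in range(height):
        for x in range(width):
            if not mask[y][x]:
                continue
            accumulated = 0.0
            for ky, kernel_row in enumerate(kernel):
                for kx, weight in enumerate(kernel_row):
                    if weight == 0.0:
                        continue
                    sample_y = min(max(y + ky - centre, 0), height - 1)
                    sample_x = min(max(x + kx - centre, 0), width - 1)
                    accumulated += weight * image[sample_y][sample_x]
            output[y][x] = accumulated
    return output

# Hypothetical speeds: the near sub-area moves 9 px per frame, the far one 3 px.
near_kernel = motion_blur_kernel(speed=9.0, direction=0.0)
far_kernel = motion_blur_kernel(speed=3.0, direction=0.0)
row_image = [[0.0, 0.0, 9.0, 0.0, 0.0]]
row_mask = [[True, True, True, True, False]]
blurred = blur_sub_area(row_image, row_mask, far_kernel)
```

Because the faster sub-area gets the longer kernel, it is smeared over more pixels, which gives the layered blur the text describes.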

In a possible implementation manner, the at least one target pixel is a corner point. In the foregoing implementation manner, the target pixels in a sub-area are generally the corner points in the sub-area. Since the features of corner points are relatively distinctive, the motion of the corner points in a sub-area can better represent the motion of the sub-area.

In a possible implementation manner, the movement speed and movement direction of the at least one target pixel are acquired by an optical flow method. In the foregoing implementation manner, the terminal device can use the optical flow method to determine the moving distance of the target pixel from the previous image frame to the current image frame, the position of the target pixel in the previous image frame, and the position of the target pixel in the current image frame. In this way, the terminal device can determine the movement speed of the target pixel based on its moving distance from the previous image frame to the current image frame, and determine the movement direction of the target pixel based on its position in the previous image frame and its position in the current image frame.
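The last step of this paragraph, turning two tracked positions into a speed and a direction, can be sketched as follows. The positions are assumed to come from some optical flow tracker (for example, a pyramidal Lucas-Kanade tracker such as OpenCV's `calcOpticalFlowPyrLK` applied to corners from `goodFeaturesToTrack` would be one option); the tracker itself is outside this sketch. The function name `pixel_motion` and the frame interval parameter are hypothetical.

```python
import math

def pixel_motion(previous_position, current_position, frame_interval=1.0):
    """Return (speed, direction) of one target pixel.

    Positions are (x, y) tuples in pixels. Speed is the moving distance
    divided by the frame interval; direction is the angle of the
    displacement in radians, measured from the positive x axis.
    """
    dx = current_position[0] - previous_position[0]
    dy = current_position[1] - previous_position[1]
    distance = math.hypot(dx, dy)
    speed = distance / frame_interval
    direction = math.atan2(dy, dx)
    return speed, direction

# Hypothetical corner point tracked from the previous frame to the current frame.
corner_speed, corner_direction = pixel_motion((10.0, 20.0), (13.0, 24.0))
```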

In a possible implementation manner, acquiring the depth information of the background area of the current image frame specifically includes: acquiring the depth value of each pixel in the current image frame and the background area of the current image frame; and determining, from the depth values of all pixels in the current image frame, the depth value of each pixel in the background area of the current image frame. In the foregoing implementation manner, the current image frame includes a foreground area and a background area, and the terminal device may perform area segmentation on the current image frame to obtain the background area of the current image frame. Further, the terminal device can also obtain the depth values of all pixels in the current image frame and determine from them the depth value of each pixel in the background area, so as to use these depth values to divide the background area of the current image frame into multiple sub-regions.
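A minimal sketch of this selection step is shown below. It assumes that the depth values of the whole frame and the background area are available as two grids of the same size, with the background area given as a boolean mask; foreground pixels are marked with `None`. The function name and the sample values are hypothetical.

```python
def background_depth_values(depth, background_mask):
    """Keep the depth value of each background pixel and mark every
    foreground pixel as None (representation assumed for this sketch)."""
    return [
        [value if is_background else None
         for value, is_background in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, background_mask)
    ]

# Hypothetical frame: a vehicle at about 2 m in front of a background at about 8 m.
frame_depth = [[2.0, 2.1, 8.0], [2.0, 7.9, 8.1]]
frame_background_mask = [[False, False, True], [False, True, True]]
background_depth = background_depth_values(frame_depth, frame_background_mask)
```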

In a possible implementation manner, acquiring the depth value of each pixel in the current image frame specifically includes: acquiring the depth value of each pixel in the current image frame through a first neural network. In the foregoing implementation manner, accurate monocular depth estimation can be performed on the current image frame through the first neural network, so as to obtain the depth values of all pixels in the current image frame.

In a possible implementation manner, the camera is a depth camera, and acquiring the depth value of each pixel in the current image frame specifically includes: acquiring the depth value of each pixel in the current image frame through the depth camera. In the foregoing implementation manner, the depth values of all pixels in the current image frame can be accurately acquired through the depth camera.

In a possible implementation manner, acquiring the background area of the current image frame specifically includes: acquiring the background area of the current image frame through a second neural network. In the foregoing implementation manner, accurate salient target detection can be performed on the current image frame through the second neural network, thereby distinguishing the foreground area and the background area of the current image frame to obtain the background area of the current image frame.

In a possible implementation manner, the depth camera is a time of flight (TOF) camera or a structured light camera.

In a possible implementation manner, the first neural network or the second neural network is any one of a multilayer perceptron, a convolutional neural network, a recursive neural network, and a recurrent neural network.

A second aspect of the embodiments of the present application provides a model training method, the method including: obtaining the depth value of each pixel in an image frame to be trained through a first model to be trained; calculating, through a preset target loss function, the deviation between the depth value of each pixel in the image frame to be trained and the true depth value of each pixel in the image frame to be trained; and updating the parameters of the first model to be trained according to the deviation until the model training conditions are met, to obtain the first neural network.

It can be seen from the above method that the first neural network trained by this method can perform accurate monocular depth estimation on any image frame, thereby obtaining the depth values of all pixels in the image frame.
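The predict, compare, and update loop of the second aspect can be sketched with a deliberately tiny stand-in. The application only refers to "a preset target loss function" and "model training conditions"; the linear model with two parameters, the mean squared error loss, plain gradient descent, and the stopping rule below are all assumptions chosen to keep the example short, and they do not represent the first neural network itself.

```python
def train_depth_model(features, true_depths, learning_rate=0.1,
                      max_steps=2000, tolerance=1e-6):
    """Fit depth = weight * feature + bias by gradient descent.

    Assumed loss: mean squared error between predicted and true depths.
    Assumed training condition: loss below tolerance, or max_steps reached.
    """
    weight, bias = 0.0, 0.0
    count = len(features)
    loss = float("inf")
    for _ in range(max_steps):
        predictions = [weight * x + bias for x in features]
        errors = [p - t for p, t in zip(predictions, true_depths)]
        loss = sum(e * e for e in errors) / count
        if loss < tolerance:
            break  # training condition met
        gradient_weight = 2.0 * sum(e * x for e, x in zip(errors, features)) / count
        gradient_bias = 2.0 * sum(errors) / count
        weight -= learning_rate * gradient_weight
        bias -= learning_rate * gradient_bias
    return weight, bias, loss

# Hypothetical per-pixel features and true depths generated by depth = 2 * x + 1.
pixel_features = [0.0, 1.0, 2.0, 3.0]
pixel_true_depths = [1.0, 3.0, 5.0, 7.0]
weight, bias, final_loss = train_depth_model(pixel_features, pixel_true_depths)
```

A real depth estimation network would replace the two-parameter model and would typically be trained with an automatic differentiation framework, but the control flow of the loop is the same.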

A third aspect of the embodiments of the present application provides a model training method, the method including: obtaining the background area of an image frame to be trained through a second model to be trained; calculating, through a preset target loss function, the deviation between the background area of the image frame to be trained and the real background area of the image frame to be trained; and updating the parameters of the second model to be trained according to the deviation until the model training conditions are met, to obtain the second neural network.

It can be seen from the above method that the second neural network trained by this method can accurately detect salient objects in any image frame, thereby obtaining the background area of the image frame.

A fourth aspect of the embodiments of the present application provides an image processing apparatus, which is the aforementioned terminal device, the apparatus including: an acquisition module, configured to acquire depth information of a background area of the current image frame; a division module, configured to divide the background area into a plurality of sub-areas according to the depth information, where the objects corresponding to different sub-areas are at different distances from the camera, and the camera is used to capture the current image frame; and a processing module, configured to perform different degrees of blurring on different sub-areas to obtain the processed current image frame.

It can be seen from the above device that after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the distances from the objects corresponding to the different sub-regions to the camera are different, the motion conditions of the different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can perform blurring processing on different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blurring effect.

In a possible implementation manner, the depth information of the background area of the current image frame includes the depth value of each pixel in the background area of the current image frame, and the division module is specifically configured to: determine the depth change rate of each pixel in the background area of the current image frame according to the depth values of the pixels in the background area, where the depth change rate of each pixel is determined from the depth value of that pixel and the depth values of the surrounding pixels; and divide the background area into multiple sub-areas according to the depth change rate of each pixel and a preset change rate threshold.

In a possible implementation manner, the processing module is specifically configured to: acquire the motion vector of each of the multiple sub-regions, where the motion vector of each sub-region indicates the motion of the sub-region relative to the previous image frame; and perform blurring processing on each sub-region according to the motion vector of the sub-region to obtain the processed current image frame.

In a possible implementation manner, the processing module is specifically configured to: for each of the multiple sub-areas, determine the movement speed of the sub-area according to the movement speed of at least one target pixel in the sub-area from the previous image frame to the current image frame; and determine the movement direction of the sub-area according to the movement direction of the at least one target pixel from the previous image frame to the current image frame.

In a possible implementation manner, the processing module is specifically configured to: for each sub-region, construct the convolution kernel corresponding to the sub-region according to the movement speed and movement direction of the sub-region; and perform convolution processing on the sub-region with the convolution kernel.

In a possible implementation manner, at least one target pixel point is a corner point.

In a possible implementation manner, the movement speed and movement direction of at least one target pixel point are acquired by an optical flow method.

In a possible implementation manner, the acquisition module is specifically configured to: acquire the depth value of each pixel in the current image frame and the background area of the current image frame; and determine, from the depth values of all pixels in the current image frame, the depth value of each pixel in the background area of the current image frame.

In a possible implementation manner, the acquiring module is specifically configured to acquire the depth value of each pixel in the current image frame through the first neural network.

In a possible implementation manner, the camera is a depth camera, and the acquiring module is specifically configured to acquire the depth value of each pixel in the current image frame through the depth camera.

In a possible implementation manner, the acquiring module is specifically configured to acquire the background area of the current image frame through the second neural network.

In a possible implementation manner, the depth camera is a TOF camera or a structured light camera.

A fifth aspect of the embodiments of the present application provides a model training device, the device including: an acquisition module, configured to acquire the depth value of each pixel in the image frame to be trained through the first model to be trained; a calculation module, configured to calculate, through a preset target loss function, the deviation between the depth value of each pixel in the image frame to be trained and the true depth value of each pixel in the image frame to be trained; and an update module, configured to update the parameters of the first model to be trained according to the deviation until the model training conditions are met, to obtain the first neural network.

It can be seen from the above device that the first neural network trained by the device can perform accurate monocular depth estimation on any image frame, thereby obtaining the depth values of all pixels in the image frame.

A sixth aspect of the embodiments of the present application provides a model training device, the device including: an acquisition module, configured to acquire the background area of an image frame to be trained through a second model to be trained; a calculation module, configured to calculate, through a preset target loss function, the deviation between the background area of the image frame to be trained and the real background area of the image frame to be trained; and an update module, configured to update the parameters of the second model to be trained according to the deviation until the model training conditions are met, to obtain the second neural network.

It can be seen from the above device that the second neural network trained by the device can accurately detect salient objects in any image frame, thereby obtaining the background area of the image frame.

A seventh aspect of the embodiments of the present application provides an image processing apparatus, the apparatus including a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the image processing apparatus executes the method described in the first aspect or any one of the possible implementation manners of the first aspect.

An eighth aspect of the embodiments of the present application provides a model training apparatus, the apparatus including a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the model training apparatus executes the method described in the second aspect or the third aspect.

A ninth aspect of the embodiments of the present application provides a circuit system, where the circuit system includes a processing circuit, and the processing circuit is configured to execute the method described in the first aspect, any possible implementation manner of the first aspect, the second aspect, or the third aspect.

A tenth aspect of the embodiments of the present application provides a chip system, where the chip system includes a processor configured to call a computer program or computer instructions stored in a memory, so that the processor executes the method described in the first aspect, any possible implementation manner of the first aspect, the second aspect, or the third aspect.

In one possible implementation, the processor is coupled to the memory through an interface.

In a possible implementation manner, the chip system further includes a memory, and the memory stores computer programs or computer instructions.

An eleventh aspect of the embodiments of the present application provides a computer storage medium, where the computer storage medium stores a computer program, and when the program is executed by a computer, the computer is enabled to implement the method described in the first aspect, any possible implementation manner of the first aspect, the second aspect, or the third aspect.

A twelfth aspect of the embodiments of the present application provides a computer program product, where the computer program product stores instructions that, when executed by a computer, cause the computer to implement the method described in the first aspect, any possible implementation manner of the first aspect, the second aspect, or the third aspect.

In the embodiments of the present application, after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the distances from the objects corresponding to the different sub-regions to the camera are different, the motions of the different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can obtain the motion vector of each sub-region, where the motion vector of each sub-region indicates the motion of the sub-region relative to the previous image frame. Since the motions of different sub-areas are different, that is, the motion vectors of different sub-areas are different, the terminal device can perform blurring processing on each sub-area according to its motion vector, that is, blur different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blur effect.

Description of drawings

FIG. 1 is a schematic structural diagram of the main framework of artificial intelligence;

FIG. 2a is a schematic structural diagram of an image processing system provided by an embodiment of the present application;

FIG. 2b is another schematic structural diagram of an image processing system provided by an embodiment of the present application;

FIG. 2c is a schematic diagram of a related device for image processing provided by an embodiment of the present application;

FIG. 3a is a schematic diagram of the architecture of the system 100 provided by an embodiment of the present application;

FIG. 3b is a schematic diagram of a panning shot;

FIG. 4 is a schematic flowchart of an image processing method provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of an application scenario of the image processing method provided by an embodiment of the present application;

FIG. 6 is a schematic diagram of an application example of the image processing method provided by an embodiment of the present application;

FIG. 7 is another schematic diagram of an application example of the image processing method provided by an embodiment of the present application;

FIG. 8 is a schematic flowchart of a model training method provided by an embodiment of the present application;

FIG. 9 is another schematic flowchart of a model training method provided by an embodiment of the present application;

FIG. 10 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present application;

FIG. 11 is a schematic structural diagram of a model training apparatus provided by an embodiment of the present application;

FIG. 12 is another schematic structural diagram of a model training apparatus provided by an embodiment of the present application;

FIG. 13 is a schematic structural diagram of an execution device provided by an embodiment of the present application;

FIG. 14 is a schematic structural diagram of a training device provided by an embodiment of the present application;

FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application.

Detailed description

The technical solutions in the embodiments of the present application will be described in detail below with reference to the accompanying drawings in the embodiments of the present application.

The terms "first", "second" and the like in the description and claims of the present application and in the above drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or sequence. It should be understood that terms used in this way are interchangeable under appropriate circumstances; this is merely the manner adopted in the embodiments of the present application to distinguish objects with the same attributes. Furthermore, the terms "comprising" and "having", and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, product or device comprising a series of units is not necessarily limited to those units, but may include other units that are not explicitly listed or that are inherent to the process, method, product or device.

Panning refers to a shooting method of tracking a moving target object. The image obtained by this shooting method can present a clear foreground area (including the target object) and a blurred background area. However, panning is usually difficult and hard to control. To obtain an ideal panning lens, the user can capture a group of image frames with the terminal device at a high shutter speed (because the shutter speed is high, the background area of this group of image frames has no obvious blurring effect). The current image frame (that is, any one image frame in the group) is then processed through the frame blending technology, so that the background area of the current image frame has a blur effect.

Due to the limitations of the frame blending technology, if the number of image frames for frame blending is small, the blurring effect of the background area of the current image frame is often unrealistic, such as ghosting and blurring of the background area.

In order to solve the above problems, the present application provides an image processing method, which can be implemented in combination with artificial intelligence (artificial intelligence, AI) technology. AI technology is a technical discipline that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence. AI technology obtains the best results by perceiving the environment, acquiring knowledge and using knowledge. In other words, artificial intelligence technology is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can respond in a similar way to human intelligence. Image processing using artificial intelligence is a common application of artificial intelligence.

First, the overall workflow of the artificial intelligence system will be described. Please refer to FIG. 1, which is a schematic structural diagram of the main framework of artificial intelligence. This framework is elaborated along two dimensions, the "intelligent information chain" and the "IT value chain". The "intelligent information chain" reflects a series of processes from data acquisition to processing. For example, it can be the general process of intelligent information perception, intelligent information representation and formation, intelligent reasoning, intelligent decision-making, and intelligent execution and output. In this process, data goes through the refinement of "data-information-knowledge-wisdom". The "IT value chain" reflects the value that artificial intelligence brings to the information technology industry, from the underlying infrastructure of artificial intelligence and information (providing and processing technology implementations) to the industrial ecology of the system.

(1) Infrastructure

The infrastructure provides computing power for the artificial intelligence system, communicates with the outside world, and is supported by the basic platform. Communication with the outside world takes place through sensors; computing power is provided by smart chips (hardware acceleration chips such as CPU, NPU, GPU, ASIC, and FPGA); the basic platform includes platform guarantees and support related to distributed computing frameworks and networks, and can include cloud storage and computing, interconnection networks, and so on. For example, sensors communicate with the outside world to obtain data, and the data is provided to the smart chips in the distributed computing system provided by the basic platform for computation.

(2) Data

The data on the upper layer of the infrastructure is used to represent the data sources in the field of artificial intelligence. The data involves graphics, images, voice, and text, as well as IoT data from traditional devices, including business data from existing systems and sensory data such as force, displacement, liquid level, temperature, and humidity.

(3) Data processing

Data processing usually includes data training, machine learning, deep learning, search, reasoning, decision-making, etc.

Among them, machine learning and deep learning can perform symbolic and formalized intelligent information modeling, extraction, preprocessing, training, etc. on data.

Reasoning refers to the process of simulating human's intelligent reasoning method in a computer or intelligent system, using formalized information to carry out machine thinking and solving problems according to the reasoning control strategy, and the typical function is search and matching.

Decision-making refers to the process of making decisions after intelligent information is reasoned, usually providing functions such as classification, sorting, and prediction.

(4) General ability

After the above-mentioned data processing, some general capabilities can be formed based on the results of data processing, such as algorithms or a general system, such as translation, text analysis, computer vision processing, speech recognition, image identification, etc.

(5) Smart products and industry applications

Intelligent products and industry applications refer to the products and applications of artificial intelligence systems in various fields. They encapsulate the overall artificial intelligence solution and turn intelligent information decision-making into products that are put into practical use. Their application areas mainly include intelligent terminals, intelligent transportation, smart healthcare, autonomous driving, smart cities, and so on.

Next, several application scenarios of the present application are introduced.

FIG. 2a is a schematic structural diagram of an image processing system provided by an embodiment of the present application, where the image processing system includes a user equipment and a data processing device. The user equipment includes smart terminals such as mobile phones, personal computers, or information processing centers. The user equipment is the initiator of image processing. As the initiator of the image processing request, the user usually initiates the request through the user equipment.

The above-mentioned data processing device may be a device or server with data processing functions, such as a cloud server, a network server, an application server, and a management server. The data processing device receives the image processing request from the intelligent terminal through the interactive interface, and then performs image processing in the form of machine learning, deep learning, search, reasoning, decision-making, etc. through the memory for storing data and the processor for data processing. The memory in the data processing device may be a general term, including local storage and a database for storing historical data. The database may be on the data processing device or on other network servers.

In the image processing system shown in FIG. 2a, the user equipment can receive instructions from the user. For example, the user equipment can acquire an image input or selected by the user and then initiate a request to the data processing device, so that the data processing device executes an image processing application (for example, image depth estimation, image object detection, or image blurring) on the image obtained by the user equipment, thereby obtaining the corresponding processing result for the image. Exemplarily, the user equipment may acquire an image input by the user and then initiate an image depth estimation request to the data processing device, so that the data processing device performs monocular depth estimation on the image, thereby obtaining the depth information of the image.

In Fig. 2a, the data processing device may execute the image processing method of the embodiment of the present application.

Fig. 2b is another schematic structural diagram of the image processing system provided by the embodiment of the application. In Fig. 2b, the user equipment is directly used as a data processing device, and the user equipment can directly obtain the input from the user and directly perform the processing by the hardware of the user equipment itself. The specific process of the processing is similar to that of FIG. 2a, and reference may be made to the above description, which will not be repeated here.

In the image processing system shown in FIG. 2b, the user equipment can receive instructions from the user. For example, the user equipment can acquire an image selected by the user in the user equipment, and the user equipment itself can then execute an image processing application (for example, image depth estimation, image object detection, or image blurring) on the image, so as to obtain the corresponding processing result for the image.

In FIG. 2b, the user equipment itself can execute the image processing method of the embodiment of the present application.

FIG. 2c is a schematic diagram of a related device for image processing provided by an embodiment of the present application.

The user equipment in FIG. 2a and FIG. 2b may specifically be the local device 301 or the local device 302 in FIG. 2c, and the data processing device in FIG. 2a may specifically be the execution device 210 in FIG. 2c. The data storage system 250 may store the data to be processed by the execution device 210, and may be integrated on the execution device 210 or deployed on the cloud or on another network server.

The processors in FIG. 2a and FIG. 2b may perform data training, machine learning, or deep learning through a neural network model or another model (for example, a model based on a support vector machine), and use the model finally obtained by training or learning to execute an image processing application on the image, thereby obtaining the corresponding processing result.

FIG. 3a is a schematic diagram of the architecture of the system 100 provided by the embodiment of the present application. In FIG. 3a, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. The user may input data to the I/O interface 112 through the client device 140. In this embodiment of the present application, the input data may include various tasks to be scheduled, callable resources, and other parameters.

When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs computation and other related processing (for example, implementing the functions of the neural network in this application), the execution device 110 may call data, code, and the like in the data storage system 150 for the corresponding processing, and may also store the data, instructions, and the like obtained from the corresponding processing in the data storage system 150.

Finally, the I/O interface 112 returns the processing results to the client device 140 for provision to the user.

It is worth noting that the training device 120 can generate, for different goals or tasks, corresponding target models/rules based on different training data. The corresponding target models/rules can be used to achieve the above goals or complete the above tasks, thereby providing the user with the desired result. The training data may be stored in the database 130 and come from training samples collected by the data collection device 160.

In the case shown in FIG. 3a, the user can manually specify the input data through the interface provided by the I/O interface 112. In another case, the client device 140 can automatically send the input data to the I/O interface 112. If the client device 140 requires the user's authorization to send the input data automatically, the user can set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 on the client device 140, and the result can be presented in a specific form such as display, sound, or action. The client device 140 can also serve as a data collection terminal that collects the input data fed to the I/O interface 112 and the output results returned by the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130. Alternatively, the collection need not go through the client device 140: the I/O interface 112 can directly store the input data fed to it and the output results it returns, as shown in the figure, in the database 130 as new sample data.

It is worth noting that FIG. 3a is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship between the devices, components, modules, and the like shown in the figure does not constitute any limitation. For example, in FIG. 3a the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110. As shown in FIG. 3a, the neural network can be obtained through training by the training device 120.

An embodiment of the present application also provides a chip, where the chip includes a neural network processor (NPU). The chip can be set in the execution device 110 as shown in FIG. 3a to complete the calculation work of the calculation module 111. The chip can also be set in the training device 120 as shown in FIG. 3a to complete the training work of the training device 120 and output the target model/rule.

The neural network processor NPU is mounted on the main central processing unit (CPU) (host CPU) as a co-processor, and tasks are allocated by the main CPU. The core part of the NPU is an arithmetic circuit, and the controller controls the arithmetic circuit to extract the data in the memory (weight memory or input memory) and perform operations.

In some implementations, the arithmetic circuit includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit is a two-dimensional systolic array. The arithmetic circuit may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory, performs a matrix operation with matrix B, and stores the partial or final results of the resulting matrix in an accumulator.
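The role of the accumulator in this example can be sketched in software as follows. This is only an illustration of accumulating partial results of C = A x B, not a model of the actual circuit; the function name `matmul_accumulate` and the `tile` parameter are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_accumulate(a: Matrix, b: Matrix, tile: int = 2) -> Matrix:
    """Compute C = A x B by accumulating partial products slice by slice."""
    rows, inner, cols = len(a), len(b), len(b[0])
    # `acc` plays the role of the accumulator holding partial results.
    acc = [[0.0] * cols for _ in range(rows)]
    # Each pass adds the contribution of one slice of the inner dimension.
    for k0 in range(0, inner, tile):
        for i in range(rows):
            for j in range(cols):
                for k in range(k0, min(k0 + tile, inner)):
                    acc[i][j] += a[i][k] * b[k][j]
    return acc
```

After the last pass the accumulator holds the final result; after any earlier pass it holds a partial result.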

The vector calculation unit can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. For example, the vector computing unit can be used for network computation of non-convolutional/non-FC layers in neural networks, such as pooling, batch normalization, local response normalization, etc.

In some implementations, the vector computation unit can store the processed output vector to a unified buffer. For example, the vector computing unit may apply a nonlinear function to the output of the arithmetic circuit, such as a vector of accumulated values, to generate activation values. In some implementations, the vector computation unit generates normalized values, merged values, or both. In some implementations, the vector of processed outputs can be used as activation input to an arithmetic circuit, such as for use in subsequent layers in a neural network.

Unified memory is used to store input data as well as output data.

The direct memory access controller (DMAC) transfers the input data in the external memory to the input memory and/or the unified memory, stores the weight data in the external memory into the weight memory, and stores the data in the unified memory into the external memory.

The bus interface unit (BIU) is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory through the bus.

The instruction fetch buffer connected to the controller is used to store the instructions used by the controller.

The controller is used for invoking the instructions cached in the instruction fetch buffer to control the working process of the operation accelerator.

Generally, the unified memory, input memory, weight memory and instruction fetch memory are all on-chip memories, and the external memory is the memory outside the NPU, and the external memory can be double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM), high bandwidth memory (HBM), or other readable and writable memory.

Since the embodiments of the present application involve a large number of neural network applications, for ease of understanding, related terms and neural networks and other related concepts involved in the embodiments of the present application are first introduced below.

(1) Neural network

A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes xs and an intercept of 1 as inputs, and the output of the operation unit can be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

Among them, s=1, 2,...n, n is a natural number greater than 1, Ws is the weight of xs, and b is the bias of the neural unit. f is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer. The activation function can be a sigmoid function. A neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit. The input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
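As a minimal sketch of the neural unit described above, the following Python code computes the weighted sum of the inputs plus the bias and applies a sigmoid activation function. The function names are hypothetical and the numeric values in the usage note are arbitrary.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    """Sigmoid activation function f."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Return f(sum over s of W_s * x_s, plus b), with f chosen as the sigmoid."""
    z = sum(ws * xs for ws, xs in zip(w, x)) + b
    return sigmoid(z)
```

For instance, calling `neuron_output([1.0, 2.0], [0.5, -0.25], 0.0)` gives a weighted sum of zero, which the sigmoid maps to 0.5.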

The work of each layer in the neural network can be described by the mathematical expression y=a(Wx+b). At the physical level, the work of each layer can be understood as completing the transformation from the input space to the output space (that is, from the row space of the matrix to the column space) through five operations on the input space (a set of input vectors). These five operations are: 1. dimension raising/lowering; 2. enlarging/reducing; 3. rotation; 4. translation; 5. "bending". Operations 1, 2, and 3 are completed by Wx, operation 4 is completed by +b, and operation 5 is implemented by a(). The word "space" is used here because the object to be classified is not a single thing but a type of thing, and space refers to the collection of all individuals of this type of thing. W is the weight vector, and each value in the vector represents the weight value of a neuron in this layer of the neural network. The vector W determines the space transformation from the input space to the output space described above, that is, the weight W of each layer controls how the space is transformed. The purpose of training the neural network is to finally obtain the weight matrix of all layers of the trained neural network (the weight matrix formed by the vectors W of many layers). Therefore, the training process of the neural network is essentially learning how to control the spatial transformation, and more specifically, learning the weight matrix.

Because the output of the neural network should be as close as possible to the value that is actually desired, the predicted value of the current network can be compared with the desired target value, and the weight vector of each layer of the neural network can then be updated according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer in the neural network). For example, if the predicted value of the network is too high, the weight vector is adjusted to make the prediction lower, and the adjustment continues until the neural network can predict the desired target value. It is therefore necessary to define in advance "how to compare the difference between the predicted value and the target value". This is the loss function or objective function, an important equation used to measure the difference between the predicted value and the target value. Taking the loss function as an example, the higher the output value (loss) of the loss function, the greater the difference, so training the neural network becomes a process of reducing the loss as much as possible.
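As one possible illustration, assuming a mean squared error loss and plain gradient descent (neither is specified in this application; the sample count N, the learning rate η, and the symbols ŷ_i and y_i for the predicted and target values are introduced here only for the example), the loss and the parameter update could be written as:

$$L(W,b) = \frac{1}{N}\sum_{i=1}^{N}\left(\hat{y}_i - y_i\right)^2, \qquad \hat{y}_i = a(Wx_i + b)$$

$$W \leftarrow W - \eta\,\frac{\partial L}{\partial W}, \qquad b \leftarrow b - \eta\,\frac{\partial L}{\partial b}$$

Under these assumptions, a larger gap between the predicted value and the target value gives a larger loss, and each update moves the parameters in the direction that reduces it.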

(2) Back propagation algorithm

The neural network can use the error back propagation (BP) algorithm to correct the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is passed forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back-propagation algorithm is a backward pass driven by the error loss, and aims to obtain the parameters of the optimal neural network model, such as the weight matrix.
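The forward pass and the backward pass described above can be sketched for a deliberately tiny model. The model y = w2 * sigmoid(w1 * x + b1) + b2, the squared-error loss, the learning rate, and the initial parameter values are all assumptions chosen only to keep the example short; they are not taken from this application.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_step(params, x, target, lr=0.1):
    """One forward/backward pass for y = w2 * sigmoid(w1 * x + b1) + b2."""
    w1, b1, w2, b2 = params
    # Forward pass: propagate the input to the output and measure the error loss.
    h = sigmoid(w1 * x + b1)
    y = w2 * h + b2
    loss = 0.5 * (y - target) ** 2
    # Backward pass: propagate the error from the output back to each parameter.
    dy = y - target
    dw2, db2 = dy * h, dy
    dz = dy * w2 * h * (1.0 - h)  # chain rule through the sigmoid
    dw1, db1 = dz * x, dz
    # Update every parameter against its gradient.
    new_params = (w1 - lr * dw1, b1 - lr * db1, w2 - lr * dw2, b2 - lr * db2)
    return new_params, loss


params = (0.5, 0.0, 0.5, 0.0)  # arbitrary initial parameters
losses = []
for _ in range(200):
    params, loss = train_step(params, x=1.0, target=1.0)
    losses.append(loss)
```

Repeating the step drives the recorded loss toward zero for this single training sample, which is the convergence of the error loss referred to above.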

(3) Panning

Panning refers to a shooting technique that uses a slow shutter speed to track the target object. Specifically, the photographer follows the moving target object, swings the camera in the same direction at a similar speed, and shoots. This technique is mainly used for sports photography. The panning lens refers to an image captured in this way. Such images can present an artistic effect of dynamic background blur, that is, the foreground area (including the target object) of such an image is clear and the background area is blurred. As shown in FIG. 3b (FIG. 3b is a schematic diagram of a panning lens), in this panning lens the foreground area (that is, the moving car) is clear, and the background area (that is, the surrounding environment and other objects near the car) is blurred.

The method provided by the present application will be described below from the training side of the neural network and the application side of the neural network.

The model training method provided by the embodiment of the present application involves the processing of images, and can be specifically applied to data processing methods such as data training, machine learning, and deep learning. It performs symbolic and formalized intelligent information modeling, extraction, preprocessing, training, and the like on training data (such as the image frames to be trained in the present application), and finally obtains a trained neural network (such as the first neural network and the second neural network in this application). The image processing method provided by the embodiments of this application can use the above trained neural network: input data (such as the current image frame in this application) is fed into the trained neural network to obtain output data (such as the depth information of the current image frame and the background area of the current image frame in this application). It should be noted that the model training method and the image processing method provided in the embodiments of this application are inventions based on the same idea, and can also be understood as two parts of one system, or two stages of an overall process, such as the model training phase and the model application phase.

FIG. 4 is a schematic flowchart of an image processing method provided by an embodiment of the present application. The background area of an image frame processed by this method can have a realistic blur effect. As shown in FIG. 5 (FIG. 5 is a schematic diagram of an application scenario of the image processing method provided by the embodiment of the present application), the terminal device may select a certain image frame from a group of continuous image frames and process it, so that the background area of that image frame has a realistic dynamic blur effect. In addition, the terminal device can also process each image frame of the group, so that the background area of each image frame has a realistic dynamic blur effect.

The image processing method provided by the embodiment of the present application will be introduced in detail below. As shown in FIG. 4 , the method includes:

401. Acquire the depth information of the current image frame and the background area of the current image frame.

When the user needs to pan the moving target object, a group of continuous image frames can be acquired at a higher shutter speed through the camera of the terminal device (ie, the aforementioned user device or client device). Specifically, the user may capture the set of image frames in various ways. For example, the user can set the mode of the camera of the terminal device to the image continuous shooting mode, and then long press the shutter to acquire the group of image frames. For another example, the user can continuously press the shutter to acquire the group of image frames. For another example, the user can determine whether the current shooting scene conforms to a specific scene (a scene in which the target object is in motion) through the perception technology of the terminal device, and if so, trigger continuous shooting or multiple shooting to obtain the group of image frames. For another example, the user can set the mode of the camera of the terminal device to the video recording mode, so as to obtain the group of image frames and so on.

In this group of image frames, all image frames are sorted chronologically, and each image frame includes a foreground area and a background area. Both the foreground area and the background area contain (present) photographed objects: the photographed object contained in the foreground area is generally the target object that the user pays attention to, and the photographed objects contained in the background area are non-target objects that the user does not pay attention to. For example, the photographed object contained in the foreground area may be a moving car, and the photographed objects contained in the background area may be the sky, flowers and plants, roads, street lights, and so on around the car. For another example, the photographed object contained in the foreground area may be a person skiing, and the photographed objects contained in the background area may be a house, snow, trees, and so on around the person.

Since the background areas of the group of image frames do not have an obvious blurring effect, the terminal device needs to process them so that the background area of one or more image frames has a realistic blurring effect. In the group of image frames, the terminal device may select any one of the image frames as the image frame to be processed, that is, the current image frame. Specifically, the terminal device can select the current image frame in various ways. For example, the terminal device may determine the current image frame from the group of image frames according to an instruction input by the user, that is, the current image frame is the image frame designated by the user. For another example, the terminal device may score each image frame of the group according to an aesthetic evaluation algorithm, and determine the image frame with the highest score as the current image frame.
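The two selection strategies above can be sketched as follows. The aesthetic evaluation algorithm itself is not described here, so the `score` callable is a hypothetical stand-in supplied by the caller, and the function name `select_current_frame` is likewise only illustrative.

```python
from typing import Callable, Optional, Sequence, TypeVar

Frame = TypeVar("Frame")


def select_current_frame(frames: Sequence[Frame],
                         score: Callable[[Frame], float],
                         user_index: Optional[int] = None) -> int:
    """Return the index of the image frame to process."""
    if user_index is not None:
        return user_index  # the user designated the frame
    # Otherwise pick the frame with the highest (hypothetical) aesthetic score.
    return max(range(len(frames)), key=lambda i: score(frames[i]))
```

A user-designated frame takes priority in this sketch; that ordering is a design choice of the example rather than a requirement stated above.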

After determining the current image frame, the terminal device can obtain the depth information of the current image frame and the background area of the current image frame. The depth information of the current image frame is the depth value of each pixel in the current image frame, that is, the depth values of all pixels in the current image frame. The depth value of each pixel indicates the distance from the position corresponding to that pixel in the actual environment (three-dimensional space) to the camera. In this way, the depth information of the current image frame can be used to indicate the distance from each photographed object included in the current image frame to the camera, that is, the distance from the position of these objects in the actual environment to the camera.

It is worth noting that the terminal device can obtain the depth value of each pixel in the current image frame in various ways. For example, the terminal device can obtain the depth value of each pixel through the first neural network, that is, monocular depth estimation is performed on the current image frame through the first neural network, so as to obtain the depth values of all pixels in the current image frame. For another example, if the terminal device has a depth camera, then after the terminal device obtains the current image frame through the depth camera, it can simultaneously obtain the depth values of all pixels in the current image frame. Further, the depth camera of the terminal device may be a time-of-flight (TOF) camera or a structured light camera.

The terminal device can also obtain the background area of the current image frame in various ways. For example, the terminal device can obtain the background area of the current image frame through the second neural network. That is, the terminal device can perform salient object detection on the current image frame through the second neural network (directly detecting the most prominent object in the current image frame, that is, the target object) and thereby directly distinguish the foreground area and the background area of the current image frame. Alternatively, the terminal device can perform target detection (detecting each photographed object in the current image frame) and target segmentation (determining the target object from among the photographed objects) on the current image frame through the second neural network. For another example, the terminal device may divide the current image frame into a foreground area and a background area according to the user's instruction, and so on.
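Once the per-pixel depth values and the foreground/background division are both available, the depth information of the background area can be read off by masking, as in the following minimal sketch. The list-of-lists data layout, the use of `None` to mark foreground pixels, and the function name are assumptions made only for illustration.

```python
from typing import List, Optional


def background_depth(depth: List[List[float]],
                     foreground: List[List[bool]]) -> List[List[Optional[float]]]:
    """Keep depth values for background pixels; mark foreground pixels as None."""
    return [[None if fg else d for d, fg in zip(d_row, fg_row)]
            for d_row, fg_row in zip(depth, foreground)]
```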

It should be understood that the first neural network can be a multi-layer perceptron (MLP), a convolutional neural network (CNN), a recursive neural network, a recurrent neural network (RNN), or another such model. The second neural network can likewise be any one of an MLP, a CNN, a recursive neural network, an RNN, or another such model, which is not limited here.

It should also be understood that the first neural network and the second neural network in the embodiments of the present application are both trained neural network models. The following will briefly introduce the training process of the first neural network and the second neural network:

(1) Before model training, a batch of image frames to be trained is obtained, and the true depth values of all pixels in each image frame to be trained are determined in advance. After training starts, an image frame to be trained is input to the first model to be trained, and the depth value of each pixel in that image frame is obtained through the first model to be trained. Finally, the difference between the depth value of each pixel output by the first model to be trained and the true depth value is calculated through a preset target loss function. If the difference is within the qualified range, the image frame to be trained is regarded as a qualified image frame to be trained; if it is outside the qualified range, it is regarded as an unqualified image frame to be trained. The foregoing process is performed for each image frame in the batch, which will not be repeated here. If only a small number of image frames in the batch are qualified, the parameters of the first model to be trained are adjusted, and training is repeated with another batch of image frames to be trained until a large number of image frames are qualified, thereby obtaining the first neural network.

(2) Before model training, a batch of image frames to be trained is obtained, and the real background area of each image frame to be trained is determined in advance. After training starts, an image frame to be trained is input to the second model to be trained, and the background area of that image frame is obtained through the second model to be trained. Finally, the difference between the background area output by the second model to be trained and the real background area is calculated through the target loss function. If the difference is within the qualified range, the image frame to be trained is regarded as a qualified image frame to be trained; if it is outside the qualified range, it is regarded as an unqualified image frame to be trained. The foregoing process is performed for each image frame in the batch, which will not be repeated here. If only a small number of image frames in the batch are qualified, the parameters of the second model to be trained are adjusted, and training is repeated with another batch of image frames to be trained until a large number of image frames are qualified, thereby obtaining the second neural network.
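The per-batch qualification check described in (1) and (2) can be sketched as follows. This is only an illustration: the source says "preset target loss function" without naming one, so the mean absolute error used here, the function names `frame_error` and `qualified_fraction`, and the `max_error` parameter are all assumptions.

```python
def frame_error(predicted, truth):
    """Mean absolute per-pixel difference between two equally sized 2D grids.

    Assumption: stands in for the unspecified "target loss function".
    """
    total = 0.0
    count = 0
    for pred_row, true_row in zip(predicted, truth):
        for p, t in zip(pred_row, true_row):
            total += abs(p - t)
            count += 1
    return total / count


def qualified_fraction(predictions, truths, max_error):
    """Fraction of frames in a batch whose error is within the qualified range."""
    qualified = sum(
        1 for pred, true in zip(predictions, truths)
        if frame_error(pred, true) <= max_error
    )
    return qualified / len(predictions)


# Two tiny one-row "frames": the first is close to the truth, the second is not.
batch_predictions = [[[1.0, 2.0]], [[5.0, 5.0]]]
batch_truths = [[[1.0, 2.5]], [[1.0, 1.0]]]
fraction = qualified_fraction(batch_predictions, batch_truths, max_error=0.5)
```

Training would continue with further batches while `fraction` stays low; what counts as "a large number" of qualified frames is left open by the source.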

402. From the depth information of the current image frame, determine the depth information of the background region of the current image frame.

After obtaining the depth information of the current image frame and the background area of the current image frame, the terminal device can determine the depth information of the background area from the depth information of the current image frame. The depth information of the background area is used to indicate the distance from each subject contained in the background area to the camera, that is, the distance from the position of these subjects in the actual environment to the camera. Specifically, the terminal device can determine, among all pixels in the current image frame, which pixels are located in the background area; the depth values of those pixels are the depth values of all pixels in the background area of the current image frame.
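Step 402 amounts to selecting depth values with a background mask. A minimal sketch, assuming the depth information is a 2D grid of numbers and the background area is a boolean grid of the same size (both representations, and the name `background_depth`, are illustrative choices not fixed by the source):

```python
def background_depth(depth, background_mask):
    """Keep the depth of background pixels; mark foreground pixels as None."""
    return [
        [d if is_bg else None for d, is_bg in zip(depth_row, mask_row)]
        for depth_row, mask_row in zip(depth, background_mask)
    ]


depth = [[1.0, 2.0],
         [3.0, 4.0]]
mask = [[True, False],
        [True, True]]
bg = background_depth(depth, mask)
```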

403. Divide the background area into a plurality of sub-areas according to the depth information of the background area of the current image frame, where the subjects corresponding to different sub-areas are at different distances from the camera.

After obtaining the depth information of the background area of the current image frame, the terminal device can divide the background area into multiple sub-areas according to the depth information, where the subjects corresponding to different sub-areas are at different distances from the camera. Specifically, after obtaining the depth values of all pixels in the background area of the current image frame, the terminal device calculates the depth change rate of each pixel according to its depth value. The calculation formulas are as follows:

G(i,j)=dx(i,j)+dy(i,j)

dx(i,j)=D(i+1,j)-D(i,j)

dy(i,j)=D(i,j+1)-D(i,j)

In the above formulas, G(i,j) is the depth change rate of the pixel at (i,j), D(i,j) is the depth value of that pixel, D(i+1,j) and D(i,j+1) are the depth values of the neighboring pixels, i=1,2,3,...,N, and j=1,2,3,...,N.
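The formulas above can be evaluated directly on a depth grid. The sketch below follows them term by term; two details are assumptions, since the source does not state them: i is treated as the first list index and j as the second, and a difference whose neighbor falls outside the grid is taken as 0.

```python
def depth_change_rate(D):
    """Compute G(i,j) = dx(i,j) + dy(i,j) with forward differences."""
    n_i = len(D)
    n_j = len(D[0])
    G = [[0.0] * n_j for _ in range(n_i)]
    for i in range(n_i):
        for j in range(n_j):
            # dx(i,j) = D(i+1,j) - D(i,j); 0 at the last index (assumption)
            dx = D[i + 1][j] - D[i][j] if i + 1 < n_i else 0.0
            # dy(i,j) = D(i,j+1) - D(i,j); 0 at the last index (assumption)
            dy = D[i][j + 1] - D[i][j] if j + 1 < n_j else 0.0
            G[i][j] = dx + dy
    return G


# Depth jumps from 1 to 5 between the second and third positions along j.
D = [[1.0, 1.0, 5.0],
     [1.0, 1.0, 5.0]]
G = depth_change_rate(D)
```

In this example the only nonzero change rates sit where the depth jumps, which is the behavior the following paragraphs rely on.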

In this way, the terminal device can obtain the depth change rates of all pixels in the background area of the current image frame. For any pixel, the depth change rate indicates the difference between the depth value of the pixel and the depth values of the surrounding pixels, that is, the difference between the distance from the pixel's corresponding position in the actual environment to the camera and the distance from the surrounding pixels' corresponding positions in the actual environment to the camera. When the depth change rate of a pixel is small, the distance from the actual position of the pixel (its corresponding position in the actual environment) to the camera differs little from the distance from the actual positions of the surrounding pixels to the camera. When the depth change rate of a pixel is large, these two distances differ greatly. Therefore, the terminal device may divide the background area of the current image frame into multiple sub-areas according to the depth change rate. Specifically, the terminal device may divide the background area according to the depth change rate of each pixel in the background area and a preset change rate threshold. It should be noted that the change rate threshold is equal to or approximately equal to the depth change rate of the edge points of each sub-area, and the threshold is generally set to a relatively large value. The depth value of an edge point therefore differs greatly from the depth values of the pixels around it, that is, the depth value changes abruptly at the edge points. In other words, there is a large difference between the distance from the actual position of an edge point to the camera and the distance from the actual positions of the surrounding pixels to the camera. Therefore, through the depth change rate of each pixel in the background area of the current image frame and the preset change rate threshold, the edge points of each sub-area in the background area can be determined, and then the multiple sub-areas can be determined. In this way, different sub-areas correspond to actual positions at different distances: the distances from the subjects in the same sub-area to the camera are the same or similar, while the distances from the subjects in different sub-areas to the camera differ more obviously.
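One possible way to turn edge points into sub-areas is sketched below. The source only says that edge points are found by comparing the depth change rate with a threshold; the rest is an assumption made for illustration: the comparison uses the magnitude of the change rate, sub-areas are the 4-connected groups of non-edge pixels found by a breadth-first flood fill, and edge pixels keep the label 0. The name `split_into_subareas` is hypothetical.

```python
from collections import deque


def split_into_subareas(G, threshold):
    """Label 4-connected groups of non-edge pixels; edge pixels stay 0.

    Returns (labels, number_of_subareas).
    """
    rows, cols = len(G), len(G[0])
    labels = [[0] * cols for _ in range(rows)]
    next_label = 0
    for r in range(rows):
        for c in range(cols):
            # Skip pixels already labeled and pixels treated as edge points.
            if labels[r][c] != 0 or abs(G[r][c]) >= threshold:
                continue
            next_label += 1
            labels[r][c] = next_label
            queue = deque([(r, c)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < rows and 0 <= nx < cols
                            and labels[ny][nx] == 0
                            and abs(G[ny][nx]) < threshold):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
    return labels, next_label


# Change rates with one column of edge points separating two flat regions.
G = [[0.0, 4.0, 0.0],
     [0.0, 4.0, 0.0]]
labels, count = split_into_subareas(G, threshold=2.0)
```

A practical implementation would also need to assign the edge pixels themselves to a neighboring sub-area, which this sketch leaves out.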

For example, the terminal device divides the background area of the current image frame into three sub-areas according to the depth change rates of all pixels in the background area: the first sub-area is the road on which the car is driving, the second sub-area is the plants behind the road, and the third sub-area is the buildings behind the plants. It can be seen that the subject contained in the first sub-area is the closest to the camera, and the subject contained in the third sub-area is the farthest from the camera.

404. In the multiple sub-regions, obtain a motion vector of each sub-region, where the motion vector of each sub-region is used to indicate the motion of the sub-region relative to the previous image frame.

When the camera tracks the target object, it generally rotates or translates. While the camera is moving, subjects at different distances move differently relative to the camera (this can also be understood as a different degree of movement). For example, closer subjects move to a greater extent and farther subjects move to a lesser extent, which is reflected in the successive image frames captured by the camera. Specifically, among the multiple sub-regions of the background region of the current image frame, since the subjects included in different sub-regions are at different distances from the camera, the subjects corresponding to different sub-regions move differently relative to the camera. When the camera captures two adjacent image frames, the position of a sub-area (which can also be understood as the subject contained in the sub-area) in the current image frame must have changed to some extent compared with its position in the previous image frame, and the position changes of different sub-regions are different, that is, the motion of different sub-regions is different. It can be seen that, taking the previous image frame as a reference, different sub-regions of the background region of the current image frame have different motion conditions relative to the previous image frame.

As in the above example, suppose the background area of the current image frame contains three sub-areas: the first sub-area is the road on which the car is driving, the second sub-area is the plants behind the road, and the third sub-area is the buildings behind the plants. Then, from the previous image frame to the current image frame, the first sub-area moves the most, the second sub-area moves the second most, and the third sub-area moves the least.

In order to determine the motion of each sub-area in the background area of the current image frame, the terminal device can obtain the motion vector of each sub-area. The motion vector of each sub-area includes the motion speed and the motion direction of the sub-area, and is used to indicate the motion of the sub-area relative to the previous image frame. Specifically, for each of the multiple sub-areas, the terminal device may first perform corner detection on the sub-area to determine at least one target pixel (that is, a corner); these target pixels are usually the pixels with the most distinctive features in the sub-area. Then, the terminal device uses the optical flow method to determine the moving distance of these target pixels from the previous image frame to the current image frame, their positions in the previous image frame, and their positions in the current image frame. Next, the terminal device calculates the movement speed of these target pixels from the previous image frame to the current image frame according to their moving distance and the time difference between the previous image frame and the current image frame, and determines their movement direction from the previous image frame to the current image frame according to their positions in the previous image frame and in the current image frame. Finally, the terminal device can determine the movement speed of the sub-area according to the movement speed of these target pixels (for example, as the average of their movement speeds), and take the movement direction of these target pixels as the movement direction of the sub-area.
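The speed and direction computation can be sketched as below, assuming corner detection and optical flow have already produced matched corner positions in the previous and current frames. The function name `subregion_motion`, the (x, y) tuple representation, and the choice to derive the direction from the summed displacement of all corners are illustrative assumptions; the source only states that the sub-area speed may be, for example, the average of the corner speeds.

```python
import math


def subregion_motion(prev_points, curr_points, dt):
    """Return (speed, direction) of one sub-area.

    prev_points, curr_points: matched (x, y) corner positions in pixels.
    dt: time difference between the two frames in seconds.
    Speed is in pixels per second; direction is an angle in radians.
    """
    speeds = []
    sum_dx = 0.0
    sum_dy = 0.0
    for (x0, y0), (x1, y1) in zip(prev_points, curr_points):
        dx, dy = x1 - x0, y1 - y0
        # Moving distance divided by the time difference gives the speed.
        speeds.append(math.hypot(dx, dy) / dt)
        sum_dx += dx
        sum_dy += dy
    speed = sum(speeds) / len(speeds)        # average corner speed
    direction = math.atan2(sum_dy, sum_dx)   # direction of summed displacement
    return speed, direction


# Two corners that both move by (3, 4) pixels in half a second.
speed, direction = subregion_motion([(0, 0), (10, 10)], [(3, 4), (13, 14)], dt=0.5)
```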

It should be noted that, for the determination process of the movement speed and movement direction of the remaining sub-regions, reference may be made to the foregoing description part, which will not be repeated here.

405. Perform blurring processing on each sub-region according to the motion vector of the sub-region.

After obtaining the motion speed and motion direction of each sub-area in the background area of the current image frame, for each sub-area, the terminal device constructs the convolution kernel corresponding to the sub-area according to the motion speed and motion direction of the sub-area, and then performs convolution processing on the sub-area through that convolution kernel. Since the motions of different sub-regions are different, the convolution kernels corresponding to different sub-regions are also different. The terminal device can therefore use these convolution kernels to blur different sub-regions to different degrees. In this way, in the background area of the current image frame, different sub-areas may have different degrees of blurring, thereby achieving a hierarchical and more realistic motion blur effect.
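A minimal sketch of one way to realize this step, assuming a linear motion-blur kernel (a normalized line of taps oriented along the motion direction). The source does not specify the kernel shape, so the kernel construction, the function names, the label grid used to select a sub-area, and the border clamping are all illustrative assumptions. How the motion speed maps to the kernel length (for example, speed multiplied by a simulated exposure time) is likewise left open by the source.

```python
import math


def motion_blur_kernel(length, direction):
    """Square kernel holding a normalized line of `length` taps along `direction`."""
    length = max(1, int(round(length)))
    size = length if length % 2 == 1 else length + 1  # odd size so a center exists
    center = size // 2
    kernel = [[0.0] * size for _ in range(size)]
    for k in range(length):
        t = k - (length - 1) / 2.0
        x = int(round(center + t * math.cos(direction)))
        y = int(round(center + t * math.sin(direction)))
        kernel[y][x] += 1.0
    total = sum(sum(row) for row in kernel)
    return [[v / total for v in row] for row in kernel]


def convolve_subarea(image, labels, label, kernel):
    """Convolve only the pixels whose label matches; borders are clamped."""
    rows, cols = len(image), len(image[0])
    half = len(kernel) // 2
    out = [row[:] for row in image]
    for r in range(rows):
        for c in range(cols):
            if labels[r][c] != label:
                continue
            acc = 0.0
            for ky, kernel_row in enumerate(kernel):
                for kx, w in enumerate(kernel_row):
                    # True convolution: the kernel offset is subtracted.
                    yy = min(max(r - (ky - half), 0), rows - 1)
                    xx = min(max(c - (kx - half), 0), cols - 1)
                    acc += w * image[yy][xx]
            out[r][c] = acc
    return out


# A single bright pixel smeared horizontally by a 3-tap kernel.
kernel = motion_blur_kernel(3, 0.0)
blurred = convolve_subarea([[0.0, 3.0, 0.0]], [[1, 1, 1]], 1, kernel)
```

A faster-moving sub-area would get a longer kernel and so a stronger blur, which is what produces the layered effect described above.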

In this embodiment of the present application, after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the subjects corresponding to different sub-regions are at different distances from the camera, the motion conditions of different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can acquire the motion vector of each sub-region, which is used to indicate the motion of the sub-region relative to the previous image frame. Since the motion conditions of different sub-areas are different, that is, the motion vectors of different sub-areas are different, the terminal device can blur each sub-area according to its motion vector, blurring different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blurring effect.

For further understanding, the image processing method provided by the embodiment of the present application is further introduced below with reference to an application example. FIG. 6 is a schematic diagram of an application example of the image processing method provided by the embodiment of the application, and FIG. 7 is another schematic diagram of the application example. As shown in FIG. 6 and FIG. 7, the application example includes:

(1) After the terminal device determines the current image frame 601, it obtains the depth image 602 of the current image frame (that is, the depth information of the current image frame) through the first neural network, wherein areas of different colors in the depth image 602 are at different distances from the camera.

(2) The terminal device obtains the salient image 603 of the current image frame through the second neural network, wherein the salient image 603 of the current image frame is used to highlight the background area of the current image frame, that is, the dark part in the salient image 603 .

(3) The terminal device combines the salient image 603 of the current image frame and the depth image 602 of the current image frame 601 to determine the depth image of the background area of the current image frame (ie, the depth information of the background area of the current image frame).

(4) The terminal device can calculate the depth change rate of each pixel in the background area according to the depth image of the background area of the current image frame, and divide the background area of the current image frame into multiple sub-areas according to the size of the depth change rate .

(5) Through the optical flow method, the terminal device uses the previous image frame 605 as a reference and, in the current image frame 601, marks the motion vectors (including the motion speed and motion direction) of the corner points of each sub-area in the background area, and determines the motion speed and motion direction of each sub-area according to the motion vectors of its corner points.

(6) The terminal device determines the convolution kernel corresponding to the sub-area according to the movement speed and movement direction of the sub-area, and uses the convolution kernel corresponding to the sub-area to complete the convolution operation of the sub-area, so that the sub-area has a certain degree of blur effect. In this way, different sub-regions can have different degrees of blurring effects, so that the background region of the current image frame has a layered blurring effect, that is, a more realistic blurring effect.

The above is a detailed description of the image processing method provided by the embodiment of the present application. The model training method provided by the embodiment of the present application will be introduced below. FIG. 8 is a schematic flowchart of the model training method provided by the embodiment of the present application. The method includes:

801. Obtain the depth value of each pixel in the image frame to be trained through the first model to be trained;

802. Calculate the deviation between the depth value of each pixel in the image frame to be trained and the true depth value of each pixel in the image frame to be trained by using a preset target loss function;

803. Update the parameters of the first model to be trained according to the deviation until the model training conditions are met, and obtain the first neural network.
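The predict, measure deviation, and update cycle of steps 801 to 803 can be illustrated with a toy stand-in for the first model to be trained. Everything specific here is an assumption made only for illustration: the "model" is a single scale parameter `w` that predicts depth as `w * feature`, the target loss function is the mean squared error, the update rule is plain gradient descent, and the training condition is the loss dropping below `tolerance`. A real depth-estimation network would be far larger, but the loop has the same shape.

```python
def train_scale(features, true_depths, lr=0.01, tolerance=1e-6, max_steps=10000):
    """Fit depth = w * feature by gradient descent on the mean squared error."""
    w = 0.0
    n = len(features)
    for _ in range(max_steps):
        # 801: obtain the predicted depth values from the model to be trained.
        errors = [w * x - d for x, d in zip(features, true_depths)]
        # 802: compute the deviation with the (assumed) target loss function.
        loss = sum(e * e for e in errors) / n
        if loss < tolerance:  # training condition met
            break
        # 803: update the parameter according to the deviation.
        grad = 2.0 * sum(e * x for e, x in zip(errors, features)) / n
        w -= lr * grad
    return w


# The true relation is depth = 2 * feature, so w should approach 2.
w = train_scale([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
```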

It should be noted that, for the description of steps 801 to 803, reference may be made to the relevant description of the training process of the first neural network in the aforementioned step 401, and details are not repeated here. It can be understood that the first neural network in the aforementioned step 401 can be obtained through steps 801 to 803, and the first neural network can perform accurate monocular depth estimation on any image frame, thereby obtaining the depth values of all pixels in the image frame.

FIG. 9 is another schematic flowchart of a model training method provided by an embodiment of the present application, and the method includes:

901. Obtain a background area of an image frame to be trained by a second model to be trained;

902. Calculate the deviation between the background area of the image frame to be trained and the real background area of the image frame to be trained by using a preset target loss function;

903. Update the parameters of the second model to be trained according to the deviation until the model training conditions are met, and obtain a second neural network.

It should be noted that, for the description of steps 901 to 903, reference may be made to the relevant description of the training process of the second neural network in the foregoing step 401, and details are not repeated here. It can be understood that, through steps 901 to 903, the second neural network in the aforementioned step 401 can be obtained, and the second neural network can perform accurate salient target detection on any image frame, thereby obtaining the background area of the image frame.

The above is a detailed description of the model training method provided by the embodiment of the present application, and the image processing apparatus provided by the embodiment of the present application will be introduced below. FIG. 10 is a schematic structural diagram of an image processing apparatus provided by an embodiment of the present application. As shown in FIG. 10 , the apparatus is the aforementioned terminal equipment, and the apparatus includes:

An acquisition module 1001, configured to acquire the depth information of the background area of the current image frame;

A division module 1002, configured to divide the background area into a plurality of sub-areas according to the depth information, where the subjects corresponding to different sub-areas are at different distances from the camera, and the camera is used to capture the current image frame;

The processing module 1003 is configured to perform different degrees of blurring processing on different sub-regions to obtain a processed current image frame.

In this embodiment, after acquiring the depth information of the background area of the current image frame, the terminal device divides the background area into multiple sub-areas according to the depth information. Since the subjects corresponding to different sub-regions are at different distances from the camera, the motion conditions of different sub-regions relative to the previous image frame are also different. Therefore, the terminal device can acquire the motion vector of each sub-region, which is used to indicate the motion of the sub-region relative to the previous image frame. Since the motion conditions of different sub-areas are different, that is, the motion vectors of different sub-areas are different, the terminal device can blur each sub-area according to its motion vector, blurring different sub-regions to different degrees, so that the background region of the current image frame has a more realistic blurring effect.

In a possible implementation manner, the depth information of the background area of the current image frame includes the depth value of each pixel in the background area of the current image frame, and the dividing module 1002 is specifically configured to: determine the depth change rate of each pixel in the background area of the current image frame according to the depth value of each pixel in the background area, where the depth change rate of each pixel is determined based on the depth value of the pixel and the depth values of the remaining pixels around the pixel; and divide the background area into multiple sub-areas according to the depth change rate of each pixel and a preset change rate threshold.

In a possible implementation manner, the processing module 1003 is specifically configured to: obtain, among the multiple sub-regions, a motion vector of each sub-region, where the motion vector of each sub-region is used to indicate the motion of the sub-region relative to the previous image frame; and perform blurring processing on each sub-area according to the motion vector of the sub-area to obtain the processed current image frame.

In a possible implementation manner, the processing module 1003 is specifically configured to: for each sub-area in the plurality of sub-areas, determine the movement speed of the sub-area according to the movement speed of at least one target pixel in the sub-area from the previous image frame to the current image frame; and determine the movement direction of the sub-area according to the movement direction of the at least one target pixel from the previous image frame to the current image frame.

In a possible implementation manner, the processing module 1003 is specifically configured to: for each sub-area, construct a convolution kernel corresponding to the sub-area according to the movement speed and the movement direction of the sub-area; and perform convolution processing on the sub-area through the convolution kernel corresponding to the sub-area.

In a possible implementation manner, the obtaining module 1001 is specifically configured to: obtain the depth value of each pixel in the current image frame and the background area of the current image frame; and determine, from the depth values of all pixels in the current image frame, the depth value of each pixel in the background area of the current image frame.

In a possible implementation manner, the obtaining module 1001 is specifically configured to obtain the depth value of each pixel in the current image frame through the first neural network.

In a possible implementation manner, the camera is a depth camera, and the acquiring module 1001 is specifically configured to acquire the depth value of each pixel in the current image frame through the depth camera.

In a possible implementation manner, the acquiring module 1001 is specifically configured to acquire the background area of the current image frame through the second neural network.

The above is a detailed description of the image processing apparatus provided by the embodiments of the present application, and the model training apparatus provided by the embodiments of the present application will be introduced below. FIG. 11 is a schematic structural diagram of a model training apparatus provided by an embodiment of the application. As shown in FIG. 11 , the apparatus includes:

Obtaining module 1101, for obtaining the depth value of each pixel in the image frame to be trained through the first model to be trained;

The calculation module 1102 is used to calculate the deviation between the depth value of each pixel in the image frame to be trained and the true depth value of each pixel in the image frame to be trained through a preset target loss function;

The updating module 1103 is configured to update the parameters of the first model to be trained according to the deviation until the model training conditions are met, and the first neural network is obtained.

FIG. 12 is another schematic structural diagram of a model training device provided by an embodiment of the present application. As shown in FIG. 12 , the device includes:

The acquisition module 1201 is used for acquiring the background area of the image frame to be trained through the second to-be-trained model;

The calculation module 1202 is used to calculate the deviation between the background area of the image frame to be trained and the real background area of the image frame to be trained through a preset target loss function;

The updating module 1203 is configured to update the parameters of the second model to be trained according to the deviation until the model training conditions are met, and a second neural network is obtained.

It should be noted that the information exchange, execution process, and other contents among the modules/units of the above-mentioned apparatus are based on the same concept as the method embodiments of the present application, and the technical effects they bring are the same as those of the method embodiments of the present application. For specific content, reference may be made to the descriptions in the method embodiments shown in the foregoing embodiments of the present application, which will not be repeated here.

The embodiment of the present application also relates to an execution device, and FIG. 13 is a schematic structural diagram of the execution device provided by the embodiment of the present application. As shown in FIG. 13, the execution device 1300 may specifically be a mobile phone, a tablet, a notebook computer, a smart wearable device, a server, etc., which is not limited here. The image processing apparatus described in the embodiment corresponding to FIG. 10 may be deployed on the execution device 1300 to implement the image processing function in the embodiment corresponding to FIG. 4. Specifically, the execution device 1300 includes: a receiver 1301, a transmitter 1302, a processor 1303, and a memory 1304 (wherein the number of processors 1303 in the execution device 1300 may be one or more, and one processor is taken as an example in FIG. 13), wherein the processor 1303 may include an application processor 13031 and a communication processor 13032. In some embodiments of the present application, the receiver 1301, the transmitter 1302, the processor 1303, and the memory 1304 may be connected by a bus or otherwise.

Memory 1304 may include read-only memory and random access memory, and provides instructions and data to processor 1303 . A portion of memory 1304 may also include non-volatile random access memory (NVRAM). The memory 1304 stores processors and operating instructions, executable modules or data structures, or a subset thereof, or an extended set thereof, wherein the operating instructions may include various operating instructions for implementing various operations.

The processor 1303 controls the operation of the execution device. In a specific application, various components of the execution device are coupled together through a bus system, where the bus system may include a power bus, a control bus, a status signal bus, and the like in addition to a data bus. However, for the sake of clarity, the various buses are referred to as bus systems in the figures.

The methods disclosed in the above embodiments of the present application may be applied to the processor 1303 or implemented by the processor 1303. The processor 1303 may be an integrated circuit chip with signal processing capability. In the implementation process, each step of the above-mentioned method can be completed by an integrated logic circuit of hardware in the processor 1303 or by instructions in the form of software. The above-mentioned processor 1303 can be a general-purpose processor, a digital signal processor (DSP), a microprocessor, or a microcontroller, and may further include an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. The processor 1303 may implement or execute the methods, steps, and logical block diagrams disclosed in the embodiments of this application. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in conjunction with the embodiments of the present application may be directly embodied as being executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software modules may be located in random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, registers, or other storage media mature in the art. The storage medium is located in the memory 1304, and the processor 1303 reads the information in the memory 1304 and completes the steps of the above method in combination with its hardware.

The receiver 1301 can be used to receive input numeric or character information and to generate signal inputs related to the settings and function control of the execution device. The transmitter 1302 can be used to output numeric or character information through the first interface; the transmitter 1302 can also be used to send instructions to the disk group through the first interface to modify the data in the disk group; the transmitter 1302 can also include a display device such as a display screen.

In the embodiment of the present application, in one case, the processor 1303 is configured to execute the image processing method executed by the terminal device in the embodiment corresponding to FIG. 4.

The embodiment of the present application also relates to a training device, and FIG. 14 is a schematic structural diagram of the training device provided by the embodiment of the present application. As shown in FIG. 14, the training device 1400 is implemented by one or more servers. The training device 1400 may vary greatly depending on configuration or performance, and may include one or more central processing units (CPUs) 1414 (e.g., one or more processors), memory 1432, and one or more storage media 1430 (e.g., one or more mass storage devices) that store applications 1442 or data 1444. The memory 1432 and the storage medium 1430 may be short-term storage or persistent storage. The program stored in the storage medium 1430 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations for the training device. Further, the central processing unit 1414 may be configured to communicate with the storage medium 1430 and to execute, on the training device 1400, the series of instruction operations in the storage medium 1430.

The training device 1400 may also include one or more power supplies 1426, one or more wired or wireless network interfaces 1450, one or more input/output interfaces 1458, or one or more operating systems 1441, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and so on.

Specifically, the training device may perform the steps in the embodiment corresponding to FIG. 8 or FIG. 9.

The embodiments of the present application also relate to a computer storage medium. A program for performing signal processing is stored in the computer-readable storage medium, and when the program runs on a computer, it causes the computer to perform the steps performed by the aforementioned execution device, or causes the computer to perform the steps performed by the aforementioned training device.

The embodiments of the present application also relate to a computer program product. The computer program product stores instructions which, when executed by a computer, cause the computer to execute the steps executed by the aforementioned execution device, or cause the computer to execute the steps executed by the aforementioned training device.

The execution device, training device, or terminal device provided in this embodiment of the present application may specifically be a chip. The chip includes a processing unit and a communication unit; the processing unit may be, for example, a processor, and the communication unit may be, for example, an input/output interface, a pin, or a circuit. The processing unit can execute the computer-executable instructions stored in the storage unit, so that the chip in the execution device executes the data processing method described in the above embodiments, or the chip in the training device executes the data processing method described in the above embodiments. Optionally, the storage unit is a storage unit in the chip, such as a register or a cache; the storage unit may also be a storage unit located outside the chip in the wireless access device, such as a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM).

Specifically, please refer to FIG. 15. FIG. 15 is a schematic structural diagram of a chip provided by an embodiment of the present application. The chip may be represented as a neural network processor (NPU) 1500. The NPU 1500 is mounted as a coprocessor on the host CPU (Host CPU), and tasks are allocated by the Host CPU. The core part of the NPU is the arithmetic circuit 1503; the controller 1504 controls the arithmetic circuit 1503 to extract matrix data from memory and perform multiplication operations.

In some implementations, the arithmetic circuit 1503 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 1503 is a two-dimensional systolic array. The arithmetic circuit 1503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, arithmetic circuit 1503 is a general-purpose matrix processor.

For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 1502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit then fetches the data of matrix A from the input memory 1501, performs a matrix operation on it with matrix B, and stores the partial results or final result of the matrix in the accumulator 1508.
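The data flow in this example can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that matrix B stays in place while matrix A is consumed in slices of `tile` elements along the inner dimension, with each slice's partial products added into an accumulator. The function name `matmul_accumulate` and the `tile` parameter are hypothetical and do not describe the actual circuit.

```python
def matmul_accumulate(a, b, tile=2):
    """Compute C = A x B by accumulating partial products over slices of the inner dimension."""
    rows, inner, cols = len(a), len(b), len(b[0])
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions of A and B do not match")
    # Stands in for the accumulator 1508: holds partial results until every slice is processed.
    acc = [[0] * cols for _ in range(rows)]
    for k0 in range(0, inner, tile):
        k1 = min(k0 + tile, inner)
        for i in range(rows):
            for j in range(cols):
                # Multiply-add over the current slice of A (input data) and B (weights).
                acc[i][j] += sum(a[i][k] * b[k][j] for k in range(k0, k1))
    return acc


A = [[1, 2, 3], [4, 5, 6]]        # input matrix A
B = [[7, 8], [9, 10], [11, 12]]   # weight matrix B
C = matmul_accumulate(A, B)       # output matrix C
```

After the first slice the accumulator holds only partial results; it holds the final matrix C once the last slice has been added.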

The unified memory 1506 is used to store input data and output data. The weight data is transferred directly to the weight memory 1502 through the storage unit access controller (Direct Memory Access Controller, DMAC) 1505. The input data is also transferred to the unified memory 1506 through the DMAC.

The bus interface unit (Bus Interface Unit, BIU) 1510 is used for interaction between the AXI bus and both the DMAC and the instruction fetch buffer (Instruction Fetch Buffer, IFB) 1509.

The bus interface unit (BIU) 1510 is used by the instruction fetch buffer 1509 to obtain instructions from the external memory, and by the storage unit access controller 1505 to obtain the original data of the input matrix A or the weight matrix B from the external memory.

The DMAC is mainly used to transfer the input data in the external memory (DDR) to the unified memory 1506, the weight data to the weight memory 1502, or the input data to the input memory 1501.

The vector calculation unit 1507 includes a plurality of operation processing units and, if necessary, further processes the output of the arithmetic circuit 1503, for example through vector multiplication, vector addition, exponential operations, logarithmic operations, and size comparison. It is mainly used for network computation in layers of a neural network other than convolutional and fully connected layers, such as batch normalization, pixel-level summation, and upsampling of feature planes.

In some implementations, the vector calculation unit 1507 can store the vector of processed outputs in the unified memory 1506. For example, the vector calculation unit 1507 may apply a linear or non-linear function to the output of the arithmetic circuit 1503, for example linear interpolation of the feature planes extracted by a convolutional layer, or a non-linear function applied to a vector of accumulated values to generate activation values. In some implementations, the vector calculation unit 1507 generates normalized values, pixel-level summed values, or both. In some implementations, the vector of processed outputs can be used as an activation input to the arithmetic circuit 1503, for example for use in subsequent layers of the neural network.
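The kind of element-wise post-processing described here can be sketched in a few lines of Python. The choice of ReLU as the non-linear function and of zero-mean, unit-variance scaling as the normalization are assumptions made only for illustration; the application does not name specific functions, and `relu`, `normalize`, and `eps` are hypothetical names.

```python
import math


def relu(values):
    """Example non-linear function: clamp negative accumulated values to zero."""
    return [max(0.0, v) for v in values]


def normalize(values, eps=1e-5):
    """Example normalization: shift to zero mean and scale to roughly unit variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


accumulated = [-2.0, 0.5, 3.0, 4.5]   # stands in for a vector of accumulated values
activations = relu(accumulated)       # activation values for a subsequent layer
normalized = normalize(accumulated)   # normalized values
```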

The instruction fetch buffer 1509 connected to the controller 1504 is used to store the instructions used by the controller 1504.

The unified memory 1506, the input memory 1501, the weight memory 1502, and the instruction fetch buffer 1509 are all on-chip memories. The external memory is private to the NPU hardware architecture.

The processor mentioned anywhere above may be a general-purpose central processing unit, a microprocessor, an ASIC, or one or more integrated circuits for controlling the execution of the above program.

In addition, it should be noted that the device embodiments described above are only schematic. The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they can be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution in this embodiment. In addition, in the drawings of the device embodiments provided in the present application, the connection relationship between the modules indicates that there is a communication connection between them, which may be specifically implemented as one or more communication buses or signal lines.

From the description of the above embodiments, those skilled in the art can clearly understand that the present application can be implemented by means of software plus necessary general-purpose hardware, and of course can also be implemented by dedicated hardware such as special components. Under normal circumstances, all functions completed by a computer program can be easily implemented by corresponding hardware, and the specific hardware structures used to implement the same function can also vary, such as analog circuits, digital circuits, or special-purpose circuits. For this application, however, a software program implementation is the better implementation in many cases. Based on this understanding, the technical solutions of the present application, in essence or in the part that contributes over the prior art, can be embodied in the form of a software product. The computer software product is stored in a readable storage medium, such as a floppy disk, USB flash drive, removable hard disk, ROM, RAM, magnetic disk, or optical disk of a computer, and includes several instructions that cause a computer device (which may be a personal computer, training device, network device, or the like) to execute the methods described in the various embodiments of this application.

In the above-mentioned embodiments, it may be implemented in whole or in part by software, hardware, firmware or any combination thereof. When implemented in software, it can be implemented in whole or in part in the form of a computer program product.

The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of the present application are generated. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, training device, or data center to another website, computer, training device, or data center via wired (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be stored by a computer, or a data storage device, such as a training device or a data center, that integrates one or more available media. The available media may be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., DVD), or semiconductor media (e.g., solid state disk (SSD)), and the like.

Claims

An image processing method, characterized in that the method comprises:

acquiring depth information of a background area of a current image frame;

dividing the background area into a plurality of sub-areas according to the depth information, wherein the distances from the objects corresponding to different sub-areas to a camera are different, and the camera is used to capture the current image frame; and

performing different degrees of blurring processing on the different sub-areas to obtain a processed current image frame.
The method according to claim 1, wherein the depth information includes a depth value of each pixel in the background area of the current image frame, and dividing the background area into a plurality of sub-areas according to the depth information specifically comprises:

determining a depth change rate of each pixel in the background area of the current image frame according to the depth value of each pixel in the background area of the current image frame, wherein the depth change rate of each pixel is determined based on the depth value of the pixel and the depth values of the remaining pixels around the pixel; and

dividing the background area into a plurality of sub-areas according to the depth change rate of each pixel and a preset change rate threshold.
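As a non-limiting illustration of the division step above, the following Python sketch grows sub-areas over the background by joining 4-connected neighbouring pixels whose depth difference does not exceed a threshold. Treating the depth change rate as the absolute depth difference between adjacent pixels is an assumption made for this sketch; the names `split_by_depth`, `background`, and `threshold` are hypothetical.

```python
from collections import deque


def split_by_depth(depth, background, threshold):
    """Label background pixels so that a label never crosses a depth jump above `threshold`."""
    h, w = len(depth), len(depth[0])
    labels = [[-1] * w for _ in range(h)]   # -1 marks foreground or not-yet-visited pixels
    next_label = 0
    for sy in range(h):
        for sx in range(w):
            if not background[sy][sx] or labels[sy][sx] != -1:
                continue
            # Start a new sub-area and grow it breadth-first.
            labels[sy][sx] = next_label
            queue = deque([(sy, sx)])
            while queue:
                y, x = queue.popleft()
                for ny, nx in ((y + 1, x), (y - 1, x), (y, x + 1), (y, x - 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and background[ny][nx]
                            and labels[ny][nx] == -1
                            and abs(depth[ny][nx] - depth[y][x]) <= threshold):
                        labels[ny][nx] = next_label
                        queue.append((ny, nx))
            next_label += 1
    return labels, next_label


depth = [[1.0, 1.1, 5.0, 5.1],
         [1.0, 1.2, 5.2, 5.0]]
background = [[True, True, True, True],
              [True, True, True, False]]   # bottom-right pixel belongs to the foreground
labels, count = split_by_depth(depth, background, threshold=0.5)
```

In this toy input the near pixels (depth about 1) and the far pixels (depth about 5) end up in two separate sub-areas, and the foreground pixel keeps the label -1.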
The method according to claim 1 or 2, characterized in that performing different degrees of blurring processing on the different sub-areas to obtain the processed current image frame specifically comprises:

acquiring, in the plurality of sub-areas, a motion vector of each sub-area, wherein the motion vector of each sub-area is used to indicate the motion of the sub-area relative to the previous image frame; and

performing blurring processing on each sub-area according to the motion vector of the sub-area to obtain the processed current image frame.
The method according to claim 3, wherein the motion vector of each sub-area includes the movement speed of the sub-area and the movement direction of the sub-area, and acquiring, in the plurality of sub-areas, the motion vector of each sub-area specifically comprises:

for each sub-area in the plurality of sub-areas, determining the movement speed of the sub-area according to the movement speed of at least one target pixel in the sub-area from the previous image frame to the current image frame; and

determining the movement direction of the sub-area according to the movement direction of the at least one target pixel from the previous image frame to the current image frame.
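As a non-limiting illustration, the motion vector of one sub-area could be derived from its target pixels as in the Python sketch below. It assumes that the positions of the target pixels in the previous and current image frames are already known (for example from a tracker), and that the sub-area's motion is taken as the mean displacement of those pixels; averaging is an assumption of this sketch, and `sub_area_motion` is a hypothetical name.

```python
import math


def sub_area_motion(prev_points, curr_points):
    """Return (speed in pixels per frame, direction in radians) from matched (x, y) points."""
    if not prev_points or len(prev_points) != len(curr_points):
        raise ValueError("need the same non-zero number of points in both frames")
    n = len(prev_points)
    # Mean displacement of the target pixels between the two frames.
    dx = sum(c[0] - p[0] for p, c in zip(prev_points, curr_points)) / n
    dy = sum(c[1] - p[1] for p, c in zip(prev_points, curr_points)) / n
    speed = math.hypot(dx, dy)        # magnitude of the mean displacement
    direction = math.atan2(dy, dx)    # angle measured from the positive x axis
    return speed, direction


prev_pts = [(0.0, 0.0), (1.0, 1.0)]   # target pixels in the previous image frame
curr_pts = [(3.0, 4.0), (4.0, 5.0)]   # the same pixels in the current image frame
speed, direction = sub_area_motion(prev_pts, curr_pts)
```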
The method according to claim 4, wherein performing blurring processing on each sub-area according to the motion vector of the sub-area specifically comprises:

for each sub-area, constructing a convolution kernel corresponding to the sub-area according to the movement speed of the sub-area and the movement direction of the sub-area; and

performing convolution processing on the sub-area through the convolution kernel corresponding to the sub-area.
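As a non-limiting illustration, one common way to build such a kernel is a normalized line segment whose length follows the movement speed and whose orientation follows the movement direction. The Python sketch below makes that assumption; the names `motion_blur_kernel` and `blur_sub_area`, the rounding of speed to an odd kernel size, and the border clamping are all hypothetical choices, not details taken from this application.

```python
import math


def motion_blur_kernel(speed, angle_rad):
    """Build a normalized line kernel: length from speed, orientation from direction."""
    length = max(1, int(round(speed)))
    size = length if length % 2 == 1 else length + 1   # odd size so the kernel has a centre
    c = size // 2
    kernel = [[0.0] * size for _ in range(size)]
    for i in range(size):
        t = i - c
        x = c + int(round(t * math.cos(angle_rad)))
        y = c + int(round(t * math.sin(angle_rad)))
        kernel[y][x] = 1.0
    total = sum(map(sum, kernel))
    return [[v / total for v in row] for row in kernel]


def blur_sub_area(image, mask, kernel):
    """Filter only the pixels where mask is True; other pixels are copied unchanged."""
    h, w = len(image), len(image[0])
    c = len(kernel) // 2
    out = [row[:] for row in image]
    for y in range(h):
        for x in range(w):
            if not mask[y][x]:
                continue
            acc = 0.0
            for ky, krow in enumerate(kernel):
                for kx, weight in enumerate(krow):
                    # Clamp sample coordinates at the image border.
                    sy = min(max(y + ky - c, 0), h - 1)
                    sx = min(max(x + kx - c, 0), w - 1)
                    acc += weight * image[sy][sx]
            out[y][x] = acc
    return out


kernel = motion_blur_kernel(speed=3, angle_rad=0.0)   # horizontal motion, 3 pixels per frame
image = [[0.0, 0.0, 0.0], [0.0, 9.0, 0.0], [0.0, 0.0, 0.0]]
mask = [[True] * 3 for _ in range(3)]                 # the whole toy image is one sub-area
blurred = blur_sub_area(image, mask, kernel)
```

Because the line kernel is symmetric about its centre, the unflipped filtering used here gives the same result as a true convolution. A faster sub-area gets a longer kernel and therefore a stronger blur, which is how different sub-areas end up blurred to different degrees.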
The method according to claim 4 or 5, wherein the at least one target pixel point is a corner point.
The method according to any one of claims 4 to 6, wherein the movement speed and movement direction of the at least one target pixel point are obtained by an optical flow method.
The method according to any one of claims 3 to 7, wherein acquiring the depth information of the background area of the current image frame specifically comprises:

acquiring the depth value of each pixel in the current image frame and the background area of the current image frame; and

determining, from the depth values of all pixels in the current image frame, the depth value of each pixel in the background area of the current image frame.
The method according to claim 8, wherein acquiring the depth value of each pixel in the current image frame specifically comprises:

acquiring the depth value of each pixel in the current image frame through a first neural network.
The method according to claim 8, wherein the camera is a depth camera, and acquiring the depth value of each pixel in the current image frame specifically comprises:

acquiring the depth value of each pixel in the current image frame through the depth camera.
The method according to any one of claims 8 to 10, wherein acquiring the background area of the current image frame specifically comprises:

acquiring the background area of the current image frame through a second neural network.
An image processing device, characterized in that the device comprises:

an acquisition module, configured to acquire depth information of a background area of a current image frame;

a dividing module, configured to divide the background area into a plurality of sub-areas according to the depth information, wherein the distances from the objects corresponding to different sub-areas to a camera are different, and the camera is used to capture the current image frame; and

a processing module, configured to perform different degrees of blurring processing on the different sub-areas to obtain a processed current image frame.
The device according to claim 12, wherein the depth information includes a depth value of each pixel in the background area of the current image frame, and the dividing module is specifically configured to:

determine a depth change rate of each pixel in the background area of the current image frame according to the depth value of each pixel in the background area of the current image frame, wherein the depth change rate of each pixel is determined based on the depth value of the pixel and the depth values of the remaining pixels around the pixel; and

divide the background area into a plurality of sub-areas according to the depth change rate of each pixel and a preset change rate threshold.
The device according to claim 12 or 13, wherein the processing module is specifically configured to:

acquire, in the plurality of sub-areas, a motion vector of each sub-area, wherein the motion vector of each sub-area is used to indicate the motion of the sub-area relative to the previous image frame; and

perform blurring processing on each sub-area according to the motion vector of the sub-area to obtain the processed current image frame.
The device according to claim 14, wherein the processing module is specifically configured to:

for each sub-area in the plurality of sub-areas, determine the movement speed of the sub-area according to the movement speed of at least one target pixel in the sub-area from the previous image frame to the current image frame; and

determine the movement direction of the sub-area according to the movement direction of the at least one target pixel from the previous image frame to the current image frame.
The device according to claim 15, wherein the processing module is specifically configured to:

for each sub-area, construct a convolution kernel corresponding to the sub-area according to the movement speed of the sub-area and the movement direction of the sub-area; and

perform convolution processing on the sub-area through the convolution kernel corresponding to the sub-area.
The device according to claim 15 or 16, wherein the at least one target pixel point is a corner point.
The device according to any one of claims 15 to 17, wherein the movement speed and movement direction of the at least one target pixel point are obtained by an optical flow method.
The device according to any one of claims 15 to 18, wherein the acquisition module is specifically configured to:

acquire the depth value of each pixel in the current image frame and the background area of the current image frame; and

determine, from the depth values of all pixels in the current image frame, the depth value of each pixel in the background area of the current image frame.
The device according to claim 19, wherein the acquisition module is specifically configured to acquire the depth value of each pixel in the current image frame through a first neural network.
The device according to claim 19, wherein the camera is a depth camera, and the acquisition module is specifically configured to acquire the depth value of each pixel in the current image frame through the depth camera.
The device according to any one of claims 19 to 21, wherein the acquisition module is specifically configured to acquire the background area of the current image frame through a second neural network.
An image processing apparatus, characterized in that it includes a memory and a processor; the memory stores code, the processor is configured to execute the code, and when the code is executed, the image processing apparatus performs the method of any one of claims 1 to 11.
A computer storage medium, characterized in that the computer storage medium stores a computer program, which, when executed by a computer, causes the computer to implement the method of any one of claims 1 to 11.
A computer program product, characterized in that the computer program product stores instructions that, when executed by a computer, cause the computer to implement the method of any one of claims 1 to 11.