CN112673380A - Image processing method, device, movable platform and system

Image processing method, device, movable platform and system

Info

Publication number
CN112673380A
CN112673380A
Authority
CN
China
Prior art keywords
image
frame
roi
feature maps
determining
Prior art date
Legal status
Pending
Application number
CN202080004902.6A
Other languages
Chinese (zh)
Inventor
李恒杰
赵文军
Current Assignee
SZ DJI Technology Co Ltd
Original Assignee
SZ DJI Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by SZ DJI Technology Co Ltd
Publication of CN112673380A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/24 Aligning, centring, orientation detection or correction of the image
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

A method, an apparatus, a movable platform and a system for image processing are provided. The image processing method comprises the following steps: processing each frame of image in a plurality of frames of images by adopting a CNN structure to obtain a multi-channel feature map of each frame of image; processing the multi-channel feature maps of the multiple frames of images by adopting an RNN structure to obtain a single-channel saliency map of each frame of image; and determining the position of the region of interest of each frame of image according to the saliency map corresponding to each frame of image. The image processing method, apparatus, movable platform and system can determine the position of the ROI more accurately, reduce the system time delay and improve the real-time performance of the system.

Description

Image processing method, device, movable platform and system
Copyright declaration
The disclosure of this patent document contains material which is subject to copyright protection. The copyright is owned by the copyright owner. The copyright owner has no objection to the facsimile reproduction by anyone of the patent document or the patent disclosure as it appears in the official records of the Patent and Trademark Office.
Technical Field
The present application relates to the field of image processing, and in particular, to a method, an apparatus, a movable platform, and a system for image processing.
Background
In image transmission applications, captured images and videos are usually transmitted in real time, which requires a large bandwidth. In order to reduce the occupation of image transmission resources, the image can be blurred in a filtering manner. For example, the original values of the pixel points may be maintained for a region of interest (ROI) in the image, and the high frequency information may be reduced by mean filtering or gaussian filtering based on the same or different filtering radii for other regions.
In the above scheme, if the ROI position cannot be determined in a timely and accurate manner, the viewer at the ground end will not observe a high-quality image but a blurred one. Therefore, how to accurately determine the ROI position is a problem to be solved.
Disclosure of Invention
The application provides an image processing method, an image processing device, a movable platform and an image processing system, which can more accurately determine the position of an ROI, reduce the time delay of the system and improve the real-time performance of the system.
In a first aspect, a method for image processing is provided, including: processing each frame of image in a plurality of frames of images by adopting a Convolutional Neural Network (CNN) structure to obtain a multichannel characteristic diagram of each frame of image; processing the multi-channel feature map of the multi-frame image by adopting a Recurrent Neural Network (RNN) structure to obtain a single-channel saliency map of each frame image; and determining the position of a region of interest (ROI) of each frame of image according to the corresponding saliency map of each frame of image.
In a second aspect, an apparatus for image processing is provided, which is configured to perform the method of the first aspect or any possible implementation manner of the first aspect. In particular, the apparatus comprises means for performing the method of the first aspect described above or any possible implementation manner of the first aspect.
In a third aspect, an apparatus for image processing is provided, including: a memory for storing instructions and a processor for executing the instructions stored in the memory; when the processor executes the instructions stored in the memory, the execution causes the processor to perform the method of the first aspect or any possible implementation of the first aspect.
In a fourth aspect, there is provided a computer readable medium for storing a computer program comprising instructions for carrying out the method of the first aspect or any possible implementation manner of the first aspect.
In a fifth aspect, there is provided a computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of image processing of the first aspect or any possible implementation form of the first aspect. In particular, the computer program product may be run on an apparatus for image processing of the third aspect described above.
In a sixth aspect, there is provided a movable platform comprising: a body; a power system arranged in the body and used for providing power for the movable platform; and one or more processors configured to perform the method of image processing of the first aspect or any possible implementation manner of the first aspect.
In a seventh aspect, a system is provided, which includes the movable platform of the above sixth aspect and a display device, wherein the movable platform is connected with the display device by wire or wirelessly.
Drawings
Fig. 1 is a schematic block diagram of an image processing system.
Fig. 2 is a schematic diagram of the ROI position in the image in the embodiment of the present application.
Fig. 3 is a schematic diagram of the data flow in the image processing system of fig. 1.
Fig. 4 is a schematic block diagram of an image processing system of an embodiment of the present application.
Fig. 5 is a schematic flowchart of a method of image processing according to an embodiment of the present application.
Fig. 6 is a schematic diagram of a method flow of image processing according to an embodiment of the present application.
Fig. 7 is a schematic diagram of processing an image by using a CNN structure according to an embodiment of the present application.
FIGS. 8-9 are general schematic diagrams of RNN structures.
Fig. 10 is a general schematic of the structure of LSTM.
Fig. 11 is a schematic diagram of processing an image by using an RNN structure according to an embodiment of the present application.
Fig. 12 is a schematic diagram of a saliency map with a concentrated heat distribution according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a saliency map with a dispersed heat distribution according to an embodiment of the present application.
Fig. 14 is a schematic block diagram of an image processing apparatus according to an embodiment of the present application.
Fig. 15 is a schematic block diagram of another image processing apparatus according to an embodiment of the present application.
FIG. 16 is a schematic block diagram of a movable platform of an embodiment of the present application.
FIG. 17 is a schematic view of an unmanned aerial vehicle system of an embodiment of the application.
Detailed Description
The technical solutions in the embodiments of the present application will be described below with reference to the accompanying drawings.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application.
In image transmission applications, captured images and videos are usually transmitted in real time, which requires a large bandwidth. In order to reduce the occupation of image transmission resources, the image can be blurred in a filtering manner. For example, the original values of the pixel points may be maintained for a region of interest (ROI) in the image, and the high frequency information may be reduced by mean filtering or gaussian filtering based on the same or different filtering radii for other regions.
In the above scheme, an eye tracking device and method are generally adopted to determine the center position of the ROI of each frame in real time, so that a progressive filtering scheme can be applied outside the ROI according to the position and size of the ROI, realizing a smooth blurring effect and substantial bandwidth compression. Specifically, fig. 1 shows a schematic block diagram of an image processing system 100. As shown in fig. 1, the system 100 may include: a lens capturing module 110, an Image Signal Processing (ISP) module 120, a blurring and sharpening (Blur/Sharpen) module 130, an encoding and transmitting module 140, a receiving and decoding module 150, and an eye tracking module 160.
As shown in fig. 1, the image captured by the lens capturing module 110 is first processed by the ISP module 120. Specifically, the image collected by the lens capturing module 110 may be converted into an electrical signal by a photosensitive element and then transmitted to the ISP module 120 for processing, so as to be converted into a digital image. The lens capturing module 110 may refer to a camera; correspondingly, the image captured by the lens capturing module 110 may refer to one or more frames of images captured by the camera through photographing, or may refer to a video image captured through video recording, which is not limited in the present application.
The image signal output by the ISP module 120 is input to the blurring and sharpening module 130 together with the position information of the ROI, so that different regions in the video image can be processed accordingly. Specifically, in machine vision and image processing, the region to be processed is outlined in the processed image in the form of a box, circle, ellipse, irregular polygon, etc., and is called the ROI. The ISP module 120 may refer to a processor or a processing circuit, and the embodiment of the present application is not limited thereto.
The blurring and sharpening module 130 performs Sharpen and Blur operations inside and outside the ROI according to the position of the ROI in combination with the bandwidth requirements, so that the image and video quality inside the ROI is improved and the bandwidth occupied during image transmission is greatly reduced. For example, fig. 2 is a schematic diagram of the location of an ROI in an image. As shown in fig. 2, the image is divided into several layers from the inside out, with the Sharp region inside and Blur regions outside, in order to realize a progressive blurring effect. That is, the Sharpen operation is performed inside the ROI, so that image details are enhanced and image quality is improved, while the outside of the ROI is gradually blurred from the inside out, achieving a smooth blurring effect.
The blurring sharpening module 130 sends the processed image to the coding and transmission module 140 for coding, and transmits the coded image from the unmanned aerial vehicle end to the ground end; at the ground end, the receiving and decoding module 150 receives and processes the encoded data and displays the processed data to the user, so that the user can observe a local high-quality image, and meanwhile, the eye tracking module 160 obtains a region of interest, i.e., an ROI, according to the movement of human eyes, so that the ROI position information is transmitted to the blurring sharpening module 130. The encoding and transmitting module 140 may include an encoder and may further include a transmitter; the receiving and decoding module 150 may include a decoder and may also include a receiver, but the embodiment of the present application is not limited thereto.
In the system 100 shown in fig. 1, since the blurring and sharpening module 130 needs to determine the application range of its algorithm according to the ROI position, the ROI position in the system 100 is provided by the devices and algorithms included in the eye tracking module 160. Specifically, fig. 3 is a schematic diagram of the data flow in the system 100. As can be seen from the data flow shown in fig. 3, there are multiple time delays in the system 100, for example, the image round-trip delay (i.e., the encoding, transmitting, decoding and uplink feedback processes included in D1), the eye tracking algorithm delay (i.e., D2 and D3), and the Blur and Sharpen processing algorithm delay (i.e., D4). Because the eye tracking module 160 is located at the ground end, the distance between the unmanned aerial vehicle and the ground end is long and the transmission time is long, so the time delays D1, D2 and D3 involving the unmanned aerial vehicle are large, and the time delay of the system 100 is therefore also high. Moreover, due to the existence of this delay, the position of the ROI may be misaligned, so that the viewer at the ground end cannot observe a high-quality image but observes a blurred image.
To address the above problems, the present application provides a method and a system for image processing that accurately predict the center position of the ROI based on a deep-learning visual attention prediction scheme, thereby avoiding the above delay processes and improving the real-time performance of the whole system; meanwhile, the eye tracking device can be omitted, improving practicability and portability.
Fig. 4 shows a schematic block diagram of an image processing system 200 of an embodiment of the present application. As shown in fig. 4, the system 200 includes a lens collection module 210, an ISP module 220, a visual saliency prediction module 230, a blurring and sharpening (Blur/Sharpen) module 240, an encoding and transmission module 250, and a receiving and decoding module 260.
Similar to the system 100 shown in fig. 1, as shown in fig. 4, in the system 200 the image captured by the lens collection module 210 is first processed by the ISP module 220. The image signal output after processing by the ISP module 220 is input to the blurring and sharpening module 240 together with the position information of the ROI, so that different regions in the video image can be processed accordingly.
However, unlike the system 100 shown in fig. 1, the ROI position information in the system 200 is determined by the visual saliency prediction module 230 based on the result of the deep learning, i.e., the position of the ROI, for example, the ROI center position, is output by the visual saliency prediction module 230 as a parameter of the blurring sharpening module 240. Then, similar to the system 100 shown in fig. 1, the data is encoded by the encoding and transmitting module 250 and then transmitted to the ground end; and processed by the receiving and decoding module 260 for display to the user.
It should be understood that the description of the lens capturing module 110 also applies to the lens collection module 210; for example, the lens collection module 210 may be a camera. The description of the ISP module 120 also applies to the ISP module 220; for example, the ISP module 220 may be a processing circuit. The description of the blurring and sharpening module 130 also applies to the blurring and sharpening module 240. The description of the encoding and transmitting module 140 also applies to the encoding and transmission module 250, and the description of the receiving and decoding module 150 also applies to the receiving and decoding module 260; for example, the encoding and transmission module 250 may include an encoder and a transmitter, and the receiving and decoding module 260 may include a receiver and a decoder. For brevity, details are not repeated here.
Therefore, the system 200 of the embodiment of the present application determines the ROI by the video saliency prediction module based on deep learning, and can predict the visual attention area of the video in real time from end to end without manually designing features and performing complex calculation, thereby avoiding the delay problem caused by the eyeball tracking scheme. In an embodiment, the video saliency prediction module is a processing circuit.
The process by which the visual saliency prediction module 230 determines the ROI is described in detail below.
Fig. 5 shows a schematic flow diagram of a method 300 of image processing of an embodiment of the present application. Alternatively, the method 300 may be performed by an image processing device, for example, the image processing device may be the above-mentioned visual saliency prediction module 230, but the embodiment of the present application is not limited thereto.
As shown in fig. 5, the method 300 includes: s310, processing each frame of image in a multi-frame image by adopting a Convolutional Neural Network (CNN) structure to obtain a multi-channel feature map (feature-map) of each frame of image; s320, processing the multi-channel feature map of the multi-frame image by adopting a Recurrent Neural Network (RNN) structure to obtain a single-channel saliency map of each frame of image; s330, determining the ROI position of each frame of image according to the saliency map corresponding to each frame of image.
Deep learning originated from the study of neural networks. In 1943, in order to understand the working principle of the brain and further design artificial intelligence, a simplified model of brain cells, the neuron, was first proposed. Neural networks were subsequently proposed and developed rapidly on this basis. With the proposal of various network structures such as the Convolutional Neural Network (CNN), the number of network parameters has been greatly reduced and the training speed and precision have been obviously improved, so that neural networks have developed rapidly and been widely applied in the image field.
For example, human beings can quickly select key parts of the visual field and selectively allocate visual processing resources to the visually salient areas. Accordingly, in the field of computer vision, the attention mechanism for understanding and simulating the human visual system has attracted great attention from academia and shows broad application prospects. In recent years, with the enhancement of computing power and the establishment of large-scale saliency detection data sets, deep learning techniques have gradually become the main means for computing and modeling the visual attention mechanism.
In the field of saliency detection, there is much work on how to simulate the visual attention mechanism of human beings when viewing images. However, the current visual saliency prediction scheme based on deep learning mainly focuses on the field of static images, and the main reason why the research on video sequences is less is that the extraction of inter-frame motion information of the video sequences requires the design of features and a large amount of calculation, so that the progress of video saliency prediction is slow. That is, there is relatively little research currently on how humans allocate visual attention in dynamic scenes, but dynamic visual attention mechanisms are more common and important in human daily behavior.
The models used in research on dynamic human eye fixation detection detect visual attention in a dynamic scene by combining static saliency features with temporal information (such as an optical flow field, temporal difference, etc.); most of this work can be regarded as an extension of existing static saliency models that additionally considers motion information. These models rely heavily on feature engineering, and thus their performance is limited by the manually designed features.
The image processing method according to the embodiment of the present application combines the CNN and the RNN in consideration of the different characteristics of different network structures. Specifically, the main characteristic of the CNN is its very strong ability to extract high-dimensional features; it is widely applied in the field of image vision, such as target detection and face recognition, and has achieved great success in practice. Compared with traditional detection algorithms, deep learning algorithms such as the CNN do not need manually selected features; instead, features are extracted by learning and training a network, and the extracted features are then used to generate the subsequent decision result, thereby realizing functions such as classification and recognition.
The main characteristic of the RNN is its ability to mine the temporal information in data and to deeply represent semantic information; it has achieved breakthroughs in speech recognition, language models, machine translation, time series analysis and the like. That is, a recurrent neural network is used to process and predict sequence data.
Therefore, in the image processing method according to the embodiment of the present application, the acquired video sequence can be processed by using the powerful feature extraction capability of the CNN on the image and matching the processing capability of the RNN on the time sequence, so as to obtain the saliency region of each frame of image, and further determine the position of the ROI. Therefore, time delay caused by an eyeball tracking module at the ground end, a graph transmission process and the like in the conventional system can be avoided, and the real-time performance of the original system is greatly improved; the practicability and the portability of the system are improved.
The method 300 of the present embodiment will be described in detail below with reference to fig. 6-11.
Fig. 6 shows a schematic diagram of a flow of a method 300 of image processing. As shown in fig. 6, a multi-frame image to be processed, which may refer to a multi-frame image in any video data to be processed, is first acquired. Specifically, the video data to be processed may refer to data output by the ISP module 220 in the system 200 shown in fig. 4, for example, a video sequence in a yuv format output by the ISP module 220. For example, the video data to be processed may include T frame images, i.e., 1 st frame image to T frame image in fig. 6, where the T frame image represents any one of the frame images.
Each frame of image to be processed is input into the CNN structure, the multiple frames of images are processed by the CNN structure, and the multi-channel feature map of each frame of image is correspondingly output. The multi-channel feature map is then input into the RNN structure; for example, fig. 6 takes a Long Short-Term Memory (LSTM) structure as an example, and a processed multi-channel feature map of each frame of image is correspondingly output. Finally, a saliency map of each frame of image is obtained after merging, and the position of the ROI in each frame of image can be obtained from the saliency map.
The forward propagation process of the CNN structure is first described below: that is, in S310, a CNN structure is adopted to process each frame of image in multiple frames of images to obtain a multi-channel feature map of each frame of image. Specifically, the S310 may include: s311, performing continuous convolution and/or pooling operation on each frame of image to obtain a plurality of spatial feature maps of each frame of image, wherein the plurality of spatial feature maps have different resolutions; s312, performing deconvolution and/or convolution operations on each of the plurality of spatial feature maps to obtain a plurality of single-channel feature maps of each frame of image, where the plurality of single-channel feature maps have the same resolution; s313, the multiple single-channel feature maps of each frame of image are combined into a multi-channel feature map of each frame of image.
In the embodiment of the present application, fig. 7 shows a schematic diagram of processing an image by using the CNN structure. As shown in fig. 7, in S311, for each frame of video image (e.g., the image 400 in fig. 7), the spatial feature maps of each frame of video image (e.g., the images obtained after operations 430-450 in fig. 7) are extracted through successive convolution and pooling operations (e.g., operations 410-450 in fig. 7), where the spatial feature maps have different resolutions.
Specifically, in the convolution and pooling part of the CNN structure, the neural network structure may be selected reasonably according to the needs of practical applications. For example, the present application takes the network structure of the convolutional layer part of a pre-trained VGG-16 neural network as an example, since the VGG-16 neural network can extract the spatial features of each frame of image well, but the present application is not limited thereto. For example, in addition to the convolution portion of the VGG-16 network, convolution blocks of other deep convolutional neural networks of the same level, such as ResNet and GoogLeNet, may be selected instead, depending on the particular problem.
VGG-Net is a deep convolutional neural network developed by the Visual Geometry Group at the University of Oxford. VGG-Net explores the relationship between the depth of a convolutional neural network and its performance. By successfully constructing deep convolutional neural networks with 16 to 19 layers, it proved that increasing the network depth can, to a certain extent, affect the final performance of the network and significantly reduce the error rate; at the same time, it has strong extensibility and transfers very well to other image data. Currently, VGG is widely used to extract image features.
In the embodiment of the present application, as shown in fig. 7, the convolution and pooling part of the CNN structure may use the network structure of the convolutional layer part of a pre-trained VGG-16 neural network, which includes 13 convolutional layers in 5 groups, to obtain multiple spatial feature maps of each frame of image. For example, in the embodiment of the present application, three spatial feature maps with different resolutions are obtained for each frame of image.
Specifically, taking an image 400 input at any frame as an example, assume that the resolution of the image 400 is w × h; the image 400 may have multiple channels, for example, it may be any frame of a video sequence in the yuv format output by the ISP module 220 shown in fig. 4 and thus has three channels. The image 400 undergoes a series of convolution and/or pooling operations; for example, as shown in fig. 7, it may pass through operations 410-450 in turn. Operation 410 includes two convolutions, and the resolution is still w × h. Operation 420 includes two convolutions and one pooling, where the pooling reduces the resolution to (w/2) × (h/2). Operation 430 includes three convolutions and one pooling, where the pooling reduces the resolution to (w/4) × (h/4); a spatial feature map is output after operation 430. The image output by operation 430 passes through operation 440, which includes three convolutions and one pooling, where the pooling reduces the resolution to (w/8) × (h/8); a second spatial feature map is output after operation 440. The image output by operation 440 passes through operation 450, which includes three convolutions and one pooling, where the pooling reduces the resolution to (w/16) × (h/16); the last spatial feature map is output after operation 450. Three spatial feature maps with different resolutions are thus output through operations 410-450, and their channel numbers are 256, 512 and 512, respectively. In addition, the numbers of channels of the feature maps output by the 5 groups of convolution operations (410-450) applied to the input image 400 are 64, 128, 256, 512 and 512, respectively.
It should be understood that the three resolutions obtained here, (w/4) × (h/4), (w/8) × (h/8) and (w/16) × (h/16), are only an example; the number and the resolution of the spatial feature maps obtained in the embodiment of the present application may be set according to practical applications. For example, feature maps with other resolutions may be selected, or a greater or lesser number of spatial feature maps may be obtained.
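To make the structure of S311 concrete, the following is a minimal sketch of the VGG-16-style convolution and pooling backbone described above, written in Python assuming PyTorch; the class and function names are illustrative, and the sketch is not the exact network of the embodiment.

```python
import torch
import torch.nn as nn

def conv_group(in_ch, out_ch, n_convs, pool):
    """A group of 3x3 convolutions, optionally followed by 2x2 max pooling."""
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))  # halves the resolution
    return nn.Sequential(*layers)

class SpatialBackbone(nn.Module):
    """13 convolutional layers in 5 groups (operations 410-450); the outputs of the
    last three groups (256, 512 and 512 channels) are kept as the spatial feature maps."""
    def __init__(self):
        super().__init__()
        self.g1 = conv_group(3,   64, 2, pool=False)  # w x h
        self.g2 = conv_group(64, 128, 2, pool=True)   # w/2 x h/2
        self.g3 = conv_group(128, 256, 3, pool=True)  # w/4 x h/4   -> tap
        self.g4 = conv_group(256, 512, 3, pool=True)  # w/8 x h/8   -> tap
        self.g5 = conv_group(512, 512, 3, pool=True)  # w/16 x h/16 -> tap

    def forward(self, frame):                 # frame: (N, 3, h, w)
        x = self.g2(self.g1(frame))
        f3 = self.g3(x)
        f4 = self.g4(f3)
        f5 = self.g5(f4)
        return f3, f4, f5

# example: a single 3-channel frame yields maps at 1/4, 1/8 and 1/16 of the input resolution
f3, f4, f5 = SpatialBackbone()(torch.randn(1, 3, 192, 256))
```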
In S312, each of the spatial feature maps with different resolutions obtained after S311 is subjected to deconvolution and/or convolution operations to obtain multiple single-channel feature maps of each frame of image, where the multiple single-channel feature maps have the same resolution. Specifically, a deconvolution operation may be performed on each of the obtained spatial feature maps to obtain multiple feature maps with the same resolution, and a convolution operation is then performed on each of these feature maps to obtain multiple single-channel feature maps (e.g., the images respectively obtained after operations 460-480 in fig. 7).
It should be understood that the resolutions of the five spatial feature maps output by the 5 groups of convolutions (5 convolution blocks) included in S311 are 1, 1/2, 1/4, 1/8 and 1/16 times the resolution of the input video image 400, respectively. In order to obtain a pixel-wise saliency map, the feature maps obtained in S311 need to be upsampled, and the deconvolution layers are provided to increase the resolution through upsampling.
Specifically, taking the three spatial feature maps with different resolutions obtained in S311 as an example, three deconvolution modules (corresponding to operations 460-480 in fig. 7) need to be arranged to respectively upsample the three spatial feature maps output by the last three groups of convolution modules in S311. The reason for selecting the last three groups of convolution modules is that fusing the features of the higher-level layers yields richer spatial features and improves the accuracy of the final saliency prediction.
In the embodiment of the present application, the deconvolution step size is set to 2 as an example, which means that each deconvolution layer doubles the resolution. Since the resolutions of the three spatial feature maps output in S311 are (w/4) × (h/4), (w/8) × (h/8) and (w/16) × (h/16), respectively, the three subsequent deconvolution modules comprise 2, 3 and 4 deconvolution layers, respectively, so as to obtain multiple feature maps with a resolution of w × h. Meanwhile, each of these feature maps with the same resolution w × h is still a multi-channel feature map at this point, so a 1 × 1 convolutional layer is finally attached to each deconvolution module to fuse the channels of each feature map and output multiple single-channel feature maps, which can greatly reduce the data volume and the amount of calculation in the subsequent modules.
Specifically, in operation 460, the feature map with resolution (w/4) × (h/4) output after operation 430 is deconvoluted twice and finally passed through a 1 × 1 convolutional layer to obtain a single-channel feature map with resolution w × h, where the 2 deconvolution layers and the 1 × 1 convolution included in operation 460 output feature maps with 64, 32 and 1 channels, respectively. In operation 470, the feature map with resolution (w/8) × (h/8) output after operation 440 is deconvoluted three times and followed by a 1 × 1 convolutional layer to obtain another single-channel feature map with resolution w × h, where the 3 deconvolution layers and the 1 × 1 convolution included in operation 470 output feature maps with 128, 64, 32 and 1 channels, respectively. In operation 480, the feature map with resolution (w/16) × (h/16) output after operation 450 is deconvoluted four times and followed by a 1 × 1 convolutional layer to obtain another single-channel feature map with resolution w × h, where the 4 deconvolution layers and the 1 × 1 convolution included in operation 480 output feature maps with 256, 128, 64, 32 and 1 channels, respectively.
It should be understood that fig. 7 only illustrates the case of obtaining three single-channel feature maps, but the embodiment of the present application is not limited to this, for example, a greater or lesser number of single-channel feature maps may be provided according to practical applications.
In S313, the multiple single-channel feature maps of each frame of image obtained after S311 and S312 are combined into a multi-channel feature map of each frame of image (e.g., operation 490 in fig. 7). For example, as shown in fig. 7, the three single-channel feature maps obtained after operations 460-480 are performed may be combined into a three-channel feature map.
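Continuing the same assumption (Python with PyTorch, illustrative names), the deconvolution heads of S312 and the merge of S313 described above could be sketched as follows; the channel counts follow the text, while the transposed-convolution kernel size is an assumption chosen so that each layer exactly doubles the resolution.

```python
import torch
import torch.nn as nn

def deconv_head(in_ch, channels):
    """A stack of stride-2 transposed convolutions (each doubling the resolution)
    followed by a 1x1 convolution that fuses the channels into one map."""
    layers, prev = [], in_ch
    for ch in channels:
        layers += [nn.ConvTranspose2d(prev, ch, kernel_size=4, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
        prev = ch
    layers.append(nn.Conv2d(prev, 1, kernel_size=1))
    return nn.Sequential(*layers)

class DeconvMerge(nn.Module):
    """Operations 460-490: three heads upsample the three spatial feature maps back
    to w x h and the resulting single-channel maps are concatenated."""
    def __init__(self):
        super().__init__()
        self.h460 = deconv_head(256, [64, 32])            # from w/4,  2 deconv layers
        self.h470 = deconv_head(512, [128, 64, 32])       # from w/8,  3 deconv layers
        self.h480 = deconv_head(512, [256, 128, 64, 32])  # from w/16, 4 deconv layers

    def forward(self, f3, f4, f5):
        # three single-channel w x h maps combined into one three-channel map
        return torch.cat([self.h460(f3), self.h470(f4), self.h480(f5)], dim=1)
```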
It should be understood that the multi-channel feature map of each frame of image obtained according to S311-S313 described above is taken as the input to the next RNN structure. For example, as shown in fig. 7, the final operation 490 combines the outputs of the 3 deconvolution modules to produce a three-channel feature map, which is both the output of the CNN structure and the input to the RNN structure at each moment. Specifically, in S320, the RNN structure is adopted to process the multi-channel feature maps of the multiple frames of images to obtain the single-channel saliency map of each frame of image.
The general form of the RNN is described below in conjunction with fig. 8 and fig. 9. As shown in fig. 8, for a general RNN structure, x_t is the input at the current time t, h_{t-1} represents the state input received from the previous node, y_t is the output at the current time, and h_t is the state output passed to the next node. When the input is a sequence, for example a sequence of images of an input video, the expanded form of the RNN shown in fig. 9 can be obtained.
To address the problems of gradient vanishing and gradient explosion in long-sequence training, the RNN structure used in the embodiment of the present application is exemplified by the LSTM, but the embodiment of the present application is not limited thereto. Simply stated, the LSTM is a special RNN that can perform better on longer sequences than an ordinary RNN.
The general form of the LSTM is shown in fig. 10. Compared with the RNN, which has only one transfer state h_t, the LSTM has two transfer states: a cell state c_t and a hidden state h_t. Specifically, the LSTM contains three gating signals: an input gate, a forget gate and an output gate. The input gate decides, according to x_t and h_{t-1}, which information to add to the state c_{t-1} so as to generate a new state c_t; the function of the forget gate is to let the recurrent neural network forget the information in c_{t-1} that is no longer useful; and the output gate determines the output h_t at the current moment according to the latest state c_t, the output h_{t-1} of the previous moment and the current input x_t.
The process of forward propagation of the LSTM and the equations for the individual gating signals are defined as follows:
z = tanh(W_z · [h_{t-1}, x_t])        (input value)
i = sigmoid(W_i · [h_{t-1}, x_t])     (input gate)
f = sigmoid(W_f · [h_{t-1}, x_t])     (forget gate)
o = sigmoid(W_o · [h_{t-1}, x_t])     (output gate)
c_t = f · c_{t-1} + i · z             (new state)
h_t = o · tanh(c_t)                   (output)
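The gating equations above can be restated directly as code; the following is a minimal sketch in Python with PyTorch tensors, omitting bias terms just as the equations do.

```python
import torch

def lstm_step(x_t, h_prev, c_prev, W_z, W_i, W_f, W_o):
    """One LSTM time step following the equations above (bias terms omitted).
    Each weight matrix has shape (hidden_size, hidden_size + input_size)."""
    hx = torch.cat([h_prev, x_t], dim=-1)   # [h_{t-1}, x_t]
    z = torch.tanh(hx @ W_z.T)              # input value
    i = torch.sigmoid(hx @ W_i.T)           # input gate
    f = torch.sigmoid(hx @ W_f.T)           # forget gate
    o = torch.sigmoid(hx @ W_o.T)           # output gate
    c_t = f * c_prev + i * z                # new cell state
    h_t = o * torch.tanh(c_t)               # output / hidden state
    return h_t, c_t
```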
In the embodiment of the present application, as shown in fig. 6, for each frame of image in a video, a multi-channel feature map is obtained after it is input to the CNN structure. The multi-channel feature maps corresponding to each frame of image are then sequentially input to the LSTM at the corresponding time to obtain a processed multi-channel output, and finally a 1 × 1 convolutional layer is applied to the multi-channel output to obtain the single-channel saliency map of each frame of image. Taking the t-th frame image at the t-th moment as an example, where t may be any positive integer, the multi-channel feature map of the t-th frame image obtained after the CNN structure is input into the LSTM structure at the t-th moment; combined with the cell state c_{t-1} and the hidden state h_{t-1} output at the (t-1)-th moment, the processed multi-channel feature map corresponding to the t-th frame image is output, and the cell state c_t and the hidden state h_t at the t-th moment are also output.
Specifically, as shown in fig. 11, taking the three-channel feature map corresponding to the image 400 output by operation 490 in fig. 7 as an example (this three-channel feature map is the input 500 in fig. 11), the three-channel feature map 500 corresponding to each frame of image is sequentially input to the LSTM at the corresponding time. After passing through the LSTM structure, a three-channel feature map 510 is output, whose resolution is still w × h; finally, a 1 × 1 convolutional layer is applied to obtain the single-channel saliency map 520 of each frame of image.
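Since the recurrence described above preserves the w × h resolution of the feature maps, one way to realize it is with a convolutional LSTM cell followed by a 1 × 1 convolution. The sketch below (Python, assuming PyTorch) is illustrative rather than the exact structure of fig. 11; in particular, the sigmoid on the output is an added assumption so that the saliency values fall in [0, 1].

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """A convolutional LSTM cell: the four gate pre-activations are produced by one
    convolution over the concatenation [h_{t-1}, x_t], so the w x h resolution is kept."""
    def __init__(self, in_ch, hid_ch):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, kernel_size=3, padding=1)

    def forward(self, x_t, h_prev, c_prev):
        z, i, f, o = torch.chunk(self.gates(torch.cat([h_prev, x_t], dim=1)), 4, dim=1)
        c_t = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(z)
        h_t = torch.sigmoid(o) * torch.tanh(c_t)
        return h_t, c_t

class TemporalSaliency(nn.Module):
    """Feeds the per-frame three-channel feature maps through the cell in time order
    and applies a 1x1 convolution to obtain a single-channel map per frame."""
    def __init__(self, in_ch=3, hid_ch=3):
        super().__init__()
        self.cell = ConvLSTMCell(in_ch, hid_ch)
        self.to_saliency = nn.Conv2d(hid_ch, 1, kernel_size=1)

    def forward(self, frames):                       # frames: (T, C, h, w)
        T, _, H, W = frames.shape
        h = frames.new_zeros(1, self.cell.hid_ch, H, W)
        c = frames.new_zeros(1, self.cell.hid_ch, H, W)
        saliency = []
        for t in range(T):
            h, c = self.cell(frames[t:t + 1], h, c)
            saliency.append(torch.sigmoid(self.to_saliency(h)))  # values in [0, 1]
        return torch.cat(saliency)                   # (T, 1, h, w)
```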
It should be understood that the size of the recurrent layer of the LSTM in the embodiment of the present application may be set according to practical applications and may be set to any value. For example, the size of the recurrent layer of the LSTM may be set to 10, or the sequence length of the training data may be 10. During training, 10 consecutive frames of images are input in each iteration; the spatial features are extracted through the CNN structure, the temporal features are extracted through the LSTM, and finally the spatio-temporal features are combined to generate the saliency map of the video sequence.
Alternatively, if the visual attention prediction module in the embodiment of the present application processes data in the YUV format, a video saliency calibration data set in the YUV format should be selected as the training data. For example, the training data used may be the Simon Fraser University (SFU) human eye tracking public video dataset, which is a calibrated human-eye-saliency video data set in the YUV format. The training set, validation set and test set may be divided in a ratio of 8:1:1.
It should be understood that data in the YUV format is described as an example in the embodiment of the present application, and the YUV format may include YUV444, YUV422, YUV420, and the like. In the YUV422 and YUV420 formats, there is a down-sampling operation on the UV components, which results in inconsistent resolutions of the data in the channels. In this case, an up-sampling operation can be performed on the two UV channels so that the resolutions of the three YUV channels are the same. For example, bilinear interpolation can be selected as the up-sampling method, and the three channels kept at the same resolution through the up-sampling process are used as the input of the visual attention prediction module network. Alternatively, the Y channel can be down-sampled to unify the three channels to the resolution of the UV channels, which also solves the problem of different resolutions of the three YUV channels.
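As an illustration of the up-sampling option described above, the following hedged sketch (Python, assuming PyTorch) aligns the U and V planes of a YUV420 frame to the resolution of the Y plane by bilinear interpolation; the function name and tensor layout are assumptions.

```python
import torch
import torch.nn.functional as F

def align_yuv420(y, u, v):
    """Upsamples the quarter-resolution U and V planes of a YUV420 frame to the
    resolution of the Y plane with bilinear interpolation and stacks the result
    into a 3 x H x W tensor that can be fed to the prediction network."""
    target = y.shape[-2:]                                   # (H, W) of the luma plane
    def up(plane):
        return F.interpolate(plane[None, None].float(), size=target,
                             mode="bilinear", align_corners=False)[0, 0]
    return torch.stack([y.float(), up(u), up(v)])

# example with a 192 x 256 frame: the U and V planes are 96 x 128 in YUV420
frame = align_yuv420(torch.rand(192, 256), torch.rand(96, 128), torch.rand(96, 128))
```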
Optionally, to further improve the real-time performance of the system and reduce the system delay, the following alternatives may be adopted: (1) during training and use of the network, the input image is downsampled (downsampling generally does not affect the distribution and motion information of objects in the scene), so that the amount of data calculated by the network is greatly reduced and the speed is improved; (2) for the YUV video format adopted in the embodiment of the present application, considering that the luminance information represented by the Y channel contains most object category and motion information and that human eyes are most sensitive to luminance information, training and prediction can be performed only on the Y channel, reducing the amount of data and improving real-time performance. If both real-time improvement measures are adopted, a delay of only 1 frame can be achieved, greatly improving the real-time performance of the system and the interactive experience.
In the embodiment of the present application, through the processing of the CNN structure and the RNN structure, for the obtained saliency map of each frame of image, in S330, the ROI position of each frame of image is determined according to the saliency map corresponding to each frame of image.
Specifically, the values at all positions in the saliency map range from 0 to 1; each value represents a predicted degree of human visual attention to that area, and the larger (brighter) the value, the higher the possibility that human eyes pay attention to that position. The degree of concentration of the heat distribution (i.e., the degree of attention of human eyes) differs according to the category information and the motion information of the objects in different scenes. For example, the heat in fig. 12 is concentrated, and the object category and motion information are evident; while the heat in fig. 13 is dispersed, there are no obvious objects or motions in the scene, and the picture is relatively flat.
In view of the above analysis of the heat distribution of the saliency maps, the ROI position of each saliency map may be determined in different ways, where the ROI position may include the center position of the ROI and the range of the ROI. Specifically, various ways may be employed to determine the center position of the ROI. For example, determining the center position of the ROI may include: outputting the position with the maximum pixel value in the saliency map as the ROI center coordinate, for example, to the blurring and sharpening module 240. Alternatively, to reduce random errors, determining the center position of the ROI may further include: determining the coordinates of a plurality of points whose pixel values are greater than or equal to a first preset value in each frame of saliency map, and outputting the average of the coordinates of the plurality of points as the center coordinate of the ROI of each frame of image, for example, to the blurring and sharpening module 240. The first preset value may be set according to practical applications and may be any value within the saliency map pixel value range of 0 to 1, for example, 0.8, but the embodiment of the present application is not limited thereto.
Of the two manners of determining the center coordinate of the ROI, the second reduces the random error in the distribution of pixel points in the saliency map by averaging, so that the obtained ROI region is more accurate. The center position of the ROI may also be determined in other manners, and the embodiment of the present application is not limited thereto.
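The two ways of determining the ROI center described above can be sketched as follows (Python with NumPy); the threshold 0.8 is only the example value of the first preset value mentioned in the text, and the fallback to the maximum position is an added assumption for the case where no point exceeds the threshold.

```python
import numpy as np

def roi_center_max(saliency):
    """Option 1: the coordinates of the maximum pixel value."""
    row, col = np.unravel_index(np.argmax(saliency), saliency.shape)
    return int(col), int(row)                       # (x, y)

def roi_center_mean(saliency, first_preset=0.8):
    """Option 2: the average of the coordinates of all points whose value is
    greater than or equal to the first preset value."""
    rows, cols = np.where(saliency >= first_preset)
    if cols.size == 0:                              # assumed fallback if no point qualifies
        return roi_center_max(saliency)
    return float(cols.mean()), float(rows.mean())   # (x, y)
```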
Various ways may likewise be employed to determine the ROI range. For example, determining the range of the ROI may include: determining the size of the ROI according to the size of each frame of image. For example, the size of the ROI may be set to a preset multiple (e.g., 1/4) of the size of each frame of image, such as by setting the length and width of the ROI to half of the length and width of the image, respectively.
However, in a scene with a wide heat distribution, the possible attention range of the human eye is large, and therefore, the size of the ROI may be adjusted. That is, determining the range of the ROI may further include: determining coordinates of a plurality of points of which the pixel values are greater than or equal to a second preset value in the saliency map corresponding to each frame of image; determining two points from the plurality of points, wherein the absolute value of the difference between the abscissa and/or the ordinate of the two points is the maximum; the size of the ROI is determined based on the absolute value of the difference between the abscissas and/or the absolute value of the difference between the ordinates of the two points.
In particular, the size of the ROI may be adjusted according to at least one of the following steps: if the absolute value of the horizontal coordinate difference of the two points is greater than or equal to the preset length, determining the length of the ROI of each frame of image according to the ratio of the absolute value of the horizontal coordinate difference of the two points to the preset length, for example, expanding the preset length to the length of the ROI according to the ratio; if the absolute value of the difference value of the horizontal coordinates of the two points is smaller than the preset length, determining the preset length as the length of the ROI of each frame of image; if the absolute value of the difference between the vertical coordinates of the two points is greater than or equal to the preset width, determining the width of the ROI of each frame of image according to the ratio of the absolute value of the difference between the vertical coordinates of the two points to the preset width, for example, expanding the preset width to be the width of the ROI according to the ratio; and if the absolute value of the difference value of the vertical coordinates of the two points is smaller than the preset width, determining the preset width as the width of the ROI of each frame of image.
It should be understood that the second preset value may be set according to practical applications and may be any value within the saliency map pixel value range of 0 to 1, for example, 0.7, but the embodiment of the present application is not limited thereto.
In this way, according to the set second preset value, the distribution of all points whose pixel values are greater than or equal to the second preset value in the saliency map is calculated, and the range of the heat distribution is described by the absolute values of the differences between the horizontal coordinates and between the vertical coordinates of every two of the qualifying points. For example, these absolute values may be compared with the default ROI size: if they are larger than the default ROI size, the ROI size may be increased; if they are smaller, the ROI size may be reduced or the default ROI size may be used directly, where the enlargement or reduction may be determined based on the ratio of the horizontal and vertical extents to the default ROI size. Through this process, the position of the ROI can be determined more accurately according to the actual situation, and the size of the ROI is adjusted for different scenes, thereby improving the experience.
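As an illustration of the ROI-size adjustment described above, the following sketch (Python with NumPy) measures the spread of the points whose values are at or above the second preset value (0.7 in the example) and compares it with a default ROI of half the image width and height; the exact scaling rule is one possible reading of the text rather than a definitive implementation.

```python
import numpy as np

def roi_size(saliency, second_preset=0.7):
    """Compares the spread of points at or above the second preset value with a
    default ROI whose length and width are half of the image, and keeps the larger
    of the two in each direction (one possible reading of the adjustment rule)."""
    img_h, img_w = saliency.shape
    default_w, default_h = img_w // 2, img_h // 2    # default ROI: 1/4 of the image area
    rows, cols = np.where(saliency >= second_preset)
    if cols.size == 0:
        return default_w, default_h
    spread_x = int(cols.max() - cols.min())          # max |x1 - x2| over qualifying points
    spread_y = int(rows.max() - rows.min())          # max |y1 - y2| over qualifying points
    roi_w = max(default_w, spread_x)                 # enlarge only when the heat is wider
    roi_h = max(default_h, spread_y)
    return roi_w, roi_h
```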
Therefore, the image processing method according to the embodiment of the present application takes into account that, in order to solve the bandwidth problem, the region outside the ROI is blurred through filtering so as to reduce the high-frequency information of the image, improve the compression rate and finally reduce the bandwidth; the determination of the ROI position is therefore particularly important for the image processing process. To obtain the position of the ROI, an eye tracking device and algorithm are generally used to detect and provide the position, but this causes a large delay, so that the position actually observed by human eyes is misaligned with the obtained ROI position and a high-quality video image cannot be observed. In the embodiment of the present application, a visual attention prediction model based on deep learning is adopted to replace the eye tracking device; the region of interest of human eyes can be predicted in real time according to the content of the video, a series of delay processes such as eye tracking can be avoided, the real-time performance and practicability of the system are improved, and the portability among platforms in the system is improved.
In addition, considering that the current visual saliency prediction scheme based on deep learning is mainly applied to processing of still images, when applied to video sequences, the extraction of inter-frame motion information of the video sequences requires design of features and a large amount of calculation, which results in slow progress of video saliency prediction. Therefore, in the embodiment of the application, based on deep learning, intra-frame spatial information and inter-frame motion information (time information) are respectively extracted through a calibrated large-scale video saliency data set and a network model combining CNN and RNN, so that the spatio-temporal characteristics of a video sequence are obtained, and end-to-end video saliency prediction is realized.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
The method of image processing according to the embodiment of the present application is described in detail above with reference to fig. 1 to 13, and the apparatus of image processing according to the embodiment of the present application will be described below with reference to fig. 14 to 16.
As shown in fig. 14, an apparatus 600 for image processing according to an embodiment of the present application includes: a first processing module 610, a second processing module 620, and a determination module 630. Specifically, the first processing module 610 is configured to: processing each frame of image in a plurality of frames of images by adopting a CNN structure to obtain a multi-channel characteristic image of each frame of image; the second processing module 620 is configured to: processing the multi-channel feature map of the multi-frame image by adopting an RNN structure to obtain a single-channel saliency map of each frame of image; the determining module 630 is configured to: and determining the ROI position of each frame of image according to the corresponding saliency map of each frame of image.
It should be understood that the apparatus 600 for image processing according to the embodiment of the present application may correspond to the method 300 for performing the embodiment of the present application, and the above and other operations and/or functions of the respective modules in the apparatus 600 are respectively for implementing the corresponding flows of the respective methods in fig. 1 to 13, and are not repeated herein for brevity.
It should be understood that the apparatus of the embodiments of the present application may also be implemented based on a memory and a processor, wherein each memory is used for storing instructions for executing the method of the embodiments of the present application, and the processor executes the instructions to make the apparatus execute the method of the embodiments of the present application.
Specifically, as shown in fig. 15, an apparatus 700 for image processing according to an embodiment of the present application includes: a processor 710 and a memory 720. In particular, the processor 710 and the memory 720 are coupled via a bus system, the memory 720 being configured to store instructions, and the processor 710 being configured to execute instructions stored by the memory 720. Processor 710 may invoke program code stored in memory 720 to perform the following operations: processing each frame of image in a plurality of frames of images by adopting a CNN structure to obtain a multi-channel characteristic image of each frame of image; processing the multi-channel feature map of the multi-frame image by adopting an RNN structure to obtain a single-channel saliency map of each frame of image; and determining the ROI position of each frame of image according to the corresponding saliency map of each frame of image.
Optionally, as an embodiment, the processor 710 is configured to: performing continuous convolution and/or pooling operations on each frame of image to obtain a plurality of spatial feature maps of each frame of image, wherein the plurality of spatial feature maps have different resolutions; performing deconvolution and/or convolution operations on each of the plurality of spatial feature maps to obtain a plurality of single-channel feature maps of each frame of image, the plurality of single-channel feature maps having the same resolution; and combining the plurality of single-channel feature maps of each frame of image into a multi-channel feature map of each frame of image.
Optionally, as an embodiment, the processor 710 is configured to: and according to a preset network model structure, carrying out continuous convolution and pooling operation on each frame of image to obtain at least three spatial feature maps of each frame of image, wherein the at least three spatial feature maps have different resolutions.
Optionally, as an embodiment, the preset network model structure is a VGG-16 structure, and the processor 710 is configured to: and according to the VGG-16 structure, carrying out five groups of convolution pooling operations on each frame of image so as to obtain three spatial feature maps of each frame of image, wherein the five groups of convolution pooling operations comprise 13 layers of convolution.
Optionally, as an embodiment, the resolution of each frame of image is w × h, and the three spatial feature maps have three different resolutions, each a fixed fraction of w × h obtained by the successive pooling operations. (The exact resolutions are given by formulas that appear only as embedded images in the original publication and are not reproduced here.)
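For the VGG-16 option specifically, one hedged reading is to tap the outputs of the last three pooling layers of the torchvision VGG-16 feature extractor; the tap indices and the resulting h/8 × w/8, h/16 × w/16 and h/32 × w/32 resolutions are assumptions, since the exact resolutions above are only available as formula images. No pretrained weights are loaded in this sketch.

```python
import torch
from torchvision.models import vgg16

backbone = vgg16().features        # 13 convolutional layers arranged in five conv-pool groups

def extract_spatial_maps(frame):   # frame: (N, 3, h, w) tensor
    maps, x = [], frame
    for i, layer in enumerate(backbone):
        x = layer(x)
        if i in (16, 23, 30):      # outputs of the 3rd, 4th and 5th pooling layers (assumed tap points)
            maps.append(x)         # roughly h/8 x w/8, h/16 x w/16 and h/32 x w/32
    return maps

# Example: spatial_maps = extract_spatial_maps(torch.rand(1, 3, 224, 224))
```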
optionally, as an embodiment, the processor 710 is configured to: performing deconvolution operation on each spatial feature map to obtain a plurality of feature maps with the same resolution; performing a convolution operation on each of the plurality of feature maps to obtain the plurality of single-channel feature maps.
Optionally, as an embodiment, the resolution of each frame of image is w × h, and the resolutions of the multiple feature maps are all w × h.
Optionally, as an embodiment, a deconvolution step size in the deconvolution operation is set to 2.
Optionally, as an embodiment, the processor 710 is configured to: applying a 1 × 1 convolution layer to each feature map to obtain the plurality of single-channel feature maps.
Optionally, as an embodiment, the RNN structure is an LSTM structure.
Optionally, as an embodiment, the processor 710 is configured to: sequentially inputting the multi-channel feature maps of the multi-frame images to the LSTM structure in time order, so as to output the processed multi-channel feature map corresponding to each frame of image; and applying a 1 × 1 convolution layer to each processed feature map to obtain the single-channel saliency map of each frame of image.
Optionally, as an embodiment, the processor 710 is configured to: at the t-th moment, inputting the multi-channel feature map of the t-th frame image into the LSTM structure, and, according to the cell state c_{t-1} and hidden state h_{t-1} output at the (t-1)-th moment, outputting the processed multi-channel feature map corresponding to the t-th frame image and outputting the cell state c_t and hidden state h_t at the t-th moment, where t is any positive integer.
Optionally, as an embodiment, the size of the recurrent layer of the LSTM structure is set to 10.
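The text calls the temporal model an LSTM but does not specify how the spatial feature maps are fed to it. The sketch below therefore uses a convolutional LSTM cell (an assumption), so that the recurrent output remains a processed multi-channel feature map to which a 1 × 1 convolution is applied, as described above; the hidden width of 10 mirrors the recurrent-layer-size option, and all other numbers are illustrative.

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    """Convolutional LSTM cell: gates are computed with a convolution, so the hidden and
    cell states stay spatial and the output is itself a multi-channel feature map."""
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, h_prev, c_prev):
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h_prev], dim=1)), 4, dim=1)
        c = torch.sigmoid(f) * c_prev + torch.sigmoid(i) * torch.tanh(g)
        h = torch.sigmoid(o) * torch.tanh(c)
        return h, c

C_IN, C_HID, H, W = 3, 10, 64, 64                    # hidden width 10 mirrors the size-10 option
cell = ConvLSTMCell(C_IN, C_HID)
to_saliency = nn.Conv2d(C_HID, 1, kernel_size=1)     # 1x1 convolution readout

h_t = torch.zeros(1, C_HID, H, W)                    # hidden state h_t
c_t = torch.zeros(1, C_HID, H, W)                    # cell state c_t
saliency_maps = []
for feat in torch.rand(8, 1, C_IN, H, W):            # 8 frames of multi-channel feature maps, batch 1
    h_t, c_t = cell(feat, h_t, c_t)                  # state carried from moment t-1 to moment t
    saliency_maps.append(torch.sigmoid(to_saliency(h_t)))   # single-channel saliency map per frame
```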
Optionally, as an embodiment, the processor 710 is configured to: and determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, wherein the ROI position comprises the central coordinate and/or the size of the ROI.
Optionally, as an embodiment, the processor 710 is configured to: and determining the coordinate of the point with the maximum pixel value in the saliency map corresponding to each frame of image as the central coordinate of the ROI of each frame of image.
Optionally, as an embodiment, the processor 710 is configured to: determining coordinates of a plurality of points whose pixel values are greater than or equal to a first preset value in the saliency map corresponding to each frame of image; and determining the average of the coordinates of the plurality of points as the center coordinate of the ROI of each frame of image.
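A small NumPy sketch of the two centre-coordinate options above follows; `first_preset` is a hypothetical threshold name, and the fallback to the global maximum when no point reaches the threshold is an addition made here, not something stated in the text.

```python
import numpy as np

def roi_center_max(saliency):
    """Centre coordinate = the point with the largest pixel value in the saliency map."""
    y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
    return int(x), int(y)

def roi_center_mean(saliency, first_preset):
    """Centre coordinate = average coordinate of points whose pixel value >= first_preset."""
    ys, xs = np.nonzero(saliency >= first_preset)
    if xs.size == 0:                  # assumed fallback: no point reaches the threshold
        return roi_center_max(saliency)
    return float(xs.mean()), float(ys.mean())

# Example: cx, cy = roi_center_mean(np.random.rand(480, 640), first_preset=0.9)
```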
Optionally, as an embodiment, the processor 710 is configured to: and determining the size of the ROI of each frame of image according to the size of each frame of image.
Optionally, as an embodiment, the processor 710 is configured to: setting the size of the ROI of each frame of image to 1/4 of the size of that frame of image.
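A one-line reading of this option is sketched below; note that "1/4 of the size" could mean a quarter of the area (half of each side) or a quarter of each side, and the sketch assumes the former.

```python
def roi_size_quarter(frame_w, frame_h):
    # Halving each side gives an ROI whose area is one quarter of the frame area (assumed reading).
    return frame_w // 2, frame_h // 2

# Example: roi_w, roi_h = roi_size_quarter(1920, 1080)   # -> (960, 540)
```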
Optionally, as an embodiment, the processor 710 is configured to: determining coordinates of a plurality of points of which the pixel values are greater than or equal to a second preset value in the saliency map corresponding to each frame of image; determining two points in the plurality of points, wherein the absolute value of the horizontal coordinate difference value of the two points is the maximum and/or the absolute value of the vertical coordinate difference value of the two points is the maximum; and determining the size of the ROI of each frame of image according to the absolute value of the horizontal coordinate difference value and/or the absolute value of the vertical coordinate difference value of the two points.
Optionally, as an embodiment, the processor 710 is configured to perform at least one of the following steps: if the absolute value of the horizontal coordinate difference value of the two points is greater than or equal to a preset length, determining the length of the ROI of each frame of image according to the ratio of the absolute value of the horizontal coordinate difference value of the two points to the preset length; if the absolute value of the horizontal coordinate difference value of the two points is smaller than the preset length, determining the preset length as the length of the ROI of each frame of image; if the absolute value of the difference value of the vertical coordinates of the two points is larger than or equal to the preset width, determining the width of the ROI of each frame of image according to the ratio of the absolute value of the difference value of the vertical coordinates of the two points to the preset width; and if the absolute value of the difference value of the vertical coordinates of the two points is smaller than the preset width, determining the preset width as the width of the ROI of each frame of image.
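The sketch below illustrates one possible reading of these size rules. The largest horizontal and vertical spreads over the thresholded points stand in for the two extreme points, and the exact mapping from the ratio to the ROI length/width is not given in the text, so rounding the ratio up and scaling the preset value is an assumption; `second_preset`, `preset_len` and `preset_wid` are hypothetical parameter names.

```python
import math
import numpy as np

def roi_size_from_spread(saliency, second_preset, preset_len, preset_wid):
    ys, xs = np.nonzero(saliency >= second_preset)   # points at or above the second preset value
    if xs.size == 0:                                  # assumed fallback when no point qualifies
        return preset_len, preset_wid
    dx = int(xs.max() - xs.min())   # largest |x1 - x2| over the qualifying points
    dy = int(ys.max() - ys.min())   # largest |y1 - y2| over the qualifying points
    length = preset_len if dx < preset_len else preset_len * math.ceil(dx / preset_len)
    width = preset_wid if dy < preset_wid else preset_wid * math.ceil(dy / preset_wid)
    return length, width

# Example: roi_len, roi_wid = roi_size_from_spread(np.random.rand(480, 640), 0.9, 200, 150)
```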
It should be understood that the apparatus 700 for image processing according to the embodiment of the present application may correspond to the apparatus 600 in the embodiment of the present application and may correspond to performing the method 300 in the embodiment of the present application, and the above and other operations and/or functions of various parts in the apparatus 700 are respectively for implementing corresponding flows of the methods in fig. 1 to fig. 13, and are not described herein again for brevity.
Therefore, in the image processing apparatus according to the embodiment of the present application, to address the bandwidth problem in image processing, the region outside the ROI is blurred by filtering, which reduces the high-frequency information of the image, improves the compression rate, and ultimately lowers the required bandwidth. To obtain the position of the ROI accurately, the embodiment of the present application adopts a visual attention prediction model based on deep learning, which can predict the region of interest of the human eye in real time according to the video content, thereby reducing system latency, improving the real-time performance and practicality of the system, and improving the portability between platforms in the system.
In addition, current deep-learning-based visual saliency prediction schemes are mainly applied to still images; when they are applied to video sequences, extracting inter-frame motion information requires hand-designed features and a large amount of computation, which has slowed progress on video saliency prediction. Therefore, in the embodiment of the present application, based on deep learning, intra-frame spatial information and inter-frame motion information (temporal information) are extracted separately, using a calibrated large-scale video saliency data set and a network model that combines a CNN and an RNN, so as to obtain the spatio-temporal characteristics of the video sequence and realize end-to-end video saliency prediction.
It should be understood that the Processor mentioned in the embodiments of the present Application may be a Central Processing Unit (CPU), and may also be other general purpose processors, Digital Signal Processors (DSP), Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, and the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It will also be appreciated that the memory referred to in the embodiments of the application may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile Memory may be a Read-Only Memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash Memory. Volatile Memory can be Random Access Memory (RAM), which acts as external cache Memory. By way of example, but not limitation, many forms of RAM are available, such as Static random access memory (Static RAM, SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic random access memory (Synchronous DRAM, SDRAM), Double Data Rate Synchronous Dynamic random access memory (DDR SDRAM), Enhanced Synchronous SDRAM (ESDRAM), Synchronous link SDRAM (SLDRAM), and Direct Rambus RAM (DR RAM).
It should be noted that when the processor is a general-purpose processor, a DSP, an ASIC, an FPGA or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, the memory (memory module) is integrated in the processor.
It should be noted that the memory described herein is intended to comprise, without being limited to, these and any other suitable types of memory.
Embodiments of the present application further provide a computer-readable storage medium, on which instructions are stored, and when the instructions are executed on a computer, the instructions cause the computer to execute the method of each of the above method embodiments.
An embodiment of the present application further provides a computing device, which includes the computer-readable storage medium.
The embodiments of the present application can be applied to aircraft, especially in the field of unmanned aerial vehicles.
Optionally, the embodiment of the application further provides a movable platform. In particular, fig. 16 shows a schematic block diagram of a movable platform 800 of an embodiment of the present application. As shown in fig. 16, the movable platform 800 includes: a body 810; a power system 820 disposed in the body 810 for providing power to the movable platform 800; one or more processors 830 configured to perform the method 300 of embodiments of the application. The processor 830 may include the image processing apparatus 600 of the embodiment of the present application; optionally, the movable platform 800 may further include: an image capturing device for capturing an image, so that the processor 830 can process the captured image, for example, perform any one of the image processing methods described above on the captured image.
The movable platform 800 in the embodiments of the present application may be any movable device that can move in any suitable environment, such as in the air (e.g., a fixed-wing aircraft, a rotorcraft, or an aircraft with neither fixed wings nor rotors), in water (e.g., a ship or submarine), on land (e.g., an automobile or train), in space (e.g., a space plane, a satellite, or a probe), or any combination thereof. The movable device may be an aircraft, such as an Unmanned Aerial Vehicle (UAV).
The body 810 may also be referred to as a fuselage, which may include a central frame and one or more arms coupled to the central frame and extending radially from it. A foot rest (landing gear) is connected to the fuselage and is used to support the UAV when it lands.
The power system 820 may include an electronic speed governor (also referred to as an electronic speed controller), one or more propellers, and one or more motors corresponding to the one or more propellers, wherein each motor is connected between the electronic speed governor and a propeller, and the motors and the propellers are disposed on the corresponding arms; the electronic speed governor is used to receive a driving signal generated by the flight controller and provide a driving current to the motor according to the driving signal so as to control the rotational speed of the motor. The motor is used to drive the propeller to rotate, thereby providing power for the flight of the UAV and enabling the UAV to move in one or more degrees of freedom. It should be understood that the motor may be a DC motor or an AC motor. In addition, the motor may be a brushless motor or a brushed motor.
The image acquisition device includes a photographing apparatus (e.g., a camera, a video camera, etc.) or a vision sensor (e.g., a monocular camera or a binocular/multi-view camera, etc.).
Optionally, the embodiment of the present application further provides an unmanned flight system including an unmanned aerial vehicle. Specifically, the unmanned flight system 900 including the UAV is described below with reference to fig. 17. The present embodiment is described by taking a rotorcraft as an example.
Unmanned flight system 900 may include UAV 910, carrier 920, display device 930, and remote control 940. UAV 910 may include, among other things, a power system 950, a flight control system 960, and a frame 970. UAV 910 may wirelessly communicate with a remote control 940 and a display device 930.
The frame 970 may include a fuselage and a foot rest (also referred to as a landing gear). The fuselage may include a central frame and one or more arms connected to the central frame, the one or more arms extending radially from the central frame. The foot rest is connected to the fuselage for support when UAV 910 lands.
The power system 950 may include an electronic speed governor (also referred to as an electronic speed controller) 951, one or more propellers 953, and one or more motors 952 corresponding to the one or more propellers 953, wherein each motor 952 is connected between the electronic speed governor 951 and a propeller 953, and the motors 952 and the propellers 953 are provided on the corresponding arms; the electronic speed governor 951 is configured to receive a driving signal generated by the flight controller and supply a driving current to the motor 952 according to the driving signal to control the rotational speed of the motor 952. The motor 952 is used to drive the propeller to rotate so as to provide power for the flight of UAV 910, which enables UAV 910 to move in one or more degrees of freedom. It should be understood that the motor 952 may be a DC motor or an AC motor. In addition, the motor 952 may be a brushless motor or a brushed motor.
Flight control system 960 may include a flight controller 961 and a sensing system 962. The sensing system 962 is used to measure the attitude information of the UAV. The sensing System 962 may include, for example, at least one of a gyroscope, an electronic compass, an IMU (Inertial Measurement Unit), a visual sensor (e.g., a monocular camera, a binocular camera, etc.), a GPS (Global Positioning System), a barometer, a visual Inertial navigation odometer, and the like. Flight controller 961 is used to control the flight of UAV 910, for example, the flight of UAV 910 may be controlled based on attitude information measured by sensing system 962.
The carrier 920 may be used to carry a load 980. For example, when the carrier 920 is a pan-tilt (gimbal) apparatus, the load 980 may be a photographing device (e.g., a camera, a video camera, etc.); embodiments of the present application are not limited thereto, and the carrier may also be, for example, a carrying apparatus for carrying a weapon or another load.
Display device 930 is located at the ground end of unmanned flight system 900, may communicate wirelessly with UAV 910, and may be used to display attitude information of UAV 910. In addition, when the load 980 is a photographing device, the image captured by the photographing device may also be displayed on the display device 930. It should be understood that the display device 930 may be a stand-alone device or may be provided in the remote control 940. For example, the receiving and decoding module 260 may be installed in the display device to display the image after the blurring and sharpening processing has been performed.
A remote control 940 is located at the ground end of unmanned flight system 900 and may communicate wirelessly with UAV 910 for remote maneuvering of UAV 910. The remote control device may be, for example, a remote controller or a remote control device installed with an APP (Application) that controls the UAV, such as a smartphone, a tablet computer, or the like. In the embodiment of the application, the input of the user is received through the remote control device, which may mean that the UAV is controlled through an input device such as a dial, a button, a key, or a joystick on the remote control device or a User Interface (UI) on the remote control device.
In addition to the above-mentioned movable devices, the embodiments of the present application can also be applied to other devices equipped with cameras, such as Virtual Reality (VR) or Augmented Reality (AR) glasses.
It should be understood that the division of circuits, sub-units of the various embodiments of the present application is illustrative only. Those of ordinary skill in the art will appreciate that the various illustrative circuits, sub-circuits, and sub-units described in connection with the embodiments disclosed herein can be split or combined.
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The processes or functions according to the embodiments of the present application are generated in whole or in part when the computer instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored on a computer readable storage medium or transmitted from one computer readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wire (e.g., coaxial cable, fiber optic, Digital Subscriber Line (DSL)), or wireless (e.g., infrared, wireless, microwave, etc.). The computer-readable storage medium can be any available medium that can be accessed by a computer or a data storage device, such as a server, a data center, etc., that incorporates one or more of the available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., Digital Video Disk (DVD)), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.
It should be understood that the embodiments of the present application are described with respect to a total bit width of 16 bits (bit), and the embodiments of the present application may be applied to other bit widths.
It should be appreciated that reference throughout this specification to "one embodiment" or "an embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment is included in at least one embodiment of the present application. Thus, the appearances of the phrases "in one embodiment" or "in an embodiment" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.
It should be understood that in the embodiment of the present application, "B corresponding to a" means that B is associated with a, from which B can be determined. It should also be understood that determining B from a does not mean determining B from a alone, but may be determined from a and/or other information.
It should be understood that the term "and/or" herein is merely one type of association relationship that describes an associated object, meaning that three relationships may exist, e.g., a and/or B may mean: a exists alone, A and B exist simultaneously, and B exists alone. In addition, the character "/" herein generally indicates that the former and latter related objects are in an "or" relationship.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (44)

1. A method of image processing, comprising:
processing each frame of image in a plurality of frames of images by adopting a Convolutional Neural Network (CNN) structure to obtain a multichannel characteristic diagram of each frame of image;
processing the multi-channel feature map of the multi-frame image by adopting a Recurrent Neural Network (RNN) structure to obtain a single-channel saliency map of each frame image;
and determining the position of a region of interest (ROI) of each frame of image according to the corresponding saliency map of each frame of image.
2. The method according to claim 1, wherein the processing each frame of image in multiple frames of images by using a Convolutional Neural Network (CNN) structure to obtain a multi-channel feature map of each frame of image comprises:
performing continuous convolution and/or pooling operations on each frame of image to obtain a plurality of spatial feature maps of each frame of image, wherein the plurality of spatial feature maps have different resolutions;
performing deconvolution and/or convolution operations on each of the plurality of spatial feature maps to obtain a plurality of single-channel feature maps of each frame of image, the plurality of single-channel feature maps having the same resolution;
and combining the plurality of single-channel feature maps of each frame of image into a multi-channel feature map of each frame of image.
3. The method according to claim 2, wherein the performing successive convolution and/or pooling operations on each frame of image to obtain a plurality of spatial feature maps of each frame of image comprises:
and according to a preset network model structure, carrying out continuous convolution and pooling operation on each frame of image to obtain at least three spatial feature maps of each frame of image, wherein the at least three spatial feature maps have different resolutions.
4. The method of claim 3, wherein the predetermined network model structure is a VGG-16 structure,
the performing continuous convolution and pooling operations on each frame of image according to the preset network model structure to obtain at least three spatial feature maps of each frame of image comprises:
and according to the VGG-16 structure, carrying out five groups of convolution pooling operations on each frame of image so as to obtain three spatial feature maps of each frame of image, wherein the five groups of convolution pooling operations comprise 13 layers of convolution.
5. The method according to claim 4, wherein the resolution of each frame of image is w × h, and the resolutions of the three spatial feature maps are three different fixed fractions of w × h. (The exact resolutions are given by formulas that appear only as embedded images in the original publication and are not reproduced here.)
6. the method according to any one of claims 2 to 5, wherein the performing deconvolution and/or convolution operations on each of the plurality of spatial feature maps to obtain a plurality of single-channel feature maps of each frame image comprises:
performing deconvolution operation on each spatial feature map to obtain a plurality of feature maps with the same resolution;
performing a convolution operation on each of the plurality of feature maps to obtain the plurality of single-channel feature maps.
7. The method of claim 6, wherein the resolution of each frame of image is w × h, and the resolutions of the plurality of feature maps are w × h.
8. The method according to claim 6 or 7, wherein the deconvolution step size in the deconvolution operation is set to 2.
9. The method according to any one of claims 6 to 8, wherein the convolving each of the plurality of feature maps to obtain the plurality of single-channel feature maps comprises:
and adopting 1 × 1 convolution layers for each feature map to obtain the plurality of single-channel feature maps.
10. The method according to any one of claims 1 to 9, wherein the RNN structure is a long short-term memory (LSTM) network structure.
11. The method according to claim 10, wherein the processing the multi-channel feature map of the multiple frames of images by using a Recurrent Neural Network (RNN) structure to obtain a single-channel saliency map of each frame of image comprises:
sequentially inputting the multi-channel feature maps of the multi-frame images to the LSTM structure according to a time sequence so as to output the multi-channel processed feature maps corresponding to each frame of image;
and applying 1 × 1 convolution layers to the processed feature map to obtain a single-channel saliency map of each frame image.
12. The method of claim 11, wherein the sequentially inputting the multi-channel feature maps of the multiple frames of images to the LSTM structure in time order to output the processed feature maps of multiple channels corresponding to each frame of image comprises:
at the t-th moment, inputting the multi-channel feature map of the t-th frame image into the LSTM structure, and, according to the cell state c_{t-1} and hidden state h_{t-1} output at the (t-1)-th moment, outputting the processed multi-channel feature map corresponding to the t-th frame image and outputting the cell state c_t and hidden state h_t at the t-th moment, where t is any positive integer.
13. The method of claim 11 or 12, wherein the size of the recurrent layer of the LSTM structure is set to 10.
14. The method according to any one of claims 1 to 13, wherein the determining the ROI position of each frame of image according to the corresponding saliency map of each frame of image comprises:
and determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, wherein the ROI position comprises the central coordinate and/or the size of the ROI.
15. The method according to claim 14, wherein the determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, the ROI position comprising the center coordinate and/or size of the ROI comprises:
and determining the coordinate of the point with the maximum pixel value in the saliency map corresponding to each frame of image as the central coordinate of the ROI of each frame of image.
16. The method according to claim 14, wherein the determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, the ROI position comprising the center coordinate and/or size of the ROI comprises:
determining coordinates of a plurality of points of which pixel values are greater than or equal to a first preset value in a saliency map corresponding to each frame of image;
determining an average of coordinates of the plurality of points as a center coordinate of the ROI of the each frame image.
17. The method according to any one of claims 14 to 16, wherein the determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, the ROI position comprising the center coordinate and/or size of the ROI comprises:
and determining the size of the ROI of each frame of image according to the size of each frame of image.
18. The method of claim 17, wherein the determining the size of the ROI of each frame of image according to the size of each frame of image comprises:
setting a size of the ROI of the each frame image to 1/4 of the size of the each frame image.
19. The method according to any one of claims 14 to 16, wherein the determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, the ROI position comprising the center coordinate and/or size of the ROI comprises:
determining coordinates of a plurality of points of which the pixel values are greater than or equal to a second preset value in the saliency map corresponding to each frame of image;
determining two points in the plurality of points, wherein the absolute value of the horizontal coordinate difference value of the two points is the maximum and/or the absolute value of the vertical coordinate difference value of the two points is the maximum;
and determining the size of the ROI of each frame of image according to the absolute value of the horizontal coordinate difference value and/or the absolute value of the vertical coordinate difference value of the two points.
20. The method according to claim 19, wherein the determining the size of the ROI of each frame image according to the absolute value of the abscissa difference value and/or the absolute value of the ordinate difference value of the two points comprises at least one of the following steps:
if the absolute value of the horizontal coordinate difference value of the two points is greater than or equal to a preset length, determining the length of the ROI of each frame of image according to the ratio of the absolute value of the horizontal coordinate difference value of the two points to the preset length;
if the absolute value of the horizontal coordinate difference value of the two points is smaller than the preset length, determining the preset length as the length of the ROI of each frame of image;
if the absolute value of the difference value of the vertical coordinates of the two points is larger than or equal to the preset width, determining the width of the ROI of each frame of image according to the ratio of the absolute value of the difference value of the vertical coordinates of the two points to the preset width;
and if the absolute value of the difference value of the vertical coordinates of the two points is smaller than the preset width, determining the preset width as the width of the ROI of each frame of image.
21. An apparatus for image processing, comprising: a processor and a memory, wherein
the memory is configured to store instructions, and
the processor is configured to execute the instructions stored in the memory; when the processor executes the instructions stored in the memory, the processor is configured to:
processing each frame of image in a plurality of frames of images by adopting a Convolutional Neural Network (CNN) structure to obtain a multichannel characteristic diagram of each frame of image;
processing the multi-channel feature map of the multi-frame image by adopting a Recurrent Neural Network (RNN) structure to obtain a single-channel saliency map of each frame image;
and determining the position of a region of interest (ROI) of each frame of image according to the corresponding saliency map of each frame of image.
22. The apparatus of claim 21, wherein the processor is configured to:
performing continuous convolution and/or pooling operations on each frame of image to obtain a plurality of spatial feature maps of each frame of image, wherein the plurality of spatial feature maps have different resolutions;
performing deconvolution and/or convolution operations on each of the plurality of spatial feature maps to obtain a plurality of single-channel feature maps of each frame of image, the plurality of single-channel feature maps having the same resolution;
and combining the plurality of single-channel feature maps of each frame of image into a multi-channel feature map of each frame of image.
23. The apparatus of claim 22, wherein the processor is configured to:
and according to a preset network model structure, carrying out continuous convolution and pooling operation on each frame of image to obtain at least three spatial feature maps of each frame of image, wherein the at least three spatial feature maps have different resolutions.
24. The apparatus of claim 23, wherein the predetermined network model structure is a VGG-16 structure,
the processor is configured to:
and according to the VGG-16 structure, carrying out five groups of convolution pooling operations on each frame of image so as to obtain three spatial feature maps of each frame of image, wherein the five groups of convolution pooling operations comprise 13 layers of convolution.
25. The apparatus of claim 24, wherein the resolution of each frame of image is w × h, and the resolutions of the three spatial feature maps are three different fixed fractions of w × h. (The exact resolutions are given by formulas that appear only as embedded images in the original publication and are not reproduced here.)
26. the apparatus of any one of claims 22 to 25, wherein the processor is configured to:
performing deconvolution operation on each spatial feature map to obtain a plurality of feature maps with the same resolution;
performing a convolution operation on each of the plurality of feature maps to obtain the plurality of single-channel feature maps.
27. The apparatus of claim 26, wherein the resolution of each frame of image is w × h, and the resolutions of the plurality of feature maps are w × h.
28. The apparatus according to claim 26 or 27, wherein the deconvolution step size in the deconvolution operation is set to 2.
29. The apparatus of any one of claims 26 to 28, wherein the processor is configured to:
and adopting 1 × 1 convolution layers for each feature map to obtain the plurality of single-channel feature maps.
30. The apparatus of any one of claims 21 to 29, wherein the RNN structure is a long short-term memory (LSTM) network structure.
31. The apparatus of claim 30, wherein the processor is configured to:
sequentially inputting the multi-channel feature maps of the multi-frame images to the LSTM structure according to a time sequence so as to output the multi-channel processed feature maps corresponding to each frame of image;
and applying 1 × 1 convolution layers to the processed feature map to obtain a single-channel saliency map of each frame image.
32. The apparatus of claim 31, wherein the processor is configured to:
at the t-th moment, inputting the multi-channel feature map of the t-th frame image into the LSTM structure, and, according to the cell state c_{t-1} and hidden state h_{t-1} output at the (t-1)-th moment, outputting the processed multi-channel feature map corresponding to the t-th frame image and outputting the cell state c_t and hidden state h_t at the t-th moment, where t is any positive integer.
33. The apparatus of claim 31 or 32, wherein the size of the recurrent layer of the LSTM structure is set to 10.
34. The apparatus according to any one of claims 21 to 33, wherein the processor is configured to:
and determining the ROI position of each frame of image according to pixel values of different positions in the saliency map corresponding to each frame of image, wherein the ROI position comprises the central coordinate and/or the size of the ROI.
35. The apparatus of claim 34, wherein the processor is configured to:
and determining the coordinate of the point with the maximum pixel value in the saliency map corresponding to each frame of image as the central coordinate of the ROI of each frame of image.
36. The apparatus of claim 34, wherein the processor is configured to:
determining coordinates of a plurality of points of which pixel values are greater than or equal to a first preset value in a saliency map corresponding to each frame of image;
determining an average of coordinates of the plurality of points as a center coordinate of the ROI of the each frame image.
37. The apparatus of any one of claims 34 to 36, wherein the processor is configured to:
and determining the size of the ROI of each frame of image according to the size of each frame of image.
38. The apparatus of claim 37, wherein the processor is configured to:
setting a size of the ROI of the each frame image to 1/4 of the size of the each frame image.
39. The apparatus of any one of claims 34 to 36, wherein the processor is configured to:
determining coordinates of a plurality of points of which the pixel values are greater than or equal to a second preset value in the saliency map corresponding to each frame of image;
determining two points in the plurality of points, wherein the absolute value of the horizontal coordinate difference value of the two points is the maximum and/or the absolute value of the vertical coordinate difference value of the two points is the maximum;
and determining the size of the ROI of each frame of image according to the absolute value of the horizontal coordinate difference value and/or the absolute value of the vertical coordinate difference value of the two points.
40. The apparatus of claim 39, wherein the processor is configured to perform at least one of the following:
if the absolute value of the horizontal coordinate difference value of the two points is greater than or equal to a preset length, determining the length of the ROI of each frame of image according to the ratio of the absolute value of the horizontal coordinate difference value of the two points to the preset length;
if the absolute value of the horizontal coordinate difference value of the two points is smaller than the preset length, determining the preset length as the length of the ROI of each frame of image;
if the absolute value of the difference value of the vertical coordinates of the two points is larger than or equal to the preset width, determining the width of the ROI of each frame of image according to the ratio of the absolute value of the difference value of the vertical coordinates of the two points to the preset width;
and if the absolute value of the difference value of the vertical coordinates of the two points is smaller than the preset width, determining the preset width as the width of the ROI of each frame of image.
41. A computer-readable storage medium, having stored thereon a computer program which, when executed, implements the method of any one of claims 1 to 20.
42. A computer program product comprising instructions which, when executed by a computer, cause the computer to perform the method of any one of claims 1 to 20.
43. A movable platform, comprising:
a body;
the power system is arranged in the machine body and used for providing power for the movable platform; and
one or more processors configured to perform the method of any of the above claims 1-20.
44. A system, comprising: the movable platform of claim 43 and a display device, wherein
the movable platform is connected to the display device in a wired or wireless manner.
CN202080004902.6A 2020-05-28 2020-05-28 Image processing method, device, movable platform and system Pending CN112673380A (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/092827 WO2021237555A1 (en) 2020-05-28 2020-05-28 Image processing method and device, movable platform, and system

Publications (1)

Publication Number Publication Date
CN112673380A true CN112673380A (en) 2021-04-16

Family

ID=75413910

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202080004902.6A Pending CN112673380A (en) 2020-05-28 2020-05-28 Image processing method, device, movable platform and system

Country Status (2)

Country Link
CN (1) CN112673380A (en)
WO (1) WO2021237555A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103969636A (en) * 2014-05-14 2014-08-06 中国人民解放军国防科学技术大学 Landmine target discrimination method with sparse time frequency representation conducted by means of echo reconstitution
CN106482645A (en) * 2016-09-28 2017-03-08 西南交通大学 A kind of track ripple grinds detection method
WO2019136591A1 (en) * 2018-01-09 2019-07-18 深圳大学 Salient object detection method and system for weak supervision-based spatio-temporal cascade neural network
CN110060239A (en) * 2019-04-02 2019-07-26 广州大学 A kind of defect inspection method for bottle bottleneck
CN110717424A (en) * 2019-09-26 2020-01-21 南昌大学 Real-time tiny face detection method based on preprocessing mechanism
US20200097754A1 (en) * 2018-09-25 2020-03-26 Honda Motor Co., Ltd. Training saliency

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10691743B2 (en) * 2014-08-05 2020-06-23 Sri International Multi-dimensional realization of visual content of an image collection
US10002313B2 (en) * 2015-12-15 2018-06-19 Sighthound, Inc. Deeply learned convolutional neural networks (CNNS) for object localization and classification
CN107247952B (en) * 2016-07-28 2020-11-10 哈尔滨工业大学 Deep supervision-based visual saliency detection method for cyclic convolution neural network
CN108521862B (en) * 2017-09-22 2021-08-17 深圳市大疆创新科技有限公司 Method and apparatus for tracking shots
CN107967474A (en) * 2017-11-24 2018-04-27 上海海事大学 A kind of sea-surface target conspicuousness detection method based on convolutional neural networks
CN110598610B (en) * 2019-09-02 2022-02-22 北京航空航天大学 Target significance detection method based on neural selection attention

Also Published As

Publication number Publication date
WO2021237555A1 (en) 2021-12-02

Similar Documents

Publication Publication Date Title
CN110033003B (en) Image segmentation method and image processing device
CN110062934B (en) Determining Structure and Motion in Images Using Neural Networks
US11064178B2 (en) Deep virtual stereo odometry
CN106529495B (en) Obstacle detection method and device for aircraft
EP3488603B1 (en) Methods and systems for processing an image
CN112567201A (en) Distance measuring method and apparatus
CN112771539A (en) Using three-dimensional data predicted from two-dimensional images using neural networks for 3D modeling applications
CN112991413A (en) Self-supervision depth estimation method and system
CN109003297B (en) Monocular depth estimation method, device, terminal and storage medium
US20210009270A1 (en) Methods and system for composing and capturing images
EP3872760B1 (en) Method and apparatus of training depth estimation network, and method and apparatus of estimating depth of image
CN112528974B (en) Distance measuring method and device, electronic equipment and readable storage medium
CN110009675A (en) Generate method, apparatus, medium and the equipment of disparity map
WO2022165722A1 (en) Monocular depth estimation method, apparatus and device
CN115147328A (en) Three-dimensional target detection method and device
CN112907557A (en) Road detection method, road detection device, computing equipment and storage medium
CN116194951A (en) Method and apparatus for stereoscopic based 3D object detection and segmentation
CN114170290A (en) Image processing method and related equipment
CN111866493B (en) Image correction method, device and equipment based on head-mounted display equipment
KR20230104105A (en) Hole filling method for virtual 3 dimensional model and computing device therefor
CN111866492A (en) Image processing method, device and equipment based on head-mounted display equipment
CN112673380A (en) Image processing method, device, movable platform and system
JP2024521816A (en) Unrestricted image stabilization
CN115249269A (en) Object detection method, computer program product, storage medium, and electronic device
CN116883770A (en) Training method and device of depth estimation model, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination