WO2021237555A1 - Image processing method, apparatus, movable platform, and system - Google Patents

Image processing method, apparatus, movable platform, and system

Info

Publication number
WO2021237555A1
WO2021237555A1 (application PCT/CN2020/092827, CN2020092827W)
Authority
WO
WIPO (PCT)
Prior art keywords
image
frame
roi
feature maps
channel
Prior art date
Application number
PCT/CN2020/092827
Other languages
English (en)
French (fr)
Inventor
李恒杰
赵文军
Original Assignee
深圳市大疆创新科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市大疆创新科技有限公司
Priority to CN202080004902.6A (CN112673380A)
Priority to PCT/CN2020/092827 (WO2021237555A1)
Publication of WO2021237555A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00: Pattern recognition
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/08: Learning methods
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/24: Aligning, centring, orientation detection or correction of the image
    • G06V 10/40: Extraction of image or video features

Definitions

  • This application relates to the field of image processing, and in particular to an image processing method, device, movable platform, and system.
  • the captured images and videos are usually transmitted in real time, which requires a large bandwidth.
  • To reduce bandwidth, the image can be blurred by filtering. For example, the original pixel values can be kept for the region of interest (ROI) in the image, while the high-frequency information of the other regions is reduced by means such as mean filtering or Gaussian filtering with the same or different filter radii.
  • This application provides an image processing method, device, movable platform, and system, which can determine the position of the ROI more accurately, reduce the system delay, and improve the real-time performance of the system.
  • In a first aspect, an image processing method is provided, including: using a convolutional neural network (CNN) structure to process each frame of a multi-frame image to obtain a multi-channel feature map of each frame of image; using a recurrent neural network (RNN) structure to process the multi-channel feature maps of the multi-frame image to obtain a single-channel saliency map of each frame of image; and determining the position of the region of interest (ROI) of each frame of image according to the saliency map corresponding to that frame of image.
  • In a second aspect, an image processing apparatus is provided, which is used to execute the method in the foregoing first aspect or any possible implementation of the first aspect. Specifically, the apparatus includes units for executing the method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • In a third aspect, an image processing apparatus is provided, including a storage unit and a processor, where the storage unit is used to store instructions and the processor is used to execute the instructions stored in the storage unit; when the processor executes the stored instructions, the execution causes the processor to perform the method in the first aspect or any possible implementation of the first aspect.
  • In a fourth aspect, a computer-readable medium is provided for storing a computer program, where the computer program includes instructions for executing the method in the first aspect or any possible implementation of the first aspect.
  • In a fifth aspect, a computer program product including instructions is provided; when a computer runs the instructions of the computer program product, the computer executes the image processing method in the first aspect or any possible implementation of the first aspect. Specifically, the computer program product can run on the image processing apparatus of the third aspect described above.
  • In a sixth aspect, a movable platform is provided, including: a body; a power system, which is provided in the body and is used to provide power to the movable platform; and one or more processors, which are used to execute the image processing method in the above-mentioned first aspect or any possible implementation of the first aspect.
  • In a seventh aspect, a system is provided, including the movable platform of the sixth aspect described above and a display device, where the movable platform is connected to the display device in a wired or wireless manner.
  • Fig. 1 is a schematic block diagram of an image processing system.
  • Fig. 2 is a schematic diagram of the position of an ROI in an image in an embodiment of the present application.
  • Fig. 3 is a schematic diagram of the data flow in the image processing system in Fig. 1.
  • Fig. 4 is a schematic block diagram of an image processing system according to an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of an image processing method according to an embodiment of the present application.
  • Fig. 6 is a schematic diagram of a flow of an image processing method according to an embodiment of the present application.
  • FIG. 7 is a schematic diagram of image processing using a CNN structure according to an embodiment of the present application.
  • Figures 8 and 9 are general schematic diagrams of the RNN structure.
  • Figure 10 is a general schematic diagram of the LSTM structure.
  • FIG. 11 is a schematic diagram of image processing using an RNN structure according to an embodiment of the present application.
  • Fig. 12 is a schematic diagram of a saliency map with concentrated heat distribution in an embodiment of the present application.
  • FIG. 13 is a schematic diagram of a saliency map with a dispersed heat distribution in an embodiment of the present application.
  • Fig. 14 is a schematic block diagram of an image processing device according to an embodiment of the present application.
  • FIG. 15 is a schematic block diagram of another image processing device according to an embodiment of the present application.
  • FIG. 16 is a schematic block diagram of a movable platform according to an embodiment of the present application.
  • Fig. 17 is a schematic diagram of an unmanned aerial system according to an embodiment of the present application.
  • the captured images and videos are usually transmitted in real time, which requires a large bandwidth.
  • To reduce bandwidth, the image can be blurred by filtering. For example, the original pixel values can be kept for the region of interest (ROI) in the image, while the high-frequency information of the other regions is reduced by means such as mean filtering or Gaussian filtering with the same or different filter radii.
  • FIG. 1 shows a schematic block diagram of the image processing system 100.
  • the system 100 may include a lens acquisition module 110, an image signal processing (ISP) module 120, a blur (Blur) and sharpening (Sharpen) module 130, an encoding and transmission module 140, a receiving and decoding module 150, and an eye tracking module 160.
  • the image collected by the lens acquisition module 110 is first processed by the ISP module 120. Specifically, the image collected by the lens acquisition module 110 may be converted into an electrical signal by a photosensitive element and then transmitted to the ISP module 120 for processing, thereby being converted into a digital image.
  • the lens acquisition module 110 may refer to a camera.
  • the image collected by the lens acquisition module 110 may refer to one or more frames of images obtained by the camera through taking pictures, or may refer to video images obtained through video recording; there is no restriction on this.
  • the image signal output by the ISP module 120 is combined with the position information of the ROI and is input to the blur sharpening module 130 to perform corresponding processing on different regions in the video image.
  • In the processed image, the area to be processed is outlined in the form of a box, circle, ellipse, irregular polygon, or the like, and is called the ROI.
  • the ISP module 120 may refer to a processor or a processing circuit, and the embodiment of the present application is not limited thereto.
  • the blur sharpening module 130 performs sharpen and blur operations inside and outside the ROI according to the position of the ROI and combined with bandwidth requirements. While improving the image and video quality in the ROI, it greatly reduces the bandwidth occupation during image transmission.
  • Figure 2 is a schematic diagram of the position of the ROI in the image. As shown in Figure 2, the inside of the ROI is the Sharpen area, and the outside is the Blur area. The reason for being divided into multiple layers from the inside to the outside is to achieve the effect of gradual blur. In other words, if the Sharpen operation is performed in the ROI area, the image details are enhanced and the image quality is improved; while the outside of the ROI is gradually blurred from the inside to the outside, which can achieve a smooth blur effect.
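  • As a concrete illustration of the gradual blur described above, the following is a minimal sketch (not the patent's implementation; OpenCV is assumed, and the ring width and Gaussian kernel sizes are illustrative values) that keeps the ROI at its original pixel values and applies Gaussian blur with increasing strength in concentric rings around it:

```python
import cv2
import numpy as np

def blur_outside_roi(image, roi, ring=40, radii=(5, 11, 21)):
    """Keep the ROI sharp and blur the surroundings in concentric rings.

    image : H x W x 3 uint8 array.
    roi   : (x, y, w, h) rectangle to keep at original quality.
    ring  : width in pixels of each gradual-blur ring (assumed value).
    radii : odd Gaussian kernel sizes, weakest to strongest blur (assumed values).
    """
    x, y, w, h = roi
    out = cv2.GaussianBlur(image, (radii[-1], radii[-1]), 0)  # strongest blur everywhere
    # Paste progressively less-blurred rings back, from the outside toward the ROI.
    for i, r in reversed(list(enumerate(radii[:-1]))):
        pad = ring * (i + 1)
        x0, y0 = max(x - pad, 0), max(y - pad, 0)
        x1 = min(x + w + pad, image.shape[1])
        y1 = min(y + h + pad, image.shape[0])
        blurred = cv2.GaussianBlur(image, (r, r), 0)
        out[y0:y1, x0:x1] = blurred[y0:y1, x0:x1]
    out[y:y + h, x:x + w] = image[y:y + h, x:x + w]           # ROI keeps original pixels
    return out
```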
  • the blur sharpening module 130 sends the processed image to the encoding and transmission module 140 for encoding, and it is transmitted from the drone end to the ground end; at the ground end, the receiving and decoding module 150 receives and decodes the encoded data and displays the result to the user, allowing the user to observe a locally high-quality image.
  • the eye tracking module 160 obtains the region of interest, that is, the ROI according to the movement of the human eye, so that the ROI position information is transmitted to the blur sharpening module 130.
  • the encoding and transmission module 140 may include an encoder or a transmitter; the receiving and decoding module 150 may include a decoder or a receiver, but the embodiment of the present application is not limited thereto.
  • FIG. 3 is a schematic diagram of the data flow in the system 100. From the data flow shown in FIG. 3, it can be seen that there are multiple delays in the system 100, for example, the round-trip delay of image transmission (that is, the encoding, sending, decoding, and uplink feedback processes included in D1), the delay of the eye tracking algorithm (i.e., D2 and D3), and the delay of the Blur and Sharpen processing algorithm (i.e., D4).
  • Because the eye tracking module 160 is on the ground side and the round-trip transmission distance between the UAV and the ground side is long, the transmission time is long, which makes the associated delays D1, D2, and D3 relatively large and results in a high overall delay of the system 100.
  • the position of the ROI may also be misaligned, so that the human eyes on the ground cannot observe high-quality images, but observe blurred images.
  • this application provides a method and system for image processing.
  • a visual attention prediction scheme based on deep learning can accurately predict the center position of the ROI, thereby avoiding the above-mentioned delay process and improving the real-time performance of the entire system; at the same time, the eye-tracking equipment can be omitted, improving practicability and portability.
  • FIG. 4 shows a schematic block diagram of an image processing system 200 according to an embodiment of the present application.
  • the system 200 includes a lens acquisition module 210, an ISP module 220, a visual saliency prediction module 230, a blur (Blur) sharpening (Sharpen) module 240, an encoding and transmission module 250, and a receiving and decoding module 260 .
  • In the system 200, the image captured by the lens acquisition module 210 is first processed by the ISP module 220.
  • the image signal output after processing by the ISP module 220 is combined with the position information of the ROI, and is input to the blur sharpening module 240 to perform corresponding processing on different regions in the video image.
  • the ROI position information in the system 200 is determined by the visual saliency prediction module 230 based on the results of deep learning; that is, the visual saliency prediction module 230 outputs the position of the ROI, for example, the center position of the ROI, which is used as a parameter of the blur sharpening module 240.
  • the lens acquisition module 210 may follow the related description of the lens acquisition module 110 (for example, the lens acquisition module 210 may be a camera); the ISP module 220 may follow the related description of the ISP module 120 (for example, the ISP module 220 may be a processing circuit); the blur sharpening module 240 may follow the related description of the blur sharpening module 130; the encoding and transmission module 250 may follow the related description of the encoding and transmission module 140, and the receiving and decoding module 260 may follow the related description of the receiving and decoding module 150 (for example, the encoding and transmission module 250 may include an encoder and a transmitter, and the receiving and decoding module 260 may include a receiver and a decoder). For the sake of brevity, details are not repeated here.
  • the system 200 of the embodiment of the present application determines the ROI by means of the visual saliency prediction module, without manual feature design or complex calculations, and can predict the visual attention area of the video in real time from end to end, avoiding the delay caused by the eye tracking solution.
  • For example, the visual saliency prediction module 230 may be a processing circuit.
  • FIG. 5 shows a schematic flowchart of an image processing method 300 according to an embodiment of the present application.
  • the method 300 may be executed by an image processing apparatus, for example, the image processing apparatus may be the aforementioned visual saliency prediction module 230, but the embodiment of the present application is not limited thereto.
  • the method 300 includes: S310, using a convolutional neural network (Convolutional Neural Network, CNN) structure to process each frame of a multi-frame image to obtain the multi-channel feature map (feature-map) of each frame of image; S320, using a recurrent neural network (RNN) structure to process the multi-channel feature maps of the multi-frame image to obtain the single-channel saliency map (saliency-map) of each frame of image; and S330, determining the ROI position of each frame of image according to the saliency map corresponding to that frame of image.
  • Deep learning originated from the study of neural networks.
  • Convolutional neural networks (CNN) were proposed and developed rapidly on this basis. With the CNN, the amount of network parameters is greatly reduced, and the training speed and accuracy are significantly improved; as a result, neural networks have developed rapidly and are widely applied in the image field.
  • the model used in the research work of dynamic human eye attention point detection detects visual attention in dynamic scenes by combining static saliency features with time domain information (such as optical flow field, time domain difference, etc.).
  • the work can be regarded as an extension of the existing static saliency model considering motion information.
  • These models rely heavily on feature engineering, so the performance of the models is limited by hand-designed features.
  • the image processing method of the embodiment of the present application uses a combination of CNN and RNN.
  • the main feature of CNN is that it has a strong ability to extract high-dimensional features. It is widely used in the field of image vision, such as target detection, face recognition and other practical applications, and has produced great success.
  • deep learning algorithms such as the CNN do not need manually selected features; instead, these features are extracted by learning and training the network, and the extracted features are then used to generate subsequent decision results to achieve functions such as classification and recognition.
  • The main feature of the RNN is its ability to mine time-series information and deep semantic expression in data; it has achieved breakthroughs in speech recognition, language models, machine translation, and time-series analysis.
  • recurrent neural networks are used to process and predict sequence data.
  • In the embodiment of the present application, the collected video sequence can be processed to obtain the saliency area of each frame of image, and the location of the ROI is then determined accordingly.
  • In this way, the delay caused in the existing system by the eye tracking module on the ground side and by the round-trip image transmission can be avoided, the real-time performance of the original system can be greatly improved, and the practicality and portability of the system can be improved.
  • FIG. 6 shows a schematic diagram of the flow of a method 300 of image processing.
  • a multi-frame image to be processed is acquired, and the multi-frame image may refer to any multi-frame image in the video data to be processed.
  • the to-be-processed video data may refer to the data output by the ISP module 220 in the system 200 shown in FIG. 4, for example, it may be a yuv format video sequence output by the ISP module 220.
  • the to-be-processed video data may include T-frame images, that is, the first frame image to the T-th frame image in FIG. 6, where the t-th frame image represents any one of the images.
  • Each frame of image to be processed is input into the CNN structure, and the CNN structure processes each frame of the multi-frame image to output the corresponding multi-channel feature map of that frame; the multi-channel feature map is then input into the RNN structure (FIG. 6 takes the long short-term memory (LSTM) structure as an example), which outputs a processed multi-channel feature map for each frame of image; after merging, the saliency map of each frame of image is finally obtained, from which the position of the ROI area in each frame of image can be determined.
  • S310 the CNN structure is used to process each frame of the multi-frame image to obtain the multi-channel feature map of each frame of the image.
  • S310 may include: S311, performing continuous convolution and/or pooling operations on each frame of image to obtain multiple spatial feature maps of each frame of image, where the multiple spatial feature maps have different resolutions; S312, performing deconvolution and/or convolution operations on each of the multiple spatial feature maps to obtain multiple single-channel feature maps of each frame of image, where the multiple single-channel feature maps have the same resolution; and S313, combining the multiple single-channel feature maps of each frame of image into the multi-channel feature map of that frame of image.
  • FIG. 7 shows a schematic diagram of image processing using a CNN structure.
  • S311 for each frame of video image (for example, the image 400 in FIG. 7), continuous convolution and pooling operations (for example, operations 410-450 shown in FIG. 7) are performed to extract The spatial feature map of each frame of image (for example, the images obtained after operations 430-450 in FIG. 7), wherein the multiple spatial feature maps have different resolutions.
  • the neural network structure can be reasonably selected according to the needs of the actual application.
  • this application uses the network structure of the convolutional layer part of the pre-trained VGG-16 neural network, because the VGG-16 network can extract the spatial characteristics of each frame of image well; however, the application is not limited to this.
  • other convolution blocks of the same level deep convolutional neural network can also be selected to replace this part, such as ResNet, GoogLeNet, etc.
  • VGG-Net is a new deep convolutional neural network developed by Oxford University Computer Vision Group (Visual Geometry Group) and others.
  • VGG-Net explored the relationship between the depth of a convolutional neural network and its performance, and successfully constructed deep convolutional neural networks with 16 to 19 layers, proving that increasing the depth of the network can, to a certain extent, improve its final performance: the error rate is greatly reduced, the scalability is strong, and the generalization when migrating to other image data is also very good.
  • VGG is widely used to extract image features.
  • the network structure of the convolutional layer part of the pre-trained VGG-16 neural network is used.
  • the convolution-pooling processing part of the CNN structure can include 5 groups with 13 convolutional layers in total, which are used to obtain multiple spatial feature maps of each frame of image. In FIG. 7, obtaining three spatial feature maps of each frame of image is taken as an example, and the three spatial feature maps have different resolutions.
  • the resolution of the image 400 is w × h
  • the image 400 may have multiple channels.
  • For example, the image 400 may be any one frame of image in the YUV-format video sequence output by the ISP module 220 shown in FIG. 4, which has three channels.
  • the image 400 undergoes a series of convolution and/or pooling operations.
  • operations 410-450 may be sequentially performed.
  • Specifically, operation 410 includes two convolution operations, and the resolution remains w × h; operation 420 includes two convolutions and one pooling, where the pooling operation reduces the resolution to (w/2) × (h/2); operation 430 includes three convolutions and one pooling, where the pooling operation reduces the resolution to (w/4) × (h/4), and a spatial feature map is output after operation 430; the image after operation 430 then undergoes operation 440, which includes three convolutions and one pooling, where the pooling operation reduces the resolution to (w/8) × (h/8), and a spatial feature map is output after operation 440; the image after operation 440 then undergoes operation 450, which includes three convolutions and one pooling, where the pooling operation reduces the resolution to (w/16) × (h/16), and the last spatial feature map is output after operation 450.
  • In this way, three spatial feature maps with different resolutions can be output, with 256, 512, and 512 channels respectively; in addition, starting from the input image 400, the feature-map channel numbers after the five groups of convolution operations (410-450) are 64, 128, 256, 512, and 512, respectively.
  • The three resolutions obtained here, namely (w/4) × (h/4), (w/8) × (h/8), and (w/16) × (h/16), are only an example, and the embodiment of this application is not limited to this.
  • the number and resolutions of the spatial feature maps obtained in the embodiment of this application can be set according to actual applications; for example, feature maps of other resolutions can be selected, or more or fewer spatial feature maps can be obtained.
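  • To make the five-group, 13-layer convolution-pooling structure above concrete, the following PyTorch sketch (framework and layer names are assumptions; the patent does not name a framework) mirrors the described layout and returns the three spatial feature maps at (w/4) × (h/4), (w/8) × (h/8), and (w/16) × (h/16). Since the convolutional layer shapes match standard VGG-16 and only the pooling placement differs, the pretrained VGG-16 convolutional weights could in principle be copied into these layers.

```python
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch, n_convs, pool):
    # n_convs 3x3 conv + ReLU layers, optionally followed by 2x2 max pooling.
    layers = []
    for i in range(n_convs):
        layers += [nn.Conv2d(in_ch if i == 0 else out_ch, out_ch, 3, padding=1),
                   nn.ReLU(inplace=True)]
    if pool:
        layers.append(nn.MaxPool2d(2))
    return nn.Sequential(*layers)

class SpatialBackbone(nn.Module):
    """VGG-16-style backbone: 5 groups / 13 conv layers, as in operations 410-450.

    Group channel widths are 64, 128, 256, 512, 512; only groups 2-5 pool, so the
    outputs of the last three groups have resolutions w/4 x h/4, w/8 x h/8, w/16 x h/16.
    """
    def __init__(self, in_ch=3):
        super().__init__()
        self.g1 = conv_block(in_ch, 64, 2, pool=False)   # operation 410: w x h
        self.g2 = conv_block(64, 128, 2, pool=True)      # operation 420: w/2 x h/2
        self.g3 = conv_block(128, 256, 3, pool=True)     # operation 430: w/4 x h/4
        self.g4 = conv_block(256, 512, 3, pool=True)     # operation 440: w/8 x h/8
        self.g5 = conv_block(512, 512, 3, pool=True)     # operation 450: w/16 x h/16

    def forward(self, x):
        f2 = self.g2(self.g1(x))
        f3 = self.g3(f2)          # 256 channels, w/4 x h/4
        f4 = self.g4(f3)          # 512 channels, w/8 x h/8
        f5 = self.g5(f4)          # 512 channels, w/16 x h/16
        return f3, f4, f5
```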
  • In S312, deconvolution and/or convolution operations are performed on each of the multiple spatial feature maps with different resolutions obtained after S311, to obtain multiple single-channel feature maps of each frame of image, where the multiple single-channel feature maps have the same resolution.
  • Specifically, a deconvolution operation can be performed on each of the obtained spatial feature maps to obtain multiple feature maps with the same resolution, and a convolution operation can then be performed on each of these feature maps to obtain the multiple single-channel feature maps (for example, the images respectively obtained after operations 460-480 in FIG. 7).
  • In the example of FIG. 7, the resolutions of the five spatial feature maps output by the 5 groups of convolutions (5 convolutional blocks) included in S311 are 1, 1/2, 1/4, 1/8, and 1/16 times the resolution of the input video image 400, respectively.
  • the image feature-map obtained by S311 needs to be up-sampled, and the deconvolution layer is set for up-sampling to increase the resolution.
  • Therefore, three deconvolution modules (corresponding to operations 460-480 in Figure 7) need to be set correspondingly to separately up-sample the three spatial feature maps output by the last three groups of convolution modules.
  • the reason for choosing the last three sets of convolution modules here is that the fusion of multiple layers of higher-level features can synthesize richer spatial features, thereby improving the accuracy of the final saliency prediction.
  • Here, a deconvolution step size of 2 is taken as an example, which means that each deconvolution layer doubles the resolution. Since the resolutions of the three spatial feature maps output by S311 are (w/4) × (h/4), (w/8) × (h/8), and (w/16) × (h/16), the three subsequent deconvolution modules respectively include 2, 3, and 4 deconvolution layers to obtain multiple feature maps with a resolution of w × h. At the same time, since each of these w × h feature maps with the same resolution is usually still a multi-channel feature map at this point, a 1x1 convolutional layer is connected at the end of each deconvolution module to combine the channels of each feature map and output multiple single-channel feature maps, which greatly reduces the amount of data and calculation in subsequent modules.
  • Specifically, the feature map with resolution (w/4) × (h/4) output after operation 430 is deconvolved twice, and finally a 1x1 convolutional layer is connected to obtain a single-channel feature map with a resolution of w × h; that is, operation 460 includes 2 deconvolution layers and 1 layer of 1x1 convolution, and the channel numbers of the corresponding output feature maps are 64, 32, and 1, respectively.
  • Similarly, the feature map with resolution (w/8) × (h/8) output after operation 440 is subjected to three deconvolutions, and finally a 1x1 convolutional layer is connected to obtain another single-channel feature map with a resolution of w × h; that is, operation 470 includes 3 deconvolution layers and 1 layer of 1x1 convolution, and the channel numbers of the corresponding output feature maps are 128, 64, 32, and 1, respectively.
  • Likewise, the feature map with resolution (w/16) × (h/16) output after operation 450 is deconvolved four times in operation 480, and finally a 1x1 convolutional layer is connected to obtain another single-channel feature map with a resolution of w × h; that is, operation 480 includes 4 deconvolution layers and 1 layer of 1x1 convolution, and the channel numbers of the corresponding output feature maps are 256, 128, 64, 32, and 1, respectively.
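  • The three deconvolution modules of operations 460-480 and the merge of operation 490 can be sketched as follows (PyTorch assumed; the 4x4 transposed-convolution kernel is an assumed choice that exactly doubles the resolution at stride 2; the channel numbers follow the text):

```python
import torch
import torch.nn as nn

def deconv_module(in_ch, channels):
    """Stride-2 ConvTranspose2d layers (each doubles the resolution) followed by a
    1x1 convolution that merges the remaining channels into a single-channel map."""
    layers, prev = [], in_ch
    for ch in channels:                     # e.g. operation 460: 256 -> 64 -> 32
        layers += [nn.ConvTranspose2d(prev, ch, kernel_size=4, stride=2, padding=1),
                   nn.ReLU(inplace=True)]
        prev = ch
    layers.append(nn.Conv2d(prev, 1, kernel_size=1))   # -> 1 channel at w x h
    return nn.Sequential(*layers)

class SaliencyDecoder(nn.Module):
    """Operations 460-480 plus the merge of operation 490: three up-sampling branches
    whose single-channel outputs are concatenated into a 3-channel feature map."""
    def __init__(self):
        super().__init__()
        self.up460 = deconv_module(256, [64, 32])            # from w/4  x h/4
        self.up470 = deconv_module(512, [128, 64, 32])       # from w/8  x h/8
        self.up480 = deconv_module(512, [256, 128, 64, 32])  # from w/16 x h/16

    def forward(self, f3, f4, f5):
        maps = [self.up460(f3), self.up470(f4), self.up480(f5)]
        return torch.cat(maps, dim=1)       # operation 490: 3-channel map, w x h
```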
  • In S313, the multiple single-channel feature maps of each frame of image obtained after S311 and S312 are combined into the multi-channel feature map of that frame of image (for example, operation 490 in FIG. 7).
  • In FIG. 7, the three single-channel feature maps obtained after operations 460-480 can be combined into a three-channel feature map.
  • the obtained multi-channel feature map of each frame of image is used as the input of the next RNN structure.
  • the final operation 490 combines the outputs of the three deconvolution modules to obtain a three-channel feature map.
  • the three-channel feature map is not only the output of the CNN structure, but also the input of the subsequent RNN structure at each time step.
  • the RNN structure is used to process the multi-channel feature map of the multi-frame image to obtain the single-channel saliency map of each frame of image.
  • x_t is the input at the current time t
  • h_{t-1} represents the state received from the previous node
  • y_t is the output at the current time
  • h_t is the state input to the next node
  • the RNN structure used in the embodiment of the present application takes LSTM as an example, but the embodiment of the present application is not limited to this.
  • LSTM is a special kind of RNN. Simply put, LSTM can perform better in longer sequences than ordinary RNNs.
  • LSTM contains three gating signals: input gate, forget gate, and output gate.
  • Specifically, the input gate determines, based on x_t and h_{t-1}, which information is added to the state c_{t-1} in order to generate the new state c_t; the forget gate determines which information in c_{t-1} the recurrent neural network should forget; and the output gate determines the output h_t at time t based on the latest state c_t, the output h_{t-1} of the previous time, and the current input x_t.
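  • For reference, the standard LSTM gate equations consistent with this description are (a common textbook formulation, not equations reproduced from the patent):

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)} \\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)} \\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)} \\
\tilde{c}_t &= \tanh(W_c x_t + U_c h_{t-1} + b_c) && \text{(candidate state)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tilde{c}_t, \qquad h_t = o_t \odot \tanh(c_t)
\end{aligned}
```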
  • In the embodiment of the present application, each frame of image is input into the CNN structure to obtain its multi-channel feature map; the multi-channel feature map corresponding to each frame of image is then sequentially input into the LSTM at the corresponding time to obtain a multi-channel output, and finally a 1x1 convolutional layer is connected to obtain the final single-channel saliency map of each frame of image.
  • Taking the t-th frame of image as an example (t can be any positive integer), the t-th frame of image is converted into a multi-channel feature map after passing through the CNN structure; then, at the t-th moment, the multi-channel feature map of the t-th frame of image is input into the LSTM structure and combined with the cell state c_{t-1} and the hidden state h_{t-1} output at the (t-1)-th moment, so as to output the processed multi-channel feature map corresponding to the t-th frame of image, while also outputting the cell state c_t and the hidden state h_t at the t-th moment.
  • the three-channel feature map corresponding to the image 400 output by operation 490 in FIG. 7 is taken as an example.
  • the three-channel feature map is the input 500 in FIG. 11.
  • the feature map 500 is sequentially input into the LSTM at the corresponding time.
  • the three-channel feature map 510 is then output, with the resolution still w × h, and finally a 1x1 convolutional layer is connected to obtain the single-channel saliency map 520 of each frame of image.
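  • The patent does not state whether the LSTM operates on vectors or directly on feature maps; the following sketch assumes a convolutional LSTM cell so that the w × h spatial layout of the three-channel feature maps is preserved, with a 1x1 convolution and a sigmoid (assumed, to keep values in [0, 1]) producing the single-channel saliency map at each time step:

```python
import torch
import torch.nn as nn

class ConvLSTMCell(nn.Module):
    # A minimal convolutional LSTM cell: all four gates are computed by one
    # convolution over the concatenation of the input x_t and hidden state h_{t-1}.
    def __init__(self, in_ch, hid_ch, k=3):
        super().__init__()
        self.hid_ch = hid_ch
        self.gates = nn.Conv2d(in_ch + hid_ch, 4 * hid_ch, k, padding=k // 2)

    def forward(self, x, state):
        h, c = state
        i, f, o, g = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
        i, f, o, g = torch.sigmoid(i), torch.sigmoid(f), torch.sigmoid(o), torch.tanh(g)
        c = f * c + i * g          # new cell state c_t
        h = o * torch.tanh(c)      # new hidden state h_t
        return h, c

class TemporalSaliencyHead(nn.Module):
    # Runs the ConvLSTM over a sequence of 3-channel feature maps and maps the
    # hidden state of each step to a single-channel saliency map via a 1x1 conv.
    def __init__(self, in_ch=3, hid_ch=3):
        super().__init__()
        self.cell = ConvLSTMCell(in_ch, hid_ch)
        self.to_saliency = nn.Conv2d(hid_ch, 1, kernel_size=1)

    def forward(self, feats):               # feats: (T, B, 3, H, W)
        T, B, _, H, W = feats.shape
        h = feats.new_zeros(B, self.cell.hid_ch, H, W)
        c = torch.zeros_like(h)
        out = []
        for t in range(T):
            h, c = self.cell(feats[t], (h, c))
            out.append(torch.sigmoid(self.to_saliency(h)))   # values in [0, 1]
        return torch.stack(out)             # (T, B, 1, H, W)
```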
  • the size of the cyclic layer of the LSTM in the embodiment of the present application can be set according to actual applications, and can be set to any value.
  • For example, the size of the recurrent layer of the LSTM can be set to 10, that is, the sequence length of the training data can be set to 10: during training, 10 consecutive frames of images are input in each iteration, spatial features are first extracted through the CNN structure, the LSTM then extracts temporal features, and finally the spatio-temporal features are combined to generate the saliency maps of the video sequence.
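  • A minimal training-loop sketch for the 10-frame clips described above; the loss (binary cross-entropy against the calibrated fixation maps), the Adam optimizer, and the dummy model and data shown for shape illustration are all assumptions not specified by the patent:

```python
import torch
import torch.nn as nn

def train_on_clips(model, loader, epochs=1, lr=1e-4, device="cpu"):
    """Train a video-saliency model on short clips.

    model  : maps a clip (B, T, 3, H, W) to saliency maps (B, T, 1, H, W) in [0, 1].
    loader : yields (clip, fixation_map) pairs with the same shapes.
    """
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    bce = nn.BCELoss()
    model.to(device).train()
    for _ in range(epochs):
        for clip, target in loader:
            clip, target = clip.to(device), target.to(device)
            loss = bce(model(clip), target)
            opt.zero_grad()
            loss.backward()
            opt.step()

# Tiny stand-in model and data, only to show the expected shapes (T = 10 frames).
if __name__ == "__main__":
    class DummyModel(nn.Module):
        def __init__(self):
            super().__init__()
            self.head = nn.Conv2d(3, 1, 1)
        def forward(self, x):                      # x: (B, T, 3, H, W)
            b, t, c, h, w = x.shape
            y = torch.sigmoid(self.head(x.reshape(b * t, c, h, w)))
            return y.reshape(b, t, 1, h, w)

    clips = [(torch.rand(2, 10, 3, 64, 64), torch.rand(2, 10, 1, 64, 64))]
    train_on_clips(DummyModel(), clips, epochs=1)
```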
  • the training data should select a video saliency calibration data set in the YUV format.
  • the training data used may be the Simon Fraser University (SFU) public human eye-tracking video data set.
  • This data set is a calibrated human eye saliency video data set, all in YUV format.
  • the training set, validation set, and test set can be divided according to a ratio of 8:1:1.
  • data in the YUV format is taken as an example for description, and the YUV format may include YUV444, YUV422, YUV420 and other formats.
  • the UV component has a down-sampling operation, which causes the resolution of the data in each channel to be inconsistent.
  • the up-sampling operation can be performed on the two channels of UV, so that the resolution of the three channels of YUV is the same.
  • the up-sampling method can be bilinear interpolation.
  • the Y channel can also be down-sampled to unify the three channels to the resolution of UV, thereby solving the problem of different resolutions of the three channels of YUV.
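  • A sketch of both channel-alignment options (up-sampling U/V with bilinear interpolation, or down-sampling Y to the chroma resolution), assuming planar YUV420 input and using OpenCV for resizing:

```python
import cv2
import numpy as np

def align_yuv420_channels(y, u, v, mode="upsample"):
    """Make the Y, U, V planes of a YUV420 frame the same resolution.

    y    : H x W luma plane; u, v: (H/2) x (W/2) chroma planes.
    mode : "upsample" enlarges U/V to the Y resolution with bilinear
           interpolation; "downsample" shrinks Y to the U/V resolution instead.
    Returns an array of shape H x W x 3 (or H/2 x W/2 x 3) with aligned channels.
    """
    if mode == "upsample":
        h, w = y.shape
        u_up = cv2.resize(u, (w, h), interpolation=cv2.INTER_LINEAR)
        v_up = cv2.resize(v, (w, h), interpolation=cv2.INTER_LINEAR)
        return np.stack([y, u_up, v_up], axis=-1)
    else:  # unify to the UV resolution by down-sampling Y
        h2, w2 = u.shape
        y_dn = cv2.resize(y, (w2, h2), interpolation=cv2.INTER_LINEAR)
        return np.stack([y_dn, u, v], axis=-1)
```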
  • In order to further improve real-time performance, the following alternatives can be adopted: (1) when the network is trained and used, the input image is down-sampled (down-sampling generally does not affect the distribution and motion information of objects in the scene), which greatly reduces the amount of data computed by the network and increases the speed; (2) for the YUV video format used in the embodiments of this application, considering that the Y channel represents brightness information, contains most of the object category and motion information, and is what the human eye is most sensitive to, it is possible to train and predict on the Y channel only, reducing the amount of data and improving real-time performance. If the above two real-time enhancement operations are adopted, the delay can be reduced to 1 frame, which greatly improves the real-time performance of the system and enhances the interactive experience.
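  • The two real-time enhancements can be illustrated with a small preprocessing helper (OpenCV assumed; the 1/2 down-sampling factor is an example value, not specified by the patent):

```python
import cv2
import numpy as np

def fast_preprocess(y_plane, scale=0.5):
    """Real-time-friendly preprocessing: keep only the Y (luma) channel and
    down-sample it before feeding the saliency network.

    y_plane : H x W luma plane of a YUV frame (uint8).
    scale   : down-sampling factor (0.5 is an assumed example value).
    Returns a 1 x 1 x h x w float32 array normalized to [0, 1].
    """
    small = cv2.resize(y_plane, None, fx=scale, fy=scale,
                       interpolation=cv2.INTER_AREA)
    x = small.astype(np.float32) / 255.0
    return x[np.newaxis, np.newaxis, ...]   # add batch and channel dimensions
```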
  • In S330, the ROI position of each frame of image is determined according to the saliency map corresponding to that frame of image.
  • the values of all positions in the saliency map are between 0 and 1, and this value represents the predicted value of the human eye’s attention to the area.
  • The saliency maps differ in the degree of concentration of the heat distribution (that is, the degree of attention of the human eye), as illustrated in Figures 12 and 13.
  • the heat in Fig. 12 is relatively concentrated, and the object category and movement information are obvious; while in Fig. 13, the heat is scattered, there are no obvious objects and movements in the scene, and the picture is relatively flat.
  • the ROI position of each saliency map can be determined in different ways, where the ROI position may include the center position of the ROI and the range of the ROI.
  • multiple methods can be used for determining the center position of the ROI.
  • determining the center position of the ROI may include: outputting the position with the largest pixel value in the saliency map as the center coordinates of the ROI, for example, may be output to the blur sharpening module 240.
  • determining the center position of the ROI may also include: determining the coordinates of multiple points whose pixel values are greater than or equal to a first preset value in the saliency map of each frame; the average of these coordinates is determined as the center coordinates of the ROI of that frame of image and output, for example, to the blur sharpening module 240.
  • the first preset value can be set according to actual applications, and any value between 1 and 0 of the pixel value distribution range of the saliency map can be set, for example, it can be set to 0.8, but the embodiment of the present application is not limited to this .
  • the second method can reduce the random error of the pixel point distribution in the saliency map by averaging, so that the obtained ROI area will be more accurate.
  • other methods may be used to determine the center position of the ROI, but the embodiment of the present application is not limited to this.
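  • The two ways of determining the ROI center described above can be sketched as follows (NumPy assumed; 0.8 follows the example first preset value mentioned in the text):

```python
import numpy as np

def roi_center(saliency, threshold=0.8):
    """Return the ROI center (x, y) from a saliency map with values in [0, 1].

    Option 1: the position of the largest pixel value.
    Option 2: the mean coordinate of all points >= threshold, which averages
              out random errors in the pixel distribution.
    """
    # Option 1: arg-max position.
    y_max, x_max = np.unravel_index(np.argmax(saliency), saliency.shape)

    # Option 2: mean of the coordinates of all sufficiently salient points.
    ys, xs = np.nonzero(saliency >= threshold)
    if len(xs) > 0:
        return int(xs.mean()), int(ys.mean())
    return int(x_max), int(y_max)            # fall back to the arg-max position
```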
  • determining the range of the ROI may include: determining the size of the ROI according to the size of each frame of image.
  • the size of the ROI can usually be set to a preset multiple (for example, 1/4) of the size of each frame of image, for example, it can be achieved by setting the length and width of the ROI to half the size of the image.
  • determining the range of the ROI may also include: determining the coordinates of multiple points in the saliency map corresponding to each frame of image whose pixel values are greater than or equal to a second preset value; determining two points among these points for which the absolute value of the difference between the abscissas and/or the absolute value of the difference between the ordinates is the largest; and determining the size of the ROI according to the absolute value of the difference between the abscissas of the two points and/or the absolute value of the difference between their ordinates.
  • Further, the size of the ROI can be adjusted according to at least one of the following steps: if the absolute value of the difference between the abscissas of the two points is greater than or equal to a preset length, the length of the ROI of each frame of image is determined according to the ratio of that absolute value to the preset length, i.e., the preset length is expanded to the length of the ROI according to the ratio; if the absolute value of the abscissa difference between the two points is less than the preset length, the preset length is determined as the length of the ROI of each frame of image; if the absolute value of the ordinate difference of the two points is greater than or equal to a preset width, the width of the ROI of each frame of image is determined according to the ratio of that absolute value to the preset width, i.e., the preset width is expanded to the width of the ROI according to the ratio; if the absolute value of the ordinate difference between the two points is less than the preset width, the preset width is determined as the width of the ROI of each frame of image.
  • the second preset value can be set according to actual applications, and can be set to any value between 0 and 1 in the pixel value distribution range of the saliency map; for example, it can be set to 0.7, but this embodiment is not limited to this.
  • In other words, the distribution of all points in the saliency map whose pixel values are greater than or equal to the second preset value is calculated, and for every two of these points the absolute values of the differences between their horizontal and vertical coordinates are computed to describe the range of the heat distribution; for example, these absolute values can be compared with the default ROI size: if they are larger than the default size of the ROI, the size of the ROI is expanded; if they are smaller than the default size of the ROI, the default size of the ROI can be reduced or used directly.
  • Specifically, the size of the ROI can be enlarged or reduced according to the ratio of the coordinate spread to the default size of the ROI.
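  • A sketch of the ROI-size rule described above (NumPy assumed; 0.7 follows the example second preset value, and the default ROI of half the image width and height follows the 1/4-size example given earlier):

```python
import numpy as np

def roi_size(saliency, frame_w, frame_h, threshold=0.7):
    """Determine the ROI width and height from the spread of salient points.

    The default ROI covers 1/4 of the frame area (half the width, half the
    height). Points with saliency >= threshold are collected; if their
    horizontal/vertical spread exceeds the default size, the ROI is enlarged
    accordingly, otherwise the default size is kept (the text also allows
    shrinking it toward the spread instead).
    """
    default_w, default_h = frame_w // 2, frame_h // 2
    ys, xs = np.nonzero(saliency >= threshold)
    if len(xs) == 0:
        return default_w, default_h

    spread_x = int(xs.max() - xs.min())   # largest abscissa difference
    spread_y = int(ys.max() - ys.min())   # largest ordinate difference

    # Expanding the preset length by the ratio spread/preset amounts to using
    # the spread itself whenever it is larger than the preset value.
    roi_w = max(default_w, spread_x)
    roi_h = max(default_h, spread_y)
    return min(roi_w, frame_w), min(roi_h, frame_h)
```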
  • The image processing method of the embodiment of the present application considers that, in order to solve the bandwidth problem, the area outside the ROI is blurred through filtering to reduce the high-frequency information of the image, increase the compression rate, and ultimately reduce the bandwidth; therefore, the determination of the ROI position is particularly important for the image processing process.
  • Generally, an eye-tracking device and algorithm are used to detect and provide the ROI position, but this causes a large delay, making the actual gaze position of the human eye and the obtained ROI position misaligned, so that a high-quality video image cannot be observed.
  • a visual attention prediction model based on deep learning is used instead of eye tracking equipment.
  • The embodiments of the present application are based on deep learning: using a calibrated large-scale video saliency data set and a network model combining CNN and RNN, intra-frame spatial information and inter-frame motion information (temporal information) are extracted respectively to obtain the spatio-temporal characteristics of the video sequence, enabling end-to-end video saliency prediction.
  • It should be understood that the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the image processing apparatus 600 includes: a first processing module 610, a second processing module 620, and a determining module 630.
  • the first processing module 610 is configured to: use a CNN structure to process each frame of the multi-frame image to obtain a multi-channel feature map of each frame of the image;
  • the second processing module 620 is configured to: use the RNN structure to process the multi-channel feature maps of the multi-frame image to obtain the single-channel saliency map of each frame of image;
  • the determining module 630 is configured to: determine the ROI position of each frame of image according to the saliency map corresponding to that frame of image.
  • It should be understood that the image processing apparatus 600 may correspond to executing the method 300 in the embodiment of the present application, and the foregoing and other operations and/or functions of the various modules in the apparatus 600 are respectively used to implement the corresponding processes of the methods in FIGS. 1 to 13; for the sake of brevity, they are not repeated here.
  • Each embodiment of the present application may also be implemented based on a memory and a processor, where the memory is used to store instructions for executing the method of each embodiment of the present application, and the processor executes the foregoing instructions so that the device executes the method of each embodiment of the present application.
  • the image processing apparatus 700 includes: a processor 710 and a memory 720.
  • the processor 710 and the memory 720 are connected through a bus system.
  • the memory 720 is used to store instructions, and the processor 710 is used to execute instructions stored in the memory 720.
  • the processor 710 can call the program code stored in the memory 720 to perform the following operations: use a CNN structure to process each frame of the multi-frame image to obtain the multi-channel feature map of each frame of image; use the RNN structure to process the multi-channel feature maps of the multi-frame image to obtain the single-channel saliency map of each frame of image; and determine the ROI position of each frame of image according to the saliency map corresponding to that frame of image.
  • Optionally, the processor 710 is configured to: perform continuous convolution and/or pooling operations on each frame of image to obtain multiple spatial feature maps of each frame of image, where the multiple spatial feature maps have different resolutions; perform deconvolution and/or convolution operations on each of the multiple spatial feature maps to obtain multiple single-channel feature maps of each frame of image, where the multiple single-channel feature maps have the same resolution; and combine the multiple single-channel feature maps of each frame of image into the multi-channel feature map of that frame of image.
  • Optionally, the processor 710 is configured to: perform continuous convolution and pooling operations on each frame of image according to a preset network model structure, so as to obtain at least three spatial feature maps of each frame of image, where the at least three spatial feature maps have different resolutions.
  • Optionally, the preset network model structure is a VGG-16 structure, and the processor 710 is configured to: perform five groups of convolution-pooling operations on each frame of image according to the VGG-16 structure to obtain three spatial feature maps of each frame of image, where the five groups of convolution-pooling operations include 13 convolutional layers.
  • Optionally, the resolution of each frame of image is w × h, and the resolutions of the three spatial feature maps are (w/4) × (h/4), (w/8) × (h/8), and (w/16) × (h/16), respectively.
  • the processor 710 is configured to: perform a deconvolution operation on each of the spatial feature maps to obtain multiple feature maps with the same resolution; Each feature map is subjected to a convolution operation to obtain the multiple single-channel feature maps.
  • the resolution of each frame of image is w × h, and the resolutions of the multiple feature maps are all w × h.
  • the deconvolution step size in the deconvolution operation is set to 2.
  • the processor 710 is configured to: use a 1*1 convolutional layer for each feature map to obtain the multiple single-channel feature maps.
  • the RNN structure is an LSTM structure.
  • the processor 710 is configured to: sequentially input the multi-channel feature maps of the multi-frame images into the LSTM structure in chronological order to output the processed multi-channel feature map corresponding to each frame of image; a 1*1 convolution layer is then applied to the processed feature map to obtain the single-channel saliency map of each frame of image.
  • the processor 710 is configured to: at the t-th moment, input the multi-channel feature map of the t-th frame of image into the LSTM structure and, according to the cell state c_{t-1} and hidden state h_{t-1} output at the (t-1)-th moment, output the processed multi-channel feature map corresponding to the t-th frame of image as well as the cell state c_t and hidden state h_t at the t-th moment, where t is any positive integer.
  • the cyclic layer size of the LSTM structure is set to 10.
  • the processor 710 is configured to: determine the position of the ROI of each frame of image according to the pixel values of different positions in the saliency map corresponding to each frame of image, and the ROI The position includes the center coordinates and/or size of the ROI.
  • the processor 710 is configured to: determine the coordinates of the point with the largest pixel value in the saliency map corresponding to each frame of image as the center coordinates of the ROI of that frame of image.
  • the processor 710 is configured to: determine the coordinates of multiple points whose pixel values are greater than or equal to a first preset value in the saliency map corresponding to each frame of image, and determine the average value of the coordinates of these multiple points as the center coordinates of the ROI of that frame of image.
  • the processor 710 is configured to determine the size of the ROI of each frame of image according to the size of each frame of image.
  • the processor 710 is configured to: set the size of the ROI of each frame of image to 1/4 of the size of each frame of image.
  • the processor 710 is configured to: determine the coordinates of multiple points whose pixel values are greater than or equal to a second preset value in the saliency map corresponding to each frame of image; determine two points among the multiple points for which the absolute value of the abscissa difference and/or the absolute value of the ordinate difference is the largest; and determine the size of the ROI of each frame of image according to the absolute value of the abscissa difference and/or the absolute value of the ordinate difference of the two points.
  • the processor 710 is configured to perform at least one of the following steps: if the absolute value of the abscissa difference of the two points is greater than or equal to a preset length, according to the two points The ratio of the absolute value of the abscissa difference of a point to the preset length determines the length of the ROI of each frame of image; if the absolute value of the abscissa difference of the two points is less than the preset Length, the preset length is determined as the length of the ROI of each frame of the image; if the absolute value of the ordinate difference of the two points is greater than or equal to the preset width, according to the length of the two points The ratio of the absolute value of the ordinate difference to the preset width determines the width of the ROI of each frame of image; if the absolute value of the ordinate difference of the two points is less than the preset width, The preset width is determined as the width of the ROI of each frame of image.
  • It should be understood that the image processing apparatus 700 may correspond to the apparatus 600 in the embodiment of the present application and may correspondingly execute the method 300 in the embodiment of the present application; its other operations and/or functions are used to implement the corresponding procedures of the methods in FIG. 1 to FIG. 13, and are not repeated here for brevity.
  • the image processing device of the embodiment of the present application considers that in order to solve the bandwidth problem in the image processing process, the area outside the ROI will be blurred through filtering to reduce the high frequency information of the image, increase the compression rate, and finally reduce the bandwidth. Therefore, the determination of the position of the ROI is particularly important for the image processing process.
  • a visual attention prediction model based on deep learning is adopted in the embodiments of this application, which can predict the region of interest of the human eye in real time according to the content of the video, reduce system delay, improve the real-time performance and practicability of the system, and improve the portability between the platforms in the system.
  • The embodiments of the present application are based on deep learning: using a calibrated large-scale video saliency data set and a network model combining CNN and RNN, intra-frame spatial information and inter-frame motion information (temporal information) are extracted respectively to obtain the spatio-temporal characteristics of the video sequence, enabling end-to-end video saliency prediction.
  • processors mentioned in the embodiments of the present application may be a central processing unit (Central Processing Unit, CPU), or may also be other general-purpose processors, digital signal processors (Digital Signal Processors, DSPs), and application-specific integrated circuits ( Application Specific Integrated Circuit (ASIC), Field Programmable Gate Array (FPGA) or other programmable logic devices, discrete gates or transistor logic devices, discrete hardware components, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory mentioned in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory.
  • the non-volatile memory can be read-only memory (Read-Only Memory, ROM), programmable read-only memory (Programmable ROM, PROM), erasable programmable read-only memory (Erasable PROM, EPROM), and electrically available Erase programmable read-only memory (Electrically EPROM, EEPROM) or flash memory.
  • the volatile memory may be a random access memory (Random Access Memory, RAM), which is used as an external cache.
  • Many forms of RAM are available, such as static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), synchronous link dynamic random access memory (SLDRAM), and direct Rambus random access memory (DR RAM).
  • It should be noted that when the processor is a general-purpose processor, DSP, ASIC, FPGA or other programmable logic device, discrete gate or transistor logic device, or discrete hardware component, the memory (storage module) may be integrated in the processor.
  • the embodiments of the present application also provide a computer-readable storage medium on which instructions are stored, and when the instructions are run on a computer, the computer executes the methods of the foregoing method embodiments.
  • An embodiment of the present application also provides a computing device, which includes the above-mentioned computer-readable storage medium.
  • the embodiments of this application can be applied to aircraft, especially in the field of unmanned aerial vehicles.
  • FIG. 16 shows a schematic block diagram of a movable platform 800 according to an embodiment of the present application.
  • the movable platform 800 includes: a body 810; a power system 820, which is provided in the body 810, and is used to provide power to the movable platform 800; and one or more processors 830 are used to execute this Apply the method 300 of the embodiment.
  • the processor 830 may include the image processing device 600 of the embodiment of the present application; optionally, the movable platform 800 may also include an image acquisition device for acquiring images, so that the processor 830 can process the acquired images, for example, perform any of the above-mentioned image processing methods on the captured images.
  • The movable platform 800 in the embodiment of the present invention may refer to any movable device that can move in any suitable environment, for example, in the air (for example, a fixed-wing aircraft, a rotary-wing aircraft, or an aircraft with neither fixed wings nor rotors), underwater (for example, a ship or a submarine), on land (for example, a car or a train), or in space (for example, a space plane, a satellite, or a probe), as well as any combination of the above environments.
  • the movable device may be an airplane, such as a drone (Unmanned Aerial Vehicle, referred to as "UAV" for short).
  • the body 810 may also be referred to as a body.
  • the body may include a center frame and one or more arms connected to the center frame, and the one or more arms extend radially from the center frame.
  • the tripod is connected to the fuselage and is used to support the UAV during landing.
  • the power system 820 may include an electronic governor (referred to as an ESC for short), one or more propellers, and one or more motors corresponding to the one or more propellers, where the motors are connected between the electronic governor and the propellers, The motor and the propeller are arranged on the corresponding arms; the electronic speed governor is used to receive the driving signal generated by the flight controller, and provide a driving current to the motor according to the driving signal to control the rotation speed of the motor.
  • the motor is used to drive the propeller to rotate, thereby providing power for the flight of the UAV, which enables the UAV to achieve one or more degrees of freedom of movement.
  • the motor may be a DC motor or an AC motor.
  • the motor can be a brushless motor or a brush motor.
  • the image acquisition device includes a photographing device (for example, a camera, a video camera, etc.) or a visual sensor (for example, a monocular camera or a dual/multi-lens camera, etc.).
  • the embodiment of the present application also proposes an unmanned flying system including an unmanned aerial vehicle.
  • the unmanned aerial system 900 including the drone will be described below in conjunction with FIG. 17.
  • a rotorcraft is taken as an example for description.
  • the unmanned aerial system 900 may include a UAV 910, a carrier 920, a display device 930, and a remote control device 940.
  • UAV 910 may include a power system 950, a flight control system 960, and a frame 970.
  • the UAV 910 can wirelessly communicate with the remote control device 940 and the display device 930.
  • the frame 970 may include a fuselage and a tripod (also referred to as a landing gear).
  • the fuselage may include a center frame and one or more arms connected to the center frame, and the one or more arms extend radially from the center frame.
  • the tripod is connected to the fuselage to support the UAV 910 when it is landing.
  • the power system 950 may include an electronic speed governor (referred to as an ESC) 951, one or more propellers 953, and one or more motors 952 corresponding to the one or more propellers 953, where each motor 952 is connected between the ESC 951 and the corresponding propeller 953, and the motor 952 and the propeller 953 are arranged on the corresponding arm; the ESC 951 is used to receive the driving signal generated by the flight controller 960 and to provide a driving current to the motor 952 according to the driving signal, so as to control the speed of the motor 952.
  • the motor 952 is used to drive the propeller to rotate, so as to provide power for the flight of the UAV 910.
  • the power enables the UAV 910 to realize one or more degrees of freedom of movement.
  • the motor 952 may be a DC motor or an AC motor.
  • the motor 952 may be a brushless motor or a brush motor.
  • the flight control system 960 may include a flight controller 961 and a sensing system 962.
  • the sensing system 962 is used to measure the attitude information of the UAV.
  • the sensing system 962 may include, for example, at least one of sensors such as a gyroscope, an electronic compass, an inertial measurement unit (IMU), a visual sensor (for example, a monocular camera or a dual/multi-lens camera), a global positioning system (GPS), a barometer, and a visual-inertial odometer.
  • the flight controller 961 is used to control the flight of the UAV 910. For example, it can control the flight of the UAV 910 according to the attitude information measured by the sensor system 962.
  • the carrier 920 can be used to carry a load 980.
  • the load 980 may be a photographing device (for example, a camera, a video camera, etc.).
  • the embodiment of the present application is not limited thereto.
  • the carrier may also be a carrying device used to carry weapons or other loads.
  • the display device 930 is located on the ground end of the unmanned aerial system 900, can communicate with the UAV 910 in a wireless manner, and can be used to display the attitude information of the UAV 910.
  • when the load 980 is a photographing device, the image photographed by the photographing device may also be displayed on the display device 930.
  • the display device 930 may be an independent device, or may be provided in the remote control device 940.
  • the above-mentioned receiving and decoding module 260 may be installed on a display device, and the display device is configured to display the image after the blur and sharpening process.
  • the remote control device 940 is located on the ground end of the unmanned aerial system 900, and can communicate with the UAV 910 in a wireless manner for remote control of the UAV 910.
  • the remote control device may be, for example, a remote control or a remote control device installed with an APP (Application) for controlling the UAV, such as a smart phone, a tablet computer, and the like.
  • receiving user input through a remote control device may refer to controlling the UAV through an input device such as a dial, buttons, keys, and a joystick on the remote control, or a user interface (UI) on the remote control device.
  • the embodiments of the present invention can be applied to other vehicles with cameras, such as virtual reality (VR)/augmented reality (AR) glasses and other devices.
  • the division of circuits, sub-circuits, and sub-units in each embodiment of the present application is only illustrative. A person of ordinary skill in the art may be aware that the circuits, sub-circuits, and sub-units of the examples described in the embodiments disclosed herein can be further divided or combined.
  • in the above-mentioned embodiments, the implementation may be in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented in software, it can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • Computer instructions can be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wire (such as coaxial cable, optical fiber, or digital subscriber line (DSL)) or wirelessly (such as infrared, radio, or microwave).
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrating one or more available media. The available media may be magnetic media (for example, floppy disks, hard disks, or magnetic tapes), optical media (for example, digital video discs (DVD)), or semiconductor media (for example, solid state disks (SSD)), etc.
  • each embodiment of the present application is described by taking a total bit width of 16 bits as an example, and each embodiment of the present application may be applicable to other bit widths.
  • the size of the sequence numbers of the above-mentioned processes does not mean the order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • B corresponding to A means that B is associated with A, and B can be determined according to A.
  • determining B based on A does not mean that B is determined only based on A, and B can also be determined based on A and/or other information.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative. For example, the division of the units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or they may be distributed on multiple network units. Some or all of the units can be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Evolutionary Computation (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Molecular Biology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Image Analysis (AREA)

Abstract

一种图像处理的方法、装置、可移动平台以及系统。该图像处理的方法,包括:采用CNN结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;采用RNN结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;根据所述每帧图像对应的显著性图,确定所述每帧图像的感兴趣区域位置。本申请实施例的图像处理的方法、装置、可移动平台以及系统,可以更加精确的确定ROI的位置,减少系统时延,提高系统实时性。

Description

图像处理的方法、装置、可移动平台以及系统
版权申明
本专利文件披露的内容包含受版权保护的材料。该版权为版权所有人所有。版权所有人不反对任何人复制专利与商标局的官方记录和档案中所存在的该专利文件或者该专利披露。
技术领域
本申请涉及图像处理领域,尤其涉及一种图像处理的方法、装置、可移动平台以及系统。
背景技术
在图传应用中,通常将拍摄到的图像和视频进行实时传输,这需要占用较大的带宽。为了减少对图传资源的占用,可以采用滤波的方式对图像进行虚化。例如,可以对图像中的感兴趣区域(region of interest,ROI)保持像素点的原值,对其他区域基于相同或不同的滤波半径,通过例如均值滤波或高斯滤波等方式来减少高频信息。
在上述方案中,如果不能及时并准确的确定ROI位置,则会使得地面端人眼无法观察到高质量图像,而是观察到虚化的图像。因此,如何精确地确定ROI位置是目前亟待解决的问题。
发明内容
本申请提供了一种图像处理的方法、装置、可移动平台以及系统,可以更加精确的确定ROI的位置,减少系统时延,提高系统实时性。
第一方面,提供了一种图像处理的方法,包括:采用卷积神经网络(CNN)结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;采用循环神经网络(RNN)结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;根据所述每帧图像对应的显著性图,确定所述每帧图像的感兴趣区域(ROI)位置。
第二方面,提供了一种图像处理的装置,用于执行上述第一方面或第 一方面的任意可能的实现方式中的方法。具体地,该装置包括用于执行上述第一方面或第一方面的任意可能的实现方式中的方法的单元。
第三方面,提供了一种图像处理的装置,包括:存储单元和处理器,该存储单元用于存储指令,该处理器用于执行该存储器存储的指令,并且当该处理器执行该存储器存储的指令时,该执行使得该处理器执行第一方面或第一方面的任意可能的实现方式中的方法。
第四方面,提供了一种计算机可读介质,用于存储计算机程序,该计算机程序包括用于执行第一方面或第一方面的任意可能的实现方式中的方法的指令。
第五方面，提供了一种包括指令的计算机程序产品，当计算机运行所述计算机程序产品的所述指令时，所述计算机执行上述第一方面或第一方面的任意可能的实现方式中的图像处理的方法。具体地，该计算机程序产品可以运行于上述第三方面的图像处理的装置上。
第六方面,提供了一种可移动平台,包括:机体;动力系统,设于所述机体内,用于为所述可移动平台提供动力;以及一个或者多个处理器,用于执行上述第一方面或第一方面的任意可能的实现方式中的图像处理的方法。
第七方面,提供了一种系统,包括上述第六方面的可移动平台和显示设备,所述可移动平台与所述显示设备有线连接或无线连接。
附图说明
图1是图像处理系统的示意性框图。
图2是本申请实施例中的图像中ROI位置的示意图。
图3是图1中图像处理系统中的数据流的示意图。
图4是本申请实施例的图像处理系统的示意性框图。
图5是本申请实施例的图像处理的方法的示意性流程图。
图6是本申请实施例的图像处理的方法流程的示意图。
图7是本申请实施例的采用CNN结构对图像进行处理的示意图。
图8-9是RNN结构的一般示意图。
图10是LSTM结构的一般示意图。
图11是本申请实施例的采用RNN结构对图像进行处理的示意图。
图12是本申请实施例的热度分布集中的显著图的示意图。
图13是本申请实施例的热度分布分散的显著图的示意图。
图14是本申请实施例的图像处理装置的示意性框图。
图15是本申请实施例的另一图像处理装置的示意性框图。
图16是本申请实施例的可移动平台的示意性框图。
图17是本申请实施例的无人飞行系统的示意图。
具体实施方式
下面将结合附图,对本申请实施例中的技术方案进行描述。
除非另有定义,本文所使用的所有的技术和科学术语与属于本申请的技术领域的技术人员通常理解的含义相同。本文中在本申请的说明书中所使用的术语只是为了描述具体的实施例的目的,不是旨在于限制本申请。
在图传应用中,通常将拍摄到的图像和视频进行实时传输,这需要占用较大的带宽。为了减少对图传资源的占用,可以采用滤波的方式对图像进行虚化。例如,可以对图像中的感兴趣区域(region of interest,ROI)保持像素点的原值,对其他区域基于相同或不同的滤波半径,通过例如均值滤波或高斯滤波等方式来减少高频信息。
在上述方案中,通常采用眼球追踪的设备和方法,实时确定每一帧ROI的中心位置,从而根据ROI的位置和大小,在ROI区域外采用渐进的滤波方案,实现了平滑的虚化效果以及极大的带宽压缩。具体地,图1示出了图像处理系统100的示意性框图。如图1所示,该系统100可以包括:镜头采集模块110、图像信号处理(Image Signal Processing,ISP)模块120、虚化(Blur)锐化(Sharpen)模块130、编码与传输模块140、接收与解码模块150以及眼球追踪模块160。
如图1所示,镜头采集模块110采集到的图像首先经过ISP模块120进行处理。具体地,镜头采集模块110采集到的图像可以经感光元件转换为电信号后,被传至ISP模块120进行处理,从而转化为数字图像。其中,镜头采集模块110可以指摄像头,对应的,镜头采集模块220采集的图像可以是指摄像头通过拍照获取到的一帧或者多帧图像,也可以是指通过录像获取到的视频图像,本申请对此不作限定。
ISP模块120输出的图像信号结合ROI的位置信息,输入到虚化锐化模块130中,以对视频图像中不同区域进行相应处理。具体地,机器视觉、图 像处理中,被处理的图像以方框、圆、椭圆、不规则多边形等方式勾勒出需要处理的区域,称为ROI。该ISP模块120可以指处理器,或者处理电路,本申请实施例并不限于此。
虚化锐化模块130根据ROI的位置,结合带宽需求,在ROI内外分别进行sharpen以及blur的操作,在提升ROI内图像视频质量的同时,极大的减小图传时的带宽占用。例如,图2为图像中ROI位置的示意图。如图2所示,ROI内部为Sharpen区域,外侧为Blur区域,从内到外依次分为多层的原因是为了实现渐进虚化的效果。也就是说,ROI区域内进行Sharpen操作,图像细节得到增强,画质提高;而ROI外侧由内向外逐渐模糊,可以达到平滑地虚化效果。
虚化锐化模块130将处理后的图像发送至编码与传输模块140进行编码,并从无人机端传输到地面端;在地面端,接收与解码模块150接收编码数据进行处理后,显示给用户,使得用户可以观察到局部高质量的图像,同时,眼球追踪模块160会根据人眼的移动得到其关注区域,即ROI,使得ROI位置信息传输到虚化锐化模块130。其中,编码与传输模块140可以包括编码器,还可以包括发射器;接收与解码模块150可以包括解码器,还可以包括接收器,但本申请实施例并不限于此。
在如图1所示的系统100中,由于虚化锐化模块130需要根据ROI位置来确定算法应用范围,系统100中ROI位置由眼球追踪模块160包括的设备和算法提供。具体地,图3为系统100中数据流的示意图,由图3所示的数据流可知,系统100中存在多处时延,例如,图传往返延时(即D1包括的编码、发送、解码以及上行反馈过程)、眼球追踪算法延时(即D2和D3)以及Blur和Sharpen处理算法延时(即D4)等。其中,由于眼球追踪模块160在地面端,无人机与地面端往返传输距离较远,传输时间长,使得与其相关的时延D1、D2和D3较大,也就导致系统100的延时很高。并且,由于上述延时的存在,还可能会造成ROI位置的错位,使得地面端人眼无法观察到高质量图像,而是观察到虚化的图像。
针对上述问题,本申请提供了一种用于图像处理的方法和系统,基于深度学习的视觉注意力预测方案,准确预测ROI的中心位置,从而避免上述延时过程,提高整个系统的实时性;同时,可以省去眼球追踪设备,提高实用性和可移植性。
图4示出了本申请实施例的图像处理系统200的示意性框图。如图4所示,该系统200包括镜头采集模块210、ISP模块220、视觉显著性预测模块230、虚化(Blur)锐化(Sharpen)模块240、编码与传输模块250以及接收与解码模块260。
与图1所示的系统100类似,如图4所示,首先由系统200中的镜头采集模块210将采集到的图像经过ISP模块220进行处理。ISP模块220处理后输出的图像信号再结合ROI的位置信息,输入到虚化锐化模块240中,以对视频图像中不同区域进行相应处理。
但是与图1所示的系统100不同的是,系统200中的ROI位置信息由视觉显著性预测模块230基于深度学习的结果进行确定,即由视觉显著性预测模块230输出ROI的位置,例如输出ROI中心位置,以作为虚化锐化模块240的参数。之后,与图1所示的系统100类似,经过编码与传输模块250的编码处理,然后传输到地面端;再由接收与解码模块260处理后,显示给用户。
应理解,镜头采集模块210可以适用于镜头采集模块110的相关描述,例如,该镜头采集模块210可以为摄像头;ISP模块220可以适用于ISP模块120的相关描述,例如,该ISP模块220可以为ISP模块处理电路;虚化锐化模块240可以适用于虚化锐化模块130的相关描述;编码与传输模块250可以适用于编码与传输模块140的相关描述,接收与解码模块260可以适用于接收与解码模块150的相关描述,例如,编码与传输模块250可以包括编码器和发射器,接收与解码模块260可以包括接收器和解码器,为了简洁,在此不再赘述。
因此,本申请实施例的系统200,基于深度学习,由视频显著性预测模块确定ROI,无需人工设计特征和进行复杂计算,而可以端到端实时预测视频的视觉注意力区域,避免眼球追踪方案造成的延时问题。在一实施例中,视频显著性预测模块为处理电路。
下面对视觉显著性预测模块230确定ROI的过程进行详细描述。
图5示出了本申请实施例的图像处理的方法300的示意性流程图。可选地,方法300可以由图像处理装置执行,例如,该图像处理装置可以为上述视觉显著性预测模块230,但本申请实施例并不限于此。
如图5所示,该方法300包括:S310,采用卷积神经网络(Convolutional  Neural Network,CNN)结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图(feature-map);S320,采用循环神经网络(recurrent neural network,RNN)结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图(saliency-map);S330,根据所述每帧图像对应的显著性图,确定所述每帧图像的ROI位置。
深度学习起源于对神经网络的研究。1943年,为了理解大脑工作原理进而设计人工智能,一种简化版大脑细胞的概念——神经元被首次提出。此后,神经网络在此基础上提出并得到快速发展。随着卷积神经网络(Convolutional Neural Network,CNN)等多种网络结构的提出,极大的减少了网络参数量,训练速度以及精度得到明显提高,因此,神经网络在图像领域得到快速发展和广泛应用。
例如,人类能够迅速地选取视野中的关键部分,选择性地将视觉处理资源分配给这些视觉显著的区域,对应的,在计算机视觉领域,理解和模拟人类视觉系统的这种注意力机制,得到了学术界的大力关注,并显示出了广阔的应用前景。近年来,随着计算能力的增强以及大规模显著性检测数据集的建立,深度学习技术逐渐成为视觉注意力机制计算和建模的主要手段。
在显著性检测领域中，有很多工作研究了如何模拟人类在观看图像时的视觉注意力机制。但是当前基于深度学习的视觉显著性预测方案主要集中在静态的图像领域，对视频序列的研究较少的主要原因是，视频序列的帧间运动信息的提取，需要进行特征的设计以及计算量较大，使得视频显著性预测的进展缓慢。也就是说，目前关于动态场景下人类如何分配视觉注意力的研究相对较少，但动态视觉注意力机制在人类日常行为中却更为普遍且更为重要。
动态人眼关注点检测的研究工作使用到的模型,通过将静态显著性特征和时间域信息(如光流场、时域差分等)相结合,检测动态场景下的视觉注意力,其中大部分工作都可被看作是已有静态显著性模型的基础上考虑运动信息后的扩展。这些模型严重依赖于特征工程,因而模型的性能受到了手工设计特征的限制。
考虑到不同网络结构具有的不同特性,本申请实施例的图像处理的方法,将CNN与RNN结合使用。具体地,CNN的主要功能特点是,对高维 特征的提取能力很强,广泛应用在图像视觉领域,如目标检测、人脸识别等实际应用中,并产生了极大的成功。相对于传统的这些检测算法,CNN等深度学习算法不需要去人为的选择特征,而是通过学习训练网络的方式去提取这些特征,然后将这些提取的特征去产生后面的决策结果,从而实现分类、识别等功能。
RNN的主要特点是,可以挖掘数据中的时序信息以及语义信息的深度表达能力,并在语音识别、语言模型、机器翻译以及时序分析等方面实现了突破。也就是说,循环神经网络用于处理和预测序列数据。
因此,在本申请实施例的图像处理的方法中,利用CNN对图像的强大特征提取能力,配合RNN对时间序列的处理能力,可以对采集到的视频序列进行处理,以得到每一帧图像的显著性区域,进而确定ROI的位置。这样可以避免现有系统中地面端的眼球追踪模块以及图传往返过程等造成的延时,极大的提升原系统的实时性;提高系统的实用性以及可移植性。
下面将结合图6-11对本申请实施例的方法300进行详细描述。
图6示出了图像处理的方法300的流程的示意图。如图6所示,首先获取待处理的多帧图像,该多帧图像可以指任意待处理视频数据中的多帧图像。具体地,该待处理的视频数据可以指图4所示的系统200中ISP模块220输出的数据,例如,可以为ISP模块220输出的yuv格式的视频序列。例如,该待处理的视频数据可以包括T帧图像,即图6中的第1帧图像至第T帧图像,其中,第t帧图像表示其中任意一帧图像。
待处理的每帧图像分别输入CNN结构中，采用CNN结构对每帧图像进行处理，对应输出每帧图像的多通道特征图；该多通道的特征图输入RNN结构，例如，图6中以采用长短时记忆网络(Long short-term memory,LSTM)结构为例，对应输出每帧图像的多通道特征图；经过合并，最后获得每帧图像的显著性图，由该显著性图则可以获得每帧图像中的ROI区域的位置。
下面首先描述CNN结构的前向传播过程:即在S310中,采用CNN结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图。具体地,该S310可以包括:S311,对所述每帧图像进行连续的卷积和/或池化操作,以获取所述每帧图像的多张空间特征图,所述多张空间特征图具有不同分辨率;S312,对所述多张空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取所述每帧图像的多张单通道特征图,所述多张单 通道特征图具有相同分辨率;S313,将所述每帧图像的所述多张单通道特征图组合为所述每帧图像的多通道特征图。
在本申请实施例中,图7示出了采用CNN结构对图像进行处理的示意图。如图7所示,在S311中,对于每一帧视频图像(例如图7中的图像400),经过连续的卷积和池化操作(例如图7所示的操作410-450),以提取每一帧图像的空间特征图(例如图7中经过操作430-450之后分别获得的图像),其中,所述多张空间特征图具有不同分辨率。
具体地,在该CNN结构的卷积池化处理部分,可以根据实际应用的需要,合理选择神经网络结构,例如,本申请以使用预训练的VGG-16神经网络的卷积层部分的网络结构为例进行说明,VGG-16网络可以很好地提取每一帧图像的空间特征,但本申请并不限于此。例如,除了采用VGG-16网络的卷积部分以外,根据具体的问题,也可以选择其他同级别的深层卷积神经网络的卷积块来代替该部分,如ResNet、GoogLeNet等。
VGG-Net是由牛津大学计算机视觉组(Visual Geometry Group)等人研发出了新的深度卷积神经网络。VGG-Net探索了卷积神经网络的深度与其性能之间的关系,成功地构筑了16~19层深的卷积神经网络,证明了增加网络的深度能够在一定程度上影响网络最终的性能,使错误率大幅下降,同时拓展性又很强,迁移到其它图片数据上的泛化性也非常好。目前,VGG被广泛用来提取图像特征。
在本申请实施例中,如图7所示,使用预训练的VGG-16神经网络的卷积层部分的网络结构,在该CNN结构的卷积池化处理部分,可以包括5组共13层卷积,以获取每帧图像的多张空间特征图,例如本申请实施例中以获得每帧图像的三张空间特征图为例,该三张空间特征图具有不同分辨率。
具体地,以任意一帧输入的图像400为例,假设该图像400的分辨率为w×h,并且该图像400可以具有多通道,例如,该图像400可以为图4所示的ISP模块220输出的yuv格式的视频序列中任意一帧图像,具有三通道。该图像400经过一系列的卷积和/或池化操作,例如,如图7所示,可以依次经过操作410-450。其中,操作410包括两次卷积操作,分辨率仍然为w×h;操作420包括两次卷积和一次池化,其中的池化操作可以使得分辨率降为
w/2×h/2；
操作430包括三次卷积和一次池化,其中的池化操作可以使得分辨 率降为
w/4×h/4，
经过该操作430之后输出一张空间特征图;将经过操作430的图像再经过操作440,该操作440包括三次卷积和一次池化,其中的池化操作可以使得分辨率降为
w/8×h/8，
经过该操作440之后再输出一张空间特征图;将经过操作440的图像再经过操作450,该操作450包括三次卷积和一次池化,其中的池化操作可以使得分辨率降为
w/16×h/16，
经过该操作450之后输出最后一张空间特征图。经过操作410-450可以输出三张分辨率不同的空间特征图,其通道数)分别为256、512和512;另外,从输入图像400开始,经过的5组卷积操作(410-450)的feature-map的通道数分别为:64、128、256、512和512。
应理解,这里以获得三张分辨率分别为
w/4×h/4、w/8×h/8和w/16×h/16
的特征图为例,但本申请实施例并不限于此,本申请实施例中获取的空间特征图的数量和分辨率可以根据实际应用进行设置,例如,可以选择其他分辨率的特征图,或者也可以获取更多或者更少数量的空间特征图。
在S312中,对S311之后获得的多张分辨率不同的空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取每帧图像的多张单通道特征图,该多张单通道特征图具有相同分辨率。具体地,可以对获取到的多张空间特征图中每张空间特征图进行反卷积操作,以获得分辨率相同的多张特征图;对该多张特征图中每张特征图进行卷积操作,以获得多张单通道特征图(例如经过图7中的操作460-480之后分别获取到的图像)。
应理解,经过S311中包括的5组卷积(5个卷积块)的输出五张张空间特征图的分辨率大小,分别为输入视频图像400的分辨率大小的1、1/2、1/4、1/8和1/16倍。为了得到像素级(pixel-wise)分辨率大小的salency map,需要对由S311获得的图像feature-map进行上采样,设置反卷积层是为了上采样提高分辨率。
具体地,这里以在S311中描述得获得的三张分辨率不同的空间特征图为例,则对应需要设置三个反卷积模块(对应图7中操作460-480),以分别对S311中后三组卷积模块最后分别输出的三张空间特征图进行上采样。这 里选择后三组卷积模块的原因是,对多层较高级别的特征进行融合,能够综合得到更丰富的空间特征,从而提升最终显著性预测的准确率。
本申请实施例中以将反卷积步长设置为2为例,这就意味着每层反卷积可以将分辨率扩大为原来的2倍。由于S311输出的三张空间特征图的分辨率大小分别为
w/4×h/4、w/8×h/8和w/16×h/16，
因此三张后接的反卷积模块分别包含2、3、4个反卷积层,以得到分辨率为w×h的多张特征图。同时,由于该分辨率相同的w×h的多张特征图中每张特征图此时通常仍然为多通道特征图,因此,在每个反卷积模块最后接一层1x1的卷积层,将该多张特征图中每张特征图进行融合,输出多张单通道特征图,可以大大降低后续模块中的数据量和计算量。
具体地,在操作460中,将经过操作430后输出的分辨率为
w/4×h/4
的特征图进行两次反卷积,最后接一层1x1的卷积层,以获得分辨率为w×h的单通道特征图,其中,操作460包括的2层反卷积层和1层1x1卷积,对应输出的特征图的通道数分别为:64、32和1。在操作470中,将经过操作440后输出的分辨率为
w/8×h/8
的特征图进行三次反卷积,最后接一层1x1的卷积层,以获得另一张分辨率为w×h的单通道特征图,其中,操作470包括的3层反卷积层和1层1x1卷积,对应输出的特征图的通道数分别为:128、64、32和1。在操作480中,将经过操作450后输出的分辨率为
w/16×h/16
的特征图进行四次反卷积,最后接一层1x1的卷积层,以获得再一张分辨率为w×h的单通道特征图,其中,操作480包括的4层反卷积层和1层1x1卷积,对应输出的特征图的通道数分别为:256、128、64、32和1。
应理解,图7中仅以获得三张单通道特征图为例进行说明,但本申请实施例并不限于此,例如,可以根据实际应用设置更多或者更少数量的单通道特征图。
在S313中,将经过S311和S312之后获得的每帧图像的多张单通道特征图组合为该每帧图像的多通道特征图(例如图7中的操作490)。例如,如图7所示,对于经过操作460-480之后获得三张单通道特征图,可以组合成三通道特征图。
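As a rough, non-limiting illustration of S311–S313, the following Python sketch (written against PyTorch, which is an assumption of this example and not part of the disclosure) builds five VGG-16-style convolution groups (13 convolution layers in total), takes the feature maps output at w/4×h/4, w/8×h/8 and w/16×h/16, upsamples each back to w×h with stride-2 transposed convolutions followed by a 1×1 convolution, and concatenates the three single-channel maps into the three-channel feature map passed to the RNN stage. The channel widths follow the figures quoted above; kernel sizes, padding and activation choices are illustrative only.

    import torch
    import torch.nn as nn

    def conv_block(in_ch, out_ch, n_convs, pool):
        """n_convs 3x3 conv+ReLU layers, optionally followed by 2x2 max pooling."""
        layers, c = [], in_ch
        for _ in range(n_convs):
            layers += [nn.Conv2d(c, out_ch, 3, padding=1), nn.ReLU(inplace=True)]
            c = out_ch
        if pool:
            layers.append(nn.MaxPool2d(2))
        return nn.Sequential(*layers)

    def up_branch(in_ch, widths):
        """Stride-2 transposed convolutions back to w x h, then a 1x1 conv to 1 channel."""
        layers, c = [], in_ch
        for w_out in widths:
            layers += [nn.ConvTranspose2d(c, w_out, 4, stride=2, padding=1), nn.ReLU(inplace=True)]
            c = w_out
        layers.append(nn.Conv2d(c, 1, 1))
        return nn.Sequential(*layers)

    class SaliencyEncoder(nn.Module):
        """S311-S313: five VGG-16-style conv groups, three side branches at
        w/4, w/8 and w/16, merged into a 3-channel map at w x h."""
        def __init__(self):
            super().__init__()
            self.b1 = conv_block(3,   64, 2, pool=False)   # w x h        (op 410)
            self.b2 = conv_block(64, 128, 2, pool=True)    # w/2 x h/2    (op 420)
            self.b3 = conv_block(128, 256, 3, pool=True)   # w/4 x h/4    (op 430)
            self.b4 = conv_block(256, 512, 3, pool=True)   # w/8 x h/8    (op 440)
            self.b5 = conv_block(512, 512, 3, pool=True)   # w/16 x h/16  (op 450)
            self.u3 = up_branch(256, [64, 32])             # 2 deconvs    (op 460)
            self.u4 = up_branch(512, [128, 64, 32])        # 3 deconvs    (op 470)
            self.u5 = up_branch(512, [256, 128, 64, 32])   # 4 deconvs    (op 480)

        def forward(self, frame):                          # frame: (N, 3, h, w)
            f3 = self.b3(self.b2(self.b1(frame)))
            f4 = self.b4(f3)
            f5 = self.b5(f4)
            # op 490: concatenate the three single-channel maps into one 3-channel map
            return torch.cat([self.u3(f3), self.u4(f4), self.u5(f5)], dim=1)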
应理解,根据上述S311-S313,将获得的每帧图像的多通道特征图作为接下来的RNN结构的输入。例如,如图7所示,最后的操作490将经过3个反卷积模块的输出组合在一起,得到三通道特征图,该三通道特征图既是CNN结构的输出,也是接下来RNN结构在每一个时刻的输入。具体地,在S320中,采用RNN结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图。
下面结合图8和图9对RNN的一般形式进行描述。如图8所示,对于一般的RNN结构,x t为当前t时刻的输入,h t-1表示接收到的上一个节点的状态输入;y t为当前时刻的输出,h t为传递到下一个节点的状态输出。当输入为一个序列时,例如输入视频的图像序列,那么可以得到RNN的如图9所示的展开形式。
考虑到为了解决长序列训练中的梯度消失和梯度爆炸的问题,本申请实施例中使用的RNN结构以LSTM为例,但本申请实施例并不限于此。LSTM是一种特殊的RNN,简单来说,就是相比普通的RNN,LSTM能够在更长的序列中有更好的表现。
LSTM的一般形式如图10所示,相比于RNN只有一个传递状态h t,LSTM有两个传输状态,一个是细胞状态(cell state)c t,一个是隐藏状态(hidden state)h t。具体地,LSTM包含三个门控信号:输入门、遗忘门、输出门。其中,输入门会根据x t和h t-1决定哪些信息加入到状态c t-1中,以生成新的状态c t;遗忘门的作用是让循环神经网络忘记之前c t-1中没有用的信息;输出门会根据最新的状态c t、上一时刻的输出h t-1和当前的输入x t来决定该时刻的输出h t
LSTM前向传播的过程及各个门控信号的公式定义如下:
z = tanh(W_z · [h_{t-1}, x_t]) …… 输入值
i = sigmoid(W_i · [h_{t-1}, x_t]) …… 输入门
f = sigmoid(W_f · [h_{t-1}, x_t]) …… 遗忘门
o = sigmoid(W_o · [h_{t-1}, x_t]) …… 输出门
c_t = f · c_{t-1} + i · z …… 新状态
h_t = o · tanh(c_t) …… 输出
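For reference, the gate formulas above can be exercised directly with a short NumPy sketch (a minimal illustration only; the vector shapes and the sigmoid helper are assumptions, and bias terms are omitted because the formulas as written omit them):

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def lstm_step(x_t, h_prev, c_prev, W_z, W_i, W_f, W_o):
        # each W_* acts on the concatenation [h_{t-1}, x_t]
        hx = np.concatenate([h_prev, x_t])
        z = np.tanh(W_z @ hx)        # 输入值 (candidate value)
        i = sigmoid(W_i @ hx)        # 输入门 (input gate)
        f = sigmoid(W_f @ hx)        # 遗忘门 (forget gate)
        o = sigmoid(W_o @ hx)        # 输出门 (output gate)
        c_t = f * c_prev + i * z     # 新状态 (new cell state)
        h_t = o * np.tanh(c_t)       # 输出 (hidden state / output)
        return h_t, c_t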
在本申请实施例中,如图6所示,对于视频中的每一帧图像,输入到 CNN结构后得到多通道特征图;此时,将每帧图像对应的多通道特征图依次输入到LSTM的对应时刻输入,则得到多通道的输出,最后接一个1x1的卷积层,得到最终每一帧图像的单通道的显著图。其中,以第t个时刻的第t帧图像为例,该t可以为任意正整数,第t帧图像经过CNN结构后的到多通道特征图,再在第t个时刻,将该第t帧图像的多通道特征图输入至LSTM结构,并结合第t-1个时刻输出的细胞状态c t-1和隐藏状态h t-1,输出该第t帧图像对应的多通道的处理后的特征图,另外,也输出第t个时刻的细胞状态c t和隐藏状态h t
具体地,如图11所示,以图7中操作490输出的与图像400对应的三通道特征图为例,该三通道特征图为图11中的输入500,将每帧图像对于的三通道特征图500依次输入到LSTM对应时刻,经过LSTM结构后,输出三通道特征图510,分辨率仍然为w×h,最后接一个1x1的卷积层,最终得到每一帧图像的单通道的显著图520。
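The temporal stage just described can be sketched as follows (Python/PyTorch again as an assumption). The text does not state how the LSTM is applied to a w×h×3 feature map; this sketch assumes a convolutional LSTM cell whose gates are 3×3 convolutions, keeps the hidden state at three channels so that the LSTM output is also a three-channel w×h map, and applies the final 1×1 convolution (with a sigmoid, another assumption, to keep values in 0–1) to obtain one single-channel saliency map per frame:

    import torch
    import torch.nn as nn

    class ConvLSTMCell(nn.Module):
        """Convolutional LSTM cell; gates follow the formulas given earlier."""
        def __init__(self, in_ch, hidden_ch, k=3):
            super().__init__()
            self.hidden_ch = hidden_ch
            self.gates = nn.Conv2d(in_ch + hidden_ch, 4 * hidden_ch, k, padding=k // 2)

        def forward(self, x, h, c):
            z, i, f, o = torch.chunk(self.gates(torch.cat([x, h], dim=1)), 4, dim=1)
            c = torch.sigmoid(f) * c + torch.sigmoid(i) * torch.tanh(z)
            h = torch.sigmoid(o) * torch.tanh(c)
            return h, c

    class TemporalSaliencyHead(nn.Module):
        """S320: feed each frame's 3-channel map to the LSTM in time order,
        then a 1x1 convolution yields one single-channel saliency map per frame."""
        def __init__(self, in_ch=3, hidden_ch=3):
            super().__init__()
            self.cell = ConvLSTMCell(in_ch, hidden_ch)
            self.to_saliency = nn.Conv2d(hidden_ch, 1, kernel_size=1)

        def forward(self, seq):                       # seq: (T, N, 3, h, w)
            T, N, _, H, W = seq.shape
            h = seq.new_zeros(N, self.cell.hidden_ch, H, W)
            c = torch.zeros_like(h)
            maps = []
            for t in range(T):                        # e.g. sequence length 10
                h, c = self.cell(seq[t], h, c)
                maps.append(torch.sigmoid(self.to_saliency(h)))
            return torch.stack(maps)                  # (T, N, 1, h, w)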
应理解,本申请实施例中的LSTM的循环层的大小(size)可以根据实际应用进行设置,并且可以设置为任意值。例如,可以将LSTM的循环层的大小设置为10,或者训练数据的序列长度为10。即在训练时,每次迭代输入连续10帧图像,先经过CNN结构提取空间特征,LSTM提取时间特征,最终综合时空特征生成视频序列的显著性图。
可选地,如果本申请实施例中的视觉注意力预测模块处理的是YUV格式的数据,那么训练数据应当选取YUV格式的视频显著性标定数据集。例如,使用的训练数据可以是西蒙弗雷泽大学(Simon Fraser University,SFU)人眼追踪公共视频数据集。该数据集为标定的人眼显著性视频数据集,均为YUV格式。其中,训练集、验证集、测试集可以按照8:1:1的比例划分。
应理解,本申请实施例中以YUV格式的数据为例进行描述,该YUV格式可以包括YUV444、YUV422、YUV420等格式。其中,因为YUV422、YUV420两种数据格式下,UV分量存在着下采样的操作,导致数据在每个通道的分辨率不一致。此时,可以对UV两通道进行上采样操作,使得YUV三通道的分辨率相同。例如,该上采样方法可以选择双线性差值法,经过上采样过程,三通道保持同一分辨率,作为视觉注意力预测模块网络的输入。或者,也可以通过对Y通道进行下采样,以将三通道统一到UV的分辨率,从而解决YUV三通道分辨率不同的问题。
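A minimal sketch of the chroma up-sampling just described, assuming Python with OpenCV and NumPy (the function name and the YUV420 plane shapes are illustrative only):

    import cv2
    import numpy as np

    def yuv420_to_full_resolution(y, u, v):
        """Bilinearly upsample the U and V planes so that all three channels
        share the Y resolution, as described above for YUV420/YUV422 input."""
        h, w = y.shape
        u_up = cv2.resize(u, (w, h), interpolation=cv2.INTER_LINEAR)
        v_up = cv2.resize(v, (w, h), interpolation=cv2.INTER_LINEAR)
        return np.stack([y, u_up, v_up], axis=-1)   # (h, w, 3) input to the network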
可选地,为了进一步提升系统的实时性,减少系统时延,可以采用以下可选方案:(1)在网络的训练和使用时,对输入的图像大小进行下采样(下采样一般不会影响场景中物体的分布以及运动信息),极大的减少网络计算的数据量,以提升速度;(2)对于本申请实施例采用的YUV视频格式,考虑到Y通道表示亮度信息,包含了大部分物体类别及运动信息,而且人眼对亮度信息最敏感,因此,可以只对Y通道进行训练及预测,减小数据量,提升实时性。如果采用上述两种实时性提升操作,可以将延迟做到1帧,极大的提升系统的实时性,提升交互体验。
在本申请实施例中,经过上述CNN结构和RNN结构的处理,对于获得的每帧图像的显著性图,在S330中,根据所述每帧图像对应的显著性图,确定所述每帧图像的ROI位置。
具体地,显著性图中所有位置的值均为0~1之间,该数值表征了人眼对该区域的关注程度的预测值,数值越大,在显著性图中越亮,表示人眼对该位置关注的可能性越高。根据不同场景中物体的类别信息和运动信息,热度(即人眼的关注度)分布的集中程度有所不同。例如,图12中的热度较为集中,物体类别和运动信息明显;而图13中热度分散,场景中没有明显的物体及运动,画面较为平坦。
鉴于上述对显著性图的热度分布的分析,可以采用不同的方式确定每个显著性图的ROI位置,这里的ROI位置可以包括ROI的中心位置及ROI的范围。具体地,对于确定ROI的中心位置,可以采用多种方式。例如,确定ROI的中心位置可以包括:将显著性图中像素值最大的位置,作为ROI中心坐标输出,例如,可以输出到虚化锐化模块240。或者,为了减小随机误差,确定ROI的中心位置还可以包括:确定每帧显著性图中像素值大于或者等于第一预设值的多个点的坐标;将该多个点多的坐标的平均值确定为每帧图像的ROI的中心坐标输出,例如,可以输出到虚化锐化模块240。其中,该第一预设值可以根据实际应用进行设置,并且可以设置显著性图像素值分布范围1至0之间的任意值,例如,可以设置为0.8,但本申请实施例并不限于此。
在上述两种确定ROI中心坐标的方式中,第二种方式通过取平均的方式,能够减小显著性图中像素点分布的随机误差,使得得到的ROI区域会更准确。或者还可以采用其它方式确定ROI的中心位置,但本申请实施例并 不限于此。
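Both ways of picking the ROI centre can be written in a few lines of NumPy (a sketch only; the value 0.8 is the example first preset value mentioned above, and the function names are illustrative):

    import numpy as np

    def roi_center_argmax(saliency):
        """Centre = coordinates of the largest value in the saliency map."""
        y, x = np.unravel_index(np.argmax(saliency), saliency.shape)
        return int(x), int(y)

    def roi_center_mean(saliency, first_preset=0.8):
        """Centre = mean coordinates of all points with value >= first_preset,
        which averages out random errors in the predicted map."""
        ys, xs = np.nonzero(saliency >= first_preset)
        if xs.size == 0:                 # no point above the threshold: fall back
            return roi_center_argmax(saliency)
        return int(xs.mean()), int(ys.mean())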
对于确定ROI范围,同样可以采用多种方式。例如,确定ROI的范围可以包括:根据每帧图像的尺寸,确定ROI的尺寸大小。例如,通常可以将ROI的尺寸设置为每帧图像的尺寸的预设倍数(例如1/4),比如可以通过将ROI的长和宽分别设置为图像尺寸一半大小来实现。
但是考虑到在热度分布较广的场景中,人眼可能的关注范围较大,因此,还可以对ROI的大小进行调整。即确定ROI的范围还可以包括:确定每帧图像对应的显著性图中像素值大于或者等于第二预设值的多个点的坐标;在该多个点中确定两个点,该两个点的横坐标之差的绝对值最大和/或纵坐标之差的绝对值最大;根据该两个点的横坐标之差的绝对值和/或纵坐标之差的绝对值,确定ROI的尺寸。
具体地,可以根据下述步骤中的至少一个调整ROI的尺寸:若该两个点的横坐标差值的绝对值大于或者等于预设长度,那么可以根据该两个点的横坐标差值的绝对值与该预设长度的比值,确定每帧图像的ROI的长度,例如,可以按照该比值,扩大预设长度为ROI的长度;若该两个点的横坐标差值的绝对值小于该预设长度,将该预设长度确定为每帧图像的ROI的长度;若该两个点的纵坐标差值的绝对值大于或者等于预设宽度,那么可以根据该两个点的纵坐标差值的绝对值与预设宽度的比值,确定每帧图像的ROI的宽度,例如,可以按照该比值,扩大预设宽度为ROI的宽度;若该两个点的纵坐标差值的绝对值小于该预设宽度,将该预设宽度确定为每帧图像的ROI的宽度。
应理解,该第二预设值可以根据实际应用进行设置,并且可以设置为显著性图像素值分布范围1至0之间的任意值,例如,可以设置为0.7,但本申请实施例并不限于此。
这样,根据设定的第二预设值,计算出显著性图中像素值大于或者等于该第二预设值的全部点的分布,通过计算这些满足第二预设值要求的点中每两个点对应的横纵坐标之差的绝对值,来描述热度分布的范围;例如,可以将将该横纵坐标之差的绝对值分别与默认的ROI大小进行比较,若大于ROI默认尺寸,则扩大ROI的大小;若小于ROI默认尺寸,则可以选择缩小ROI默认尺寸,或者直接采用ROI默认尺寸。其中,可以根据横纵坐标与ROI默认尺寸之比,确定扩大或者缩小ROI的尺寸。通过上述过程,可 以根据实际情况,更加精确的确定ROI的位置,并且根据不同场景调整ROI的大小,从而提升体验。
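The size adjustment just described can likewise be sketched as follows (the default ROI of one quarter of the frame, i.e. half the width and half the height, and the example second preset value 0.7 come from the passage above; everything else is illustrative):

    import numpy as np

    def roi_size(saliency, frame_w, frame_h, second_preset=0.7):
        """Start from the default ROI (half the frame width and height) and enlarge
        it when the spread of points with saliency >= second_preset exceeds it;
        scaling the default by the ratio spread/default simply yields the spread."""
        roi_w, roi_h = frame_w // 2, frame_h // 2
        ys, xs = np.nonzero(saliency >= second_preset)
        if xs.size == 0:
            return roi_w, roi_h
        spread_x = int(xs.max() - xs.min())   # largest |x_i - x_j| over the points
        spread_y = int(ys.max() - ys.min())   # largest |y_i - y_j| over the points
        if spread_x >= roi_w:
            roi_w = spread_x
        if spread_y >= roi_h:
            roi_h = spread_y
        return roi_w, roi_h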
因此,本申请实施例的图像处理的方法,考虑到当前为了解决带宽问题,会通过滤波进行ROI之外区域的虚化,减少图像高频信息,提高压缩率,最终减小带宽,所以ROI位置的确定对图像处理过程尤为重要。为了获得ROI的位置,一般采用眼球追踪设计和算法来检测并给出,但这会造成极大的延迟,使得人眼实际观察到的位置与得到的ROI位置错位,从而无法观察到高质量的视频图像。而本申请实施例中采用基于深度学习的视觉注意力预测模型,替代了眼球追踪设备,可以根据视频的内容,实时预测人眼感兴趣区域,可以避免眼球追踪等一些列延时过程,提高系统的实时性和实用性,且使得系统中各平台之间的可移植性提高。
另外,考虑到目前基于深度学习的视觉显著性预测方案主要应用于静态图像的处理,对于应用于视频序列的情况下,视频序列的帧间运动信息的提取需要进行特征的设计以及计算量较大,会导致视频显著性预测的进展缓慢。所以,本申请实施例基于深度学习,通过已标定的大规模视频显著性数据集,采用CNN和RNN相结合的网络模型,分别提取帧内空间信息和帧间运动信息(时间信息),得到视频序列的时空特征,实现端到端视频显著性预测。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
上文中结合图1至图13,详细描述了根据本申请实施例的图像处理的方法,下面将结合图14至图16,描述根据本申请实施例的图像处理的装置。
如图14所示,根据本申请实施例的图像处理的装置600包括:第一处理模块610、第二处理模块620以及确定模块630。具体地,所述第一处理模块610用于:采用CNN结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;所述第二处理模块620用于:采用RNN结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;所述确定模块630用于:根据所述每帧图像对应的显著性图,确定所述每帧图像的ROI位置。
应理解,根据本申请实施例的图像处理的装置600可对应于执行本申请实施例中的方法300,并且装置600中的各个模块的上述和其它操作和/或功能分别为了实现图1至图13中的各个方法的相应流程,为了简洁,在此不再赘述。
应理解,本申请各实施例的装置还可以基于存储器和处理器实现,各存储器用于存储用于执行本申请个实施例的方法的指令,处理器执行上述指令,使得装置执行本申请各实施例的方法。
具体地,如图15所示,根据本申请实施例的图像处理的装置700包括:处理器710和存储器720。具体地,处理器710和存储器720通过总线系统相连,该存储器720用于存储指令,该处理器710用于执行该存储器720存储的指令。处理器710可以调用存储器720中存储的程序代码执行以下操作:采用CNN结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;采用RNN结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;根据所述每帧图像对应的显著性图,确定所述每帧图像的ROI位置。
可选地,作为一个实施例,所述处理器710用于:对所述每帧图像进行连续的卷积和/或池化操作,以获取所述每帧图像的多张空间特征图,所述多张空间特征图具有不同分辨率;对所述多张空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取所述每帧图像的多张单通道特征图,所述多张单通道特征图具有相同分辨率;将所述每帧图像的所述多张单通道特征图组合为所述每帧图像的多通道特征图。
可选地,作为一个实施例,所述处理器710用于:根据预设网络模型结构,对所述每帧图像进行连续的卷积和池化操作,以获取所述每帧图像的至少三张空间特征图,所述至少三张空间特征图具有不同分辨率。
可选地,作为一个实施例,所述预设网络模型结构为VGG-16结构,所述处理器710用于:根据所述VGG-16结构,对所述每帧图像进行五组卷积池化操作,以获取所述每帧图像的三张空间特征图,其中,所述五组卷积池化操作包括13层卷积。
可选地,作为一个实施例,所述每帧图像的分辨率为w×h,所述三张空间特征图的分辨率分别为:
w/4×h/4、w/8×h/8和w/16×h/16。
可选地,作为一个实施例,所述处理器710用于:对所述每张空间特征 图进行反卷积操作,以获得分辨率相同的多张特征图;对所述多张特征图中每张特征图进行卷积操作,以获得所述多张单通道特征图。
可选地,作为一个实施例,所述每帧图像的分辨率为w×h,所述多张特征图的分辨率均为w×h。
可选地,作为一个实施例,所述反卷积操作中的反卷积步长设置为2。
可选地,作为一个实施例,所述处理器710用于:对所述每张特征图采用1*1的卷积层,获得所述多张单通道特征图。
可选地,作为一个实施例,所述RNN结构为LSTM结构。
可选地,作为一个实施例,所述处理器710用于:将所述多帧图像的多通道特征图按照时间顺序依次输入至所述LSTM结构,以输出所述每帧图像对应的多通道的处理后的特征图;对所述处理后的特征图采用1*1的卷积层,以获得所述每帧图像的单通道的显著性图。
可选地,作为一个实施例,所述处理器710用于:在第t个时刻,将所述第t帧图像的多通道特征图输入至所述LSTM结构,并根据第t-1个时刻输出的细胞状态c t-1和隐藏状态h t-1,输出所述第t帧图像对应的多通道的处理后的特征图以及输出第t个时刻的细胞状态c t和隐藏状态h t,t为任意正整数。
可选地,作为一个实施例,所述LSTM结构的循环层大小设置为10。
可选地,作为一个实施例,所述处理器710用于:根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸。
可选地,作为一个实施例,所述处理器710用于:将所述每帧图像对应的显著性图中像素值最大的点的坐标,确定为所述每帧图像的所述ROI的中心坐标。
可选地，作为一个实施例，所述处理器710用于：确定所述每帧图像对应的显著性图中像素值大于或者等于第一预设值的多个点的坐标；将所述多个点的坐标的平均值确定为所述每帧图像的所述ROI的中心坐标。
可选地,作为一个实施例,所述处理器710用于:根据所述每帧图像的尺寸,确定所述每帧图像的所述ROI的尺寸。
可选地,作为一个实施例,所述处理器710用于:将所述每帧图像的所述ROI的尺寸设置为所述每帧图像的尺寸的1/4。
可选地,作为一个实施例,所述处理器710用于:确定所述每帧图像对应的显著性图中像素值大于或者等于第二预设值的多个点的坐标;确定所述多个点中的两个点,所述两个点的横坐标差值的绝对值最大和/或纵坐标差值的绝对值最大;根据所述两个点的横坐标差值的绝对值和/或纵坐标差值的绝对值,确定所述每帧图像的所述ROI的尺寸。
可选地,作为一个实施例,所述处理器710用于执行以下步骤中的至少一个:若所述两个点的横坐标差值的绝对值大于或者等于预设长度,根据所述两个点的横坐标差值的绝对值与所述预设长度的比值,确定所述每帧图像的所述ROI的长度;若所述两个点的横坐标差值的绝对值小于所述预设长度,将所述预设长度确定为所述每帧图像的所述ROI的长度;若所述两个点的纵坐标差值的绝对值大于或者等于预设宽度,根据所述两个点的纵坐标差值的绝对值与所述预设宽度的比值,确定所述每帧图像的所述ROI的宽度;若所述两个点的纵坐标差值的绝对值小于所述预设宽度,将所述预设宽度确定为所述每帧图像的所述ROI的宽度。
应理解,根据本申请实施例的图像处理的装置700可对应于本申请实施例中的装置600,并可对应于执行本申请实施例中的方法300,并且装置700中的各个部分的上述和其它操作和/或功能分别为了实现图1至图13中的各个方法的相应流程,为了简洁,在此不再赘述。
因此,本申请实施例的图像处理装置,考虑到当前为了解决图像处理过程中带宽问题,会通过滤波进行ROI之外区域的虚化,减少图像高频信息,提高压缩率,最终减小带宽,所以ROI位置的确定对图像处理过程尤为重要。为了能够准确的获得ROI的位置,本申请实施例中采用基于深度学习的视觉注意力预测模型,可以根据视频的内容,实时预测人眼感兴趣区域,减少系统时延,提高系统的实时性和实用性,且使得系统中各平台之间的可移植性提高。
另外,考虑到目前基于深度学习的视觉显著性预测方案主要应用于静态图像的处理,对于应用于视频序列的情况下,视频序列的帧间运动信息的提取需要进行特征的设计以及计算量较大,会导致视频显著性预测的进展缓慢。所以,本申请实施例基于深度学习,通过已标定的大规模视频显著性数据集,采用CNN和RNN相结合的网络模型,分别提取帧内空间信息和帧间运动信息(时间信息),得到视频序列的时空特征,实现端到端视频显著性 预测。
应理解,本申请实施例中提及的处理器可以是中央处理单元(Central Processing Unit,CPU),还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中提及的存储器可以是易失性存储器或非易失性存储器,或可包括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(Read-Only Memory,ROM)、可编程只读存储器(Programmable ROM,PROM)、可擦除可编程只读存储器(Erasable PROM,EPROM)、电可擦除可编程只读存储器(Electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(Random Access Memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的RAM可用,例如静态随机存取存储器(Static RAM,SRAM)、动态随机存取存储器(Dynamic RAM,DRAM)、同步动态随机存取存储器(Synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(Double Data Rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(Enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(Synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(Direct Rambus RAM,DR RAM)。
需要说明的是,当处理器为通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件时,存储器(存储模块)集成在处理器中。
应注意,本文描述的存储器旨在包括但不限于这些和任意其它适合类型的存储器。
本申请实施例还提供一种计算机可读存储介质,其上存储有指令,当指令在计算机上运行时,使得计算机执行上述各方法实施例的方法。
本申请实施例还提供一种计算设备,该计算设备包括上述计算机可读存储介质。
本申请实施例可以应用在飞行器,尤其是无人机领域。
可选的,本申请实施例还提出了一种可移动平台。具体地,图16示出了本申请实施例的可移动平台800的示意性框图。如图16所示,该可移动平台800包括:机体810;动力系统820,设于该机体810内,用于为该可移动平台800提供动力;一个或者多个处理器830,用于执行本申请实施例的方法300。其中,该处理器830可以包括本申请实施例的图像处理装置600;可选地,该可移动平台800还可以包括:图像采集装置,用于采集图像,以使得处理器830对采集到的图像进行处理,例如对采集到的图像执行上述任一项图像处理方法。
本发明实施例中的可移动平台800可以指任意可移动设备,该可移动设备可以在任何合适的环境下移动,例如,空气中(例如,定翼飞机、旋翼飞机,或既没有定翼也没有旋翼的飞机)、水中(例如,轮船或潜水艇)、陆地上(例如,汽车或火车)、太空(例如,太空飞机、卫星或探测器),以及以上各种环境的任何组合。该可移动设备可以是飞机,例如无人机(Unmanned Aerial Vehicle,简称为“UAV”)。
机体810也可以称为机身,该机身可以包括中心架以及与中心架连接的一个或多个机臂,一个或多个机臂呈辐射状从中心架延伸出。脚架与机身连接,用于在UAV着陆时起支撑作用。
动力系统820可以包括电子调速器(简称为电调)、一个或多个螺旋桨以及与一个或多个螺旋桨相对应的一个或多个电机,其中电机连接在电子调速器与螺旋桨之间,电机和螺旋桨设置在对应的机臂上;电子调速器用于接收飞行控制器产生的驱动信号,并根据驱动信号提供驱动电流给电机,以控制电机的转速。电机用于驱动螺旋桨旋转,从而为UAV的飞行提供动力,该动力使得UAV能够实现一个或多个自由度的运动。应理解,电机可以是直流电机,也可以交流电机。另外,电机可以是无刷电机,也可以有刷电机。
所述图像采集装置包括拍摄设备(例如,相机、摄像机等)或视觉传感器(例如,单目摄像头或双/多目摄像头等)。
可选地,本申请实施例还提出了一种包含无人机的无人飞行系统。具体地,以下将结合图17对包含无人机的无人飞行系统900进行说明。本实施例以旋翼飞行器为例进行说明。
无人飞行系统900可以包括UAV 910、载体920、显示设备930和遥控 装置940。其中,UAV 910可以包括动力系统950、飞行控制系统960和机架970。UAV 910可以与遥控装置940和显示设备930进行无线通信。
机架970可以包括机身和脚架(也称为起落架)。机身可以包括中心架以及与中心架连接的一个或多个机臂,一个或多个机臂呈辐射状从中心架延伸出。脚架与机身连接,用于在UAV 910着陆时起支撑作用。
动力系统950可以包括电子调速器(简称为电调)951、一个或多个螺旋桨953以及与一个或多个螺旋桨953相对应的一个或多个电机952,其中电机952连接在电子调速器951与螺旋桨953之间,电机952和螺旋桨953设置在对应的机臂上;电子调速器951用于接收飞行控制器960产生的驱动信号,并根据驱动信号提供驱动电流给电机952,以控制电机952的转速。电机952用于驱动螺旋桨旋转,从而为UAV 910的飞行提供动力,该动力使得UAV 910能够实现一个或多个自由度的运动。应理解,电机952可以是直流电机,也可以交流电机。另外,电机952可以是无刷电机,也可以有刷电机。
飞行控制系统960可以包括飞行控制器961和传感系统962。传感系统962用于测量UAV的姿态信息。传感系统962例如可以包括陀螺仪、电子罗盘、IMU(惯性测量单元,Inertial Measurement Unit)、视觉传感器(例如,单目摄像头或双/多目摄像头等)、GPS(全球定位系统,Global Positioning System)、气压计和视觉惯导里程计等传感器中的至少一种。飞行控制器961用于控制UAV 910的飞行,例如,可以根据传感系统962测量的姿态信息控制UAV 910的飞行。
载体920可以用来承载负载980。例如,当载体920为云台设备时,负载980可以为拍摄设备(例如,相机、摄像机等),本申请的实施例并不限于此,例如,载体也可以是用于承载武器或其它负载的承载设备。
显示设备930位于无人飞行系统900的地面端,可以通过无线方式与UAV 910进行通信,并且可以用于显示UAV 910的姿态信息。另外,当负载980为拍摄设备时,还可以在显示设备930上显示拍摄设备拍摄的图像。应理解,显示设备930可以是独立的设备,也可以设置在遥控装置940中。示例性的,上述接收与解码模块260可安装在显示设备上,所述显示设备用于显示进行虚化锐化处理后的图像。
遥控装置940位于无人飞行系统900的地面端,可以通过无线方式与 UAV 910进行通信,用于对UAV 910进行远程操纵。遥控装置例如可以是遥控器或者安装有控制UAV的APP(应用程序,Application)的遥控装置,例如,智能手机、平板电脑等。本申请的实施例中,通过遥控装置接收用户的输入,可以指通过遥控器上的拔轮、按钮、按键、摇杆等输入装置或者遥控装置上的用户界面(UI)对UAV进行操控。
除了上述提到的可移动设备,本发明实施例可以应用于其它具有摄像头的载具,例如虚拟现实(Virtual Reality,VR)/增强现实(Augmented Reality,AR)眼镜等设备。
应理解,本申请各实施例的电路、子电路、子单元的划分只是示意性的。本领域普通技术人员可以意识到,本文中所公开的实施例描述的各示例的电路、子电路和子单元,能够再行拆分或组合。
在上述实施例中,可以全部或部分地通过软件、硬件、固件或者其任意组合来实现。当使用软件实现时,可以全部或部分地以计算机程序产品的形式实现。计算机程序产品包括一个或多个计算机指令。在计算机上加载和执行计算机指令时,全部或部分地产生按照本申请实施例的流程或功能。计算机可以是通用计算机、专用计算机、计算机网络、或者其他可编程装置。计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如同轴电缆、光纤、数字用户线(Digital Subscriber Line,DSL))或无线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集成的服务器、数据中心等数据存储设备。可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,高密度数字视频光盘(Digital Video Disc,DVD))、或者半导体介质(例如,固态硬盘(Solid State Disk,SSD))等。
应理解,本申请各实施例均是以总位宽为16位(bit)为例进行说明的,本申请各实施例可以适用于其他的位宽。
应理解,说明书通篇中提到的“一个实施例”或“一实施例”意味着与实施例有关的特定特征、结构或特性包括在本申请的至少一个实施例中。因此,在整个说明书各处出现的“在一个实施例中”或“在一实施例中”未必一定 指相同的实施例。此外,这些特定的特征、结构或特性可以任意适合的方式结合在一个或多个实施例中。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
应理解,在本申请实施例中,“与A相应的B”表示B与A相关联,根据A可以确定B。但还应理解,根据A确定B并不意味着仅仅根据A确定B,还可以根据A和/或其它信息确定B。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法步骤,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的 部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (44)

  1. 一种图像处理的方法,其特征在于,包括:
    采用卷积神经网络(CNN)结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;
    采用循环神经网络(RNN)结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;
    根据所述每帧图像对应的显著性图,确定所述每帧图像的感兴趣区域(ROI)位置。
  2. 根据权利要求1所述的方法,其特征在于,所述采用卷积神经网络CNN结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图,包括:
    对所述每帧图像进行连续的卷积和/或池化操作,以获取所述每帧图像的多张空间特征图,所述多张空间特征图具有不同分辨率;
    对所述多张空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取所述每帧图像的多张单通道特征图,所述多张单通道特征图具有相同分辨率;
    将所述每帧图像的所述多张单通道特征图组合为所述每帧图像的多通道特征图。
  3. 根据权利要求2所述的方法,其特征在于,所述对所述每帧图像进行连续的卷积和/或池化操作,以获取所述每帧图像的多张空间特征图,包括:
    根据预设网络模型结构,对所述每帧图像进行连续的卷积和池化操作,以获取所述每帧图像的至少三张空间特征图,所述至少三张空间特征图具有不同分辨率。
  4. 根据权利要求3所述的方法,其特征在于,所述预设网络模型结构为VGG-16结构,
    所述根据所述预设网络模型结构,对所述每帧图像进行连续的卷积和池化操作,以获取所述每帧图像的至少三张空间特征图,包括:
    根据所述VGG-16结构,对所述每帧图像进行五组卷积池化操作,以获取所述每帧图像的三张空间特征图,其中,所述五组卷积池化操作包括13层卷积。
  5. 根据权利要求4所述的方法,其特征在于,所述每帧图像的分辨率为w×h,所述三张空间特征图的分辨率分别为:
    w/4×h/4、w/8×h/8和w/16×h/16。
  6. 根据权利要求2至5中任一项所述的方法,其特征在于,所述对所述多张空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取所述每帧图像的多张单通道特征图,包括:
    对所述每张空间特征图进行反卷积操作,以获得分辨率相同的多张特征图;
    对所述多张特征图中每张特征图进行卷积操作,以获得所述多张单通道特征图。
  7. 根据权利要求6所述的方法,其特征在于,所述每帧图像的分辨率为w×h,所述多张特征图的分辨率均为w×h。
  8. 根据权利要求6或7所述的方法,其特征在于,所述反卷积操作中的反卷积步长设置为2。
  9. 根据权利要求6至8中任一项所述的方法,其特征在于,所述对所述多张特征图中每张特征图进行卷积操作,以获得所述多张单通道特征图,包括:
    对所述每张特征图采用1*1的卷积层,获得所述多张单通道特征图。
  10. 根据权利要求1至9中任一项所述的方法,其特征在于,所述RNN结构为长短时记忆网络(LSTM)结构。
  11. 根据权利要求10所述的方法,其特征在于,所述采用循环神经网络RNN结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图,包括:
    将所述多帧图像的多通道特征图按照时间顺序依次输入至所述LSTM结构,以输出所述每帧图像对应的多通道的处理后的特征图;
    对所述处理后的特征图采用1*1的卷积层,以获得所述每帧图像的单通道的显著性图。
  12. 根据权利要求11所述的方法,其特征在于,所述将所述多帧图像的多通道特征图按照时间顺序依次输入至所述LSTM结构,以输出所述每帧图像对应的多通道的处理后的特征图,包括:
    在第t个时刻,将所述第t帧图像的多通道特征图输入至所述LSTM结构,并根据第t-1个时刻输出的细胞状态c t-1和隐藏状态h t-1,输出所述第t 帧图像对应的多通道的处理后的特征图以及输出第t个时刻的细胞状态c t和隐藏状态h t,t为任意正整数。
  13. 根据权利要求11或12所述的方法,其特征在于,所述LSTM结构的循环层大小设置为10。
  14. 根据权利要求1至13中任一项所述的方法,其特征在于,所述根据所述每帧图像对应的显著性图,确定所述每帧图像的感兴趣区域ROI位置,包括:
    根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸。
  15. 根据权利要求14所述的方法,其特征在于,所述根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸,包括:
    将所述每帧图像对应的显著性图中像素值最大的点的坐标,确定为所述每帧图像的所述ROI的中心坐标。
  16. 根据权利要求14所述的方法,其特征在于,所述根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸,包括:
    确定所述每帧图像对应的显著性图中像素值大于或者等于第一预设值的多个点的坐标;
    将所述多个点的坐标的平均值确定为所述每帧图像的所述ROI的中心坐标。
  17. 根据权利要求14至16中任一项所述的方法,其特征在于,所述根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸,包括:
    根据所述每帧图像的尺寸,确定所述每帧图像的所述ROI的尺寸。
  18. 根据权利要求17所述的方法,其特征在于,所述根据所述每帧图像的尺寸,确定每帧图像的所述ROI的范围,包括:
    将所述每帧图像的所述ROI的尺寸设置为所述每帧图像的尺寸的1/4。
  19. 根据权利要求14至16中任一项所述的方法,其特征在于,所述根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸,包括:
    确定所述每帧图像对应的显著性图中像素值大于或者等于第二预设值的多个点的坐标;
    确定所述多个点中的两个点,所述两个点的横坐标差值的绝对值最大和/或纵坐标差值的绝对值最大;
    根据所述两个点的横坐标差值的绝对值和/或纵坐标差值的绝对值,确定所述每帧图像的所述ROI的尺寸。
  20. 根据权利要求19所述的方法,其特征在于,所述根据所述两个点的横坐标差值的绝对值和/或纵坐标差值的绝对值,确定所述每帧图像的所述ROI的尺寸,包括以下步骤中的至少一个:
    若所述两个点的横坐标差值的绝对值大于或者等于预设长度,根据所述两个点的横坐标差值的绝对值与所述预设长度的比值,确定所述每帧图像的所述ROI的长度;
    若所述两个点的横坐标差值的绝对值小于所述预设长度,将所述预设长度确定为所述每帧图像的所述ROI的长度;
    若所述两个点的纵坐标差值的绝对值大于或者等于预设宽度,根据所述两个点的纵坐标差值的绝对值与所述预设宽度的比值,确定所述每帧图像的所述ROI的宽度;
    若所述两个点的纵坐标差值的绝对值小于所述预设宽度,将所述预设宽度确定为所述每帧图像的所述ROI的宽度。
  21. 一种图像处理的装置,其特征在于,包括:处理器和存储器,
    所述存储器用于存储指令,
    所述处理器用于执行所述存储器存储的指令，并且当所述处理器执行所述存储器存储的指令时，所述处理器用于：
    采用卷积神经网络(CNN)结构,对多帧图像中的每帧图像进行处理,以获得所述每帧图像的多通道特征图;
    采用循环神经网络(RNN)结构,对所述多帧图像的多通道特征图进行处理,以获得所述每帧图像的单通道的显著性图;
    根据所述每帧图像对应的显著性图,确定所述每帧图像的感兴趣区域(ROI)位置。
  22. 根据权利要求21所述的装置,其特征在于,所述处理器用于:
    对所述每帧图像进行连续的卷积和/或池化操作,以获取所述每帧图像的 多张空间特征图,所述多张空间特征图具有不同分辨率;
    对所述多张空间特征图中的每张空间特征图进行反卷积和/或卷积操作,以获取所述每帧图像的多张单通道特征图,所述多张单通道特征图具有相同分辨率;
    将所述每帧图像的所述多张单通道特征图组合为所述每帧图像的多通道特征图。
  23. 根据权利要求22所述的装置,其特征在于,所述处理器用于:
    根据预设网络模型结构,对所述每帧图像进行连续的卷积和池化操作,以获取所述每帧图像的至少三张空间特征图,所述至少三张空间特征图具有不同分辨率。
  24. 根据权利要求23所述的装置,其特征在于,所述预设网络模型结构为VGG-16结构,
    所述处理器用于:
    根据所述VGG-16结构,对所述每帧图像进行五组卷积池化操作,以获取所述每帧图像的三张空间特征图,其中,所述五组卷积池化操作包括13层卷积。
  25. 根据权利要求24所述的装置,其特征在于,所述每帧图像的分辨率为w×h,所述三张空间特征图的分辨率分别为:
    w/4×h/4、w/8×h/8和w/16×h/16。
  26. 根据权利要求22至25中任一项所述的装置,其特征在于,所述处理器用于:
    对所述每张空间特征图进行反卷积操作,以获得分辨率相同的多张特征图;
    对所述多张特征图中每张特征图进行卷积操作,以获得所述多张单通道特征图。
  27. 根据权利要求26所述的装置,其特征在于,所述每帧图像的分辨率为w×h,所述多张特征图的分辨率均为w×h。
  28. 根据权利要求26或27所述的装置,其特征在于,所述反卷积操作中的反卷积步长设置为2。
  29. 根据权利要求26至28中任一项所述的装置,其特征在于,所述处理器用于:
    对所述每张特征图采用1*1的卷积层,获得所述多张单通道特征图。
  30. 根据权利要求21至29中任一项所述的装置,其特征在于,所述RNN结构为长短时记忆网络(LSTM)结构。
  31. 根据权利要求30所述的装置,其特征在于,所述处理器用于:
    将所述多帧图像的多通道特征图按照时间顺序依次输入至所述LSTM结构,以输出所述每帧图像对应的多通道的处理后的特征图;
    对所述处理后的特征图采用1*1的卷积层,以获得所述每帧图像的单通道的显著性图。
  32. 根据权利要求31所述的装置,其特征在于,所述处理器用于:
    在第t个时刻,将所述第t帧图像的多通道特征图输入至所述LSTM结构,并根据第t-1个时刻输出的细胞状态c t-1和隐藏状态h t-1,输出所述第t帧图像对应的多通道的处理后的特征图以及输出第t个时刻的细胞状态c t和隐藏状态h t,t为任意正整数。
  33. 根据权利要求31或32所述的装置,其特征在于,所述LSTM结构的循环层大小设置为10。
  34. 根据权利要求21至33中任一项所述的装置,其特征在于,所述处理器用于:
    根据所述每帧图像对应的显著性图中不同位置的像素值,确定所述每帧图像的所述ROI位置,所述ROI位置包括所述ROI的中心坐标和/或尺寸。
  35. 根据权利要求34所述的装置,其特征在于,所述处理器用于:
    将所述每帧图像对应的显著性图中像素值最大的点的坐标,确定为所述每帧图像的所述ROI的中心坐标。
  36. 根据权利要求34所述的装置,其特征在于,所述处理器用于:
    确定所述每帧图像对应的显著性图中像素值大于或者等于第一预设值的多个点的坐标;
    将所述多个点的坐标的平均值确定为所述每帧图像的所述ROI的中心坐标。
  37. 根据权利要求34至36中任一项所述的装置,其特征在于,所述处理器用于:
    根据所述每帧图像的尺寸,确定所述每帧图像的所述ROI的尺寸。
  38. 根据权利要求37所述的装置,其特征在于,所述处理器用于:
    将所述每帧图像的所述ROI的尺寸设置为所述每帧图像的尺寸的1/4。
  39. 根据权利要求34至36中任一项所述的装置,其特征在于,所述处理器用于:
    确定所述每帧图像对应的显著性图中像素值大于或者等于第二预设值的多个点的坐标;
    确定所述多个点中的两个点,所述两个点的横坐标差值的绝对值最大和/或纵坐标差值的绝对值最大;
    根据所述两个点的横坐标差值的绝对值和/或纵坐标差值的绝对值,确定所述每帧图像的所述ROI的尺寸。
  40. 根据权利要求39所述的装置,其特征在于,所述处理器用于执行以下步骤中的至少一个:
    若所述两个点的横坐标差值的绝对值大于或者等于预设长度,根据所述两个点的横坐标差值的绝对值与所述预设长度的比值,确定所述每帧图像的所述ROI的长度;
    若所述两个点的横坐标差值的绝对值小于所述预设长度,将所述预设长度确定为所述每帧图像的所述ROI的长度;
    若所述两个点的纵坐标差值的绝对值大于或者等于预设宽度,根据所述两个点的纵坐标差值的绝对值与所述预设宽度的比值,确定所述每帧图像的所述ROI的宽度;
    若所述两个点的纵坐标差值的绝对值小于所述预设宽度,将所述预设宽度确定为所述每帧图像的所述ROI的宽度。
  41. 一种计算机可读存储介质,其特征在于,其上存储有计算机程序,所述计算机程序在被执行时,实现如权利要求1至20中任一项所述的方法。
  42. 一种包含指令的计算机程序产品,其特征在于,所述指令被计算机执行时使得计算机执行如权利要求1至20中任一项所述的方法。
  43. 一种可移动平台,其特征在于,包括:
    机体;
    动力系统,设于所述机体内,所述动力系统用于为所述可移动平台提供动力;以及
    一个或多个处理器,用于执行上述权利要求1至20中任一项所述的方法。
  44. 一种系统,其特征在于,包括:如权利要求43所述的可移动平台和显示设备,
    所述可移动平台与所述显示设备有线连接或无线连接。
PCT/CN2020/092827 2020-05-28 2020-05-28 图像处理的方法、装置、可移动平台以及系统 WO2021237555A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202080004902.6A CN112673380A (zh) 2020-05-28 2020-05-28 图像处理的方法、装置、可移动平台以及系统
PCT/CN2020/092827 WO2021237555A1 (zh) 2020-05-28 2020-05-28 图像处理的方法、装置、可移动平台以及系统

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2020/092827 WO2021237555A1 (zh) 2020-05-28 2020-05-28 图像处理的方法、装置、可移动平台以及系统

Publications (1)

Publication Number Publication Date
WO2021237555A1 true WO2021237555A1 (zh) 2021-12-02

Family

ID=75413910

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092827 WO2021237555A1 (zh) 2020-05-28 2020-05-28 图像处理的方法、装置、可移动平台以及系统

Country Status (2)

Country Link
CN (1) CN112673380A (zh)
WO (1) WO2021237555A1 (zh)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160042252A1 (en) * 2014-08-05 2016-02-11 Sri International Multi-Dimensional Realization of Visual Content of an Image Collection
US20170169315A1 (en) * 2015-12-15 2017-06-15 Sighthound, Inc. Deeply learned convolutional neural networks (cnns) for object localization and classification
CN107247952A (zh) * 2016-07-28 2017-10-13 哈尔滨工业大学 基于深层监督的循环卷积神经网络的视觉显著性检测方法
CN107967474A (zh) * 2017-11-24 2018-04-27 上海海事大学 一种基于卷积神经网络的海面目标显著性检测方法
CN108521862A (zh) * 2017-09-22 2018-09-11 深圳市大疆创新科技有限公司 用于跟踪拍摄的方法和设备
CN110598610A (zh) * 2019-09-02 2019-12-20 北京航空航天大学 一种基于神经选择注意的目标显著性检测方法
US20200097754A1 (en) * 2018-09-25 2020-03-26 Honda Motor Co., Ltd. Training saliency

Also Published As

Publication number Publication date
CN112673380A (zh) 2021-04-16

Similar Documents

Publication Publication Date Title
US11165959B2 (en) Connecting and using building data acquired from mobile devices
CN108961312B (zh) 用于嵌入式视觉系统的高性能视觉对象跟踪方法及系统
US11064178B2 (en) Deep virtual stereo odometry
US11315266B2 (en) Self-supervised depth estimation method and system
WO2019161813A1 (zh) 动态场景的三维重建方法以及装置和系统、服务器、介质
US20210133996A1 (en) Techniques for motion-based automatic image capture
WO2021043273A1 (zh) 图像增强方法和装置
CN112567201A (zh) 距离测量方法以及设备
US11057604B2 (en) Image processing method and device
US20220058407A1 (en) Neural Network For Head Pose And Gaze Estimation Using Photorealistic Synthetic Data
CN108759826B (zh) 一种基于手机和无人机多传感参数融合的无人机运动跟踪方法
CN114424250A (zh) 结构建模
CN111046734A (zh) 基于膨胀卷积的多模态融合视线估计方法
WO2022021027A1 (zh) 目标跟踪方法、装置、无人机、系统及可读存储介质
US11831931B2 (en) Systems and methods for generating high-resolution video or animated surface meshes from low-resolution images
US20210009270A1 (en) Methods and system for composing and capturing images
WO2022133381A1 (en) Object segmentation and feature tracking
CN112907557A (zh) 道路检测方法、装置、计算设备及存储介质
US11188787B1 (en) End-to-end room layout estimation
KR20210029692A (ko) 비디오 영상에 보케 효과를 적용하는 방법 및 기록매체
US20220198731A1 (en) Pixel-aligned volumetric avatars
CA3069813C (en) Capturing, connecting and using building interior data from mobile devices
WO2021237555A1 (zh) 图像处理的方法、装置、可移动平台以及系统
CN111611869A (zh) 一种基于串行深度神经网络的端到端单目视觉避障方法
Pinard et al. End-to-end depth from motion with stabilized monocular videos

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20937758

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20937758

Country of ref document: EP

Kind code of ref document: A1