WO2022083118A1 - Data processing method and related device - Google Patents

Data processing method and related device

Info

Publication number
WO2022083118A1
Authority
WO
WIPO (PCT)
Prior art keywords
image
target
background
optical flow
pose
Prior art date
Application number
PCT/CN2021/095141
Other languages
English (en)
French (fr)
Inventor
王波
张梦晗
王海涛
李江
Original Assignee
Huawei Technologies Co., Ltd. (华为技术有限公司)
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co., Ltd. (华为技术有限公司)
Publication of WO2022083118A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T7/00 Image analysis
    • G06T7/20 Analysis of motion
    • G06T7/269 Analysis of motion using gradient-based methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00 Geometric image transformations in the plane of the image
    • G06T3/40 Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/4038 Image mosaicing, e.g. composing plane images from plane sub-images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/50 Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/77 Retouching; Inpainting; Scratch removal
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2200/00 Indexing scheme for image data processing or generation, in general
    • G06T2200/32 Indexing scheme for image data processing or generation, in general involving image mosaicing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20081 Training; Learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20084 Artificial neural networks [ANN]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00 Indexing scheme for image analysis or image enhancement
    • G06T2207/20 Special algorithmic details
    • G06T2207/20212 Image combination
    • G06T2207/20221 Image fusion; Image merging

Definitions

  • the present application relates to the field of communications, and in particular, to a data processing method and related equipment.
  • Panoramic moments are a kind of special effect that uses computer vision technology to achieve enhanced slow motion and time pause. The technology is used in fields such as movies and TV (for example, bullet time in The Matrix) and live sports events (for example, Intel TrueView).
  • At present, the way to obtain panoramic moments is to select a venue (such as a basketball court) in advance, set up a large number of expensive high-definition cameras at fixed positions around the venue, and have the multiple cameras capture the scene synchronously. A 3D model of the subject (for example, a basketball player) is then reconstructed using 3D modeling, and the scene and the 3D character model are rendered to obtain the panoramic moments. This enables the audience to experience the impact and immersion that traditional live broadcasting cannot provide.
  • Embodiments of the present application provide a data processing method and related equipment, which can be used to generate intermediate-view images.
  • A first aspect of the embodiments of the present application provides a data processing method, which may be executed by a data processing apparatus or by a component of the data processing apparatus (for example, a processor, a chip, or a chip system), where the data processing apparatus may be a local device (for example, a mobile phone or a camera) or a cloud device.
  • The method can also be executed jointly by the local device and the cloud device.
  • The method includes: acquiring a first image and a second image, where the first image is an image collected from a first viewing angle and the second image is an image collected from a second viewing angle; acquiring a relative pose between the first image and the second image; and generating a third image based on the first image, the second image, and the relative pose, where the viewing angle of the third image is between the first viewing angle and the second viewing angle.
  • The above-mentioned first image and second image are images collected by a first acquisition device and a second acquisition device for the same photographed object at the same time and from different viewing angles, and there is first overlapping content between the first image and the second image, where the first overlapping content includes the photographed object.
  • The above-mentioned relative pose can be understood as the relative pose between a first pose and a second pose, where the first pose is the pose of the first acquisition device when it collects the first image, and the second pose is the pose of the second acquisition device when it collects the second image.
  • the above-mentioned third image includes part or all of the first overlapping content, and the third image includes the photographed object.
  • the orientations of the photographed objects in the first image and the photographed objects in the second image overlap.
  • In this way, a third image is generated whose viewing angle is between the first viewing angle and the second viewing angle. Images from other perspectives are synthesized from the existing perspective images and the relative pose, which improves the fineness of the output effect.
  • The relative pose includes a first relative pose and a second relative pose, where the first relative pose is the pose of the first image relative to the second image, and the second relative pose is the pose of the second image relative to the first image.
  • Generating the third image includes: inputting the first image and the second image into a trained optical flow computing network for optical flow calculation to obtain an initial optical flow map; processing the first image and the initial optical flow map by a forward warping method to obtain a first target optical flow image; processing the second image and the initial optical flow map by the forward warping method to obtain a second target optical flow image; processing the first image and the first relative pose by an image warping method to obtain a first warped image; processing the second image and the second relative pose by the image warping method to obtain a second warped image; and inputting the first target optical flow image, the first warped image, the second target optical flow image, and the second warped image into a trained image inpainting network for image inpainting to obtain the third image.
  • the above-mentioned first relative pose is the pose of the first collection device at the first collection moment relative to the second collection device at the second collection moment.
  • the second relative pose is the pose of the second collection device at the second collection moment relative to the first collection device at the first collection moment.
  • By combining the first warped image and the second warped image, which have relatively complete features, with the first target optical flow image and the second target optical flow image, which have clear detail features, information complementation between the warped images and the target optical flow images is achieved. This provides more reference for the subsequent image inpainting network when generating the third image, so that the generated third image is smoother.
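  • As an illustration only, the generation step described above can be organized as in the following sketch, assuming PyTorch tensors and placeholder modules flow_net (the optical flow computing network) and inpaint_net (the image inpainting network); forward_warp and warp_by_pose are hypothetical helpers standing in for the forward warping and image warping methods, and halving the flow to reach the middle view is an assumption, not something stated in the text.

```python
import torch

def generate_third_image(img1, img2, pose_1to2, pose_2to1,
                         flow_net, inpaint_net, forward_warp, warp_by_pose):
    # 1. Optical flow calculation on the two input views.
    initial_flow = flow_net(img1, img2)

    # 2. Forward-warp each input image with the initial optical flow map
    #    (scaling by 0.5 is one plausible way to reach the middle view).
    target_flow_img1 = forward_warp(img1, 0.5 * initial_flow)
    target_flow_img2 = forward_warp(img2, -0.5 * initial_flow)

    # 3. Warp each input image with its relative pose (e.g. a transformation matrix).
    warped_img1 = warp_by_pose(img1, pose_1to2)
    warped_img2 = warp_by_pose(img2, pose_2to1)

    # 4. The inpainting network takes the four intermediate results and
    #    outputs the third (middle-view) image.
    return inpaint_net(torch.cat(
        [target_flow_img1, warped_img1, target_flow_img2, warped_img2], dim=1))
```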
  • The trained optical flow computing network and the trained image inpainting network in the above steps are obtained by taking the first training image and the second training image as inputs of the optical flow computing network and jointly training the optical flow computing network and the image inpainting network with the goal that the value of a loss function is less than a second threshold. The loss function is used to indicate the difference between the image output by the image inpainting network and a third target image, where the third target image is an image collected at a viewing angle between the first target viewing angle corresponding to the first target image and the second target viewing angle corresponding to the second target image.
  • The first training image, the second training image, and the third target image are used to implement the training process of the optical flow computing network and the image inpainting network, so as to provide more optimized networks for subsequent use and improve the fineness of the output image (that is, the third image).
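  • A minimal sketch of this joint training, assuming PyTorch, a data loader that yields (first training image, second training image, third target image) triples, and an L1 loss as the measure of the difference to the third target image (the loss form and all names are assumptions):

```python
import torch
import torch.nn.functional as F

def train_jointly(flow_net, inpaint_net, loader, second_threshold=1e-3, lr=1e-4):
    # Both networks are optimized together (joint training).
    params = list(flow_net.parameters()) + list(inpaint_net.parameters())
    optimizer = torch.optim.Adam(params, lr=lr)
    for train_img1, train_img2, third_target in loader:
        pred = inpaint_net(flow_net(train_img1, train_img2))  # simplified pipeline
        loss = F.l1_loss(pred, third_target)  # difference to the third target image
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        if loss.item() < second_threshold:    # training goal: loss below the threshold
            break
```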
  • The relative pose includes a transformation matrix, and the transformation matrix is used to describe the association between the pixels of the first image and the second image. The first relative pose includes a first transformation matrix, which is a matrix of the first image relative to the second image; the second relative pose includes a second transformation matrix, which is a matrix of the second image relative to the first image.
  • In other words, the expression form of the relative pose is a transformation matrix.
  • Because the relative pose is described by a transformation matrix, the transformation matrix, the first image, and the second image can be processed directly through image warping to obtain the first warped image and the second warped image, which makes the approach highly versatile.
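  • For illustration, warping an image with such a transformation matrix can be done as in the following sketch, assuming the relative pose is expressed as a 3x3 homography matrix H and using OpenCV's warpPerspective; this is one possible realization, not the patent's prescribed implementation.

```python
import cv2
import numpy as np

def warp_with_transform(image: np.ndarray, H: np.ndarray) -> np.ndarray:
    h, w = image.shape[:2]
    # Apply the 3x3 transformation matrix to every pixel of the input image.
    return cv2.warpPerspective(image, H, (w, h))
```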
  • The above steps further include: obtaining a target image based on the background of the first image, the background of the second image, and the third image, where the target image includes the photographed object in the third image.
  • the target image also includes part or all of the background of the first image (also referred to as the first background image) and part or all of the background of the second image (also referred to as the second background image).
  • the first image includes the target person and the first background image
  • the second image includes the target person and the second background image
  • the above-mentioned target person is equivalent to the aforementioned shooting object
  • the first background image can be understood as a background other than the shooting object in the first image
  • the second background image can be understood as the background other than the photographed object in the second image.
  • In this way, a target image corresponding to an intermediate viewing angle can be synthesized for the character images in the multiple original images, so as to meet the requirements of special effects such as slow motion or time pause.
  • the above steps further include: splicing the first background image and the second background image to obtain a target background image, and fusing the third image and the target background image to obtain the target image.
  • The synthesized intermediate-view image can be fused with the wide-view background image, so as to realize a seamless connection of the background from frame to frame, thus ensuring the output of panoramic highlight video clips.
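  • An illustrative sketch of this step, assuming OpenCV: the two background images are stitched into a target background, and the third (middle-view) image is then fused onto it with a simple foreground mask. The stitcher and the mask-based paste are assumptions standing in for the splicing and fusion operations named above.

```python
import cv2
import numpy as np

def compose_target_image(background1, background2, third_image, mask, top_left):
    # Splice the two background images into one target background image.
    stitcher = cv2.Stitcher_create()
    status, target_background = stitcher.stitch([background1, background2])
    if status != 0:  # 0 == cv2.Stitcher_OK
        raise RuntimeError("background stitching failed")

    # Fuse the third image onto the target background at the given position.
    target = target_background.copy()
    x, y = top_left
    h, w = third_image.shape[:2]
    roi = target[y:y + h, x:x + w]
    roi[mask > 0] = third_image[mask > 0]  # copy foreground pixels only
    return target
```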
  • The above step of obtaining the target image based on the background of the first image, the background of the second image, and the third image includes: separating the photographed object in the first image to obtain a first hollow image; filling the first hollow image based on the first image to obtain the background of the first image; separating the photographed object in the second image to obtain a second hollow image; filling the second hollow image based on the second image to obtain the background of the second image; and splicing the background of the first image, the background of the second image, and the third image to generate the target image.
  • The above steps further include: fusing the first image and the target background image to obtain a first target image; fusing the second image and the target background image to obtain a second target image; and compressing the first target image, the target image, and the second target image to obtain a target video.
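  • A hedged sketch of compressing the frame sequence (first target image, synthesized target image, second target image) into the target video using OpenCV's VideoWriter; the codec, frame rate, and file name are illustrative choices, not specified by the text.

```python
import cv2

def frames_to_video(frames, path="target_video.mp4", fps=30):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(frame)  # all frames must share the same size
    writer.release()
```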
  • the above steps further include: sending the target video to the first shooting device.
  • the above steps further include: sending the target video to the second shooting device.
  • After the target video is generated, it can be fed back to the first shooting device, so that the user can watch the panoramic highlight video (that is, the target video) through the first shooting device, thereby increasing the functionality and playability of the client device.
  • the above-mentioned first photographing device may also be called a first collecting device, and the second photographing device may also be called a second collecting device.
  • a second aspect of the embodiments of the present application provides a data processing apparatus, where the data processing apparatus may be a local device (for example, a mobile phone, a camera, etc.) or a cloud device.
  • the data processing device includes:
  • an acquisition unit configured to acquire a first image and a second image
  • the first image is an image collected from a first viewing angle
  • the second image is an image collected from a second viewing angle
  • the acquisition moment of the first image is the same as the acquisition moment of the second image
  • an acquisition unit further configured to acquire the relative pose between the first image and the second image
  • the generating unit is configured to generate a third image based on the first image, the second image and the relative pose, and the viewing angle of the third image is between the first viewing angle and the second viewing angle.
  • the above-mentioned first image and second image are images collected by the first acquisition device and the second acquisition device for the same photographing object at the same moment and from different viewing angles, And the first image and the second image have first overlapping content, and the first overlapping content includes the shooting object; optionally, the third image includes part or all of the first overlapping content, and the third image includes the subject.
  • the orientations of the photographed objects in the first image and the photographed objects in the second image overlap.
  • The above-mentioned relative pose can be understood as the relative pose between the first pose and the second pose, where the first pose is the pose of the first acquisition device when it collects the first image, and the second pose is the pose of the second acquisition device when it collects the second image.
  • the relative pose includes a first relative pose and a second relative pose, and the first relative pose is a pose of the first image relative to the second image.
  • the second relative pose is the pose of the second image relative to the first image;
  • the above-mentioned first relative pose is the pose of the first collection device at the first collection moment relative to the second collection device at the second collection moment.
  • the second relative pose is the pose of the second collection device at the second collection moment relative to the first collection device at the first collection moment.
  • The generation unit includes:
  • the optical flow calculation subunit is used to input the first image and the second image into the trained optical flow calculation network for optical flow calculation to obtain an initial optical flow map;
  • the first warping subunit is used to process the first image and the initial optical flow map by the forward warping method to obtain the first target optical flow image;
  • the first warping subunit is also used to process the second image and the initial optical flow map by the forward warping method to obtain the second target optical flow image;
  • the second warping subunit is used to process the first image and the first relative pose by the image warping method to obtain the first warped image
  • the second warping subunit is used to process the second image and the second relative pose by the image warping method to obtain the second warped image
  • the repairing subunit is used for inputting the first target optical flow image, the first warped image, the second target optical flow image, and the second warped image into the trained image inpainting network for image inpainting to obtain the third image.
  • The trained optical flow computing network and the trained image inpainting network in the data processing apparatus are obtained by taking the first training image and the second training image as inputs of the optical flow computing network and jointly training the optical flow computing network and the image inpainting network with the goal that the value of the loss function is less than the second threshold.
  • The loss function is used to indicate the difference between the image output by the image inpainting network and the third target image, where the third target image is an image collected at a viewing angle between the first target viewing angle corresponding to the first target image and the second target viewing angle corresponding to the second target image.
  • The relative pose in the above-mentioned data processing apparatus includes a transformation matrix, and the transformation matrix is used to describe the association relationship between the pixels of the first image and the second image. The first relative pose includes a first transformation matrix, which is a matrix of the first image relative to the second image; the second relative pose includes a second transformation matrix, which is a matrix of the second image relative to the first image.
  • the expression form of the above-mentioned relative pose is a transformation matrix.
  • the above-mentioned data processing device further includes: a splicing unit for obtaining a target image based on the background of the first image, the background of the second image and the third image,
  • the target image includes the subject in the third image.
  • the first image in the data processing apparatus includes the target person and the first background image
  • the second image includes the target person and the second background image
  • the above-mentioned target person is equivalent to the previous shooting object
  • the first background image can be understood as the background other than the shooting object in the first image
  • the second background image can be understood as the background other than the subject in the second image.
  • the above data processing apparatus further includes:
  • a splicing unit for splicing the first background image and the second background image to obtain the target background image
  • the fusion unit is used for fusing the third image and the target background image to obtain the target image.
  • The above is to obtain a target image based on the background of the first image, the background of the second image, and the third image, where the target image includes the photographed object in the third image.
  • The above-mentioned splicing unit is specifically configured to: separate the photographed object in the first image to obtain a first hollow image; fill the first hollow image based on the first image to obtain the background of the first image; separate the photographed object in the second image to obtain a second hollow image; fill the second hollow image based on the second image to obtain the background of the second image; and splice the background of the first image, the background of the second image, and the third image to generate the target image.
  • The fusion unit in the above data processing apparatus is further configured to fuse the first image and the target background image to obtain the first target image, and to fuse the second image and the target background image to obtain the second target image.
  • the above-mentioned data processing device also includes:
  • the compression unit is used for compressing the first target image, the target image and the second target image to obtain the target video.
  • the above data processing apparatus further includes:
  • the sending unit is used for sending the target video to the first shooting device.
  • a third aspect of the embodiments of the present application provides a data processing apparatus, where the data processing apparatus may be a mobile phone or a video camera. It may also be a cloud device (such as a server, etc.), and the data processing apparatus executes the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • A fourth aspect of the embodiments of the present application provides a chip, where the chip includes a processor and a communication interface, the communication interface is coupled to the processor, and the processor is configured to run a computer program or instruction, so that the chip implements the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • A fifth aspect of the embodiments of the present application provides a computer-readable storage medium, where an instruction is stored in the computer-readable storage medium, and when the instruction is run on a computer, the computer is caused to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • a sixth aspect of the embodiments of the present application provides a computer program product, which, when executed on a computer, enables the computer to execute the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • A seventh aspect of the embodiments of the present application provides a data processing apparatus, including a processor, where the processor is coupled to a memory, the memory is used to store programs or instructions, and when the programs or instructions are executed by the processor, the data processing apparatus implements the method in the foregoing first aspect or any possible implementation manner of the first aspect.
  • a third image is generated based on the first image, the second image, and the relative pose between the first image and the second image, and the perspective of the third image is between the first viewing angle and the second viewing angle.
  • the present application can synthesize an image of a middle perspective by using the existing two perspective images and relative poses, so as to improve the fineness of the output effect.
  • FIG. 1 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a positional relationship between a primary device and a secondary device in an embodiment of the application;
  • FIG. 3 is a schematic structural diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a convolutional neural network according to an embodiment of the present invention.
  • FIG. 5 is a schematic structural diagram of another convolutional neural network provided by an embodiment of the present invention.
  • FIG. 6 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of feature points in a first image and a second image provided by an embodiment of the present application.
  • FIG. 8 is a schematic diagram of a matching pair between a first image and a second image provided by an embodiment of the present application.
  • FIG. 9 is a schematic diagram of acquiring a third image according to an embodiment of the present application.
  • FIG. 10 is another schematic flowchart of the data processing method provided by the embodiment of the present application.
  • FIG. 11 is a schematic diagram of a first original image and a first character image provided by an embodiment of the present application.
  • FIG. 12 is a schematic diagram of a matching pair between a first original image and a second original image provided by an embodiment of the present application
  • FIG. 13 is another schematic diagram of acquiring a third image provided by an embodiment of the present application.
  • FIG. 14 is another schematic diagram of acquiring two third images according to an embodiment of the present application.
  • FIG. 15 is a schematic diagram of an original image and a background image provided by an embodiment of the present application.
  • FIG. 16 is a schematic diagram of a target background image provided by an embodiment of the present application.
  • FIG. 17 is another schematic diagram of a target background image provided by an embodiment of the present application.
  • FIG. 18 is another schematic diagram of a target image provided by an embodiment of the present application.
  • FIG. 19 is a schematic diagram of a target video provided by an embodiment of the application.
  • FIG. 20 is another schematic diagram of a target video provided by an embodiment of the present application.
  • FIG. 21 is a schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 22 is another schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 23 is another schematic structural diagram of a data processing apparatus provided by an embodiment of the present application.
  • FIG. 24 is a schematic diagram of a chip hardware structure provided by an embodiment of the present application.
  • Figure 1 shows a schematic diagram of an application scenario, which can be applied to the field of image processing in the field of artificial intelligence.
  • the application scenario may include the cloud device 100 , the master device 101 , and the slave devices 102 to 104 that communicate with the master device 101 .
  • In FIG. 1, only one master device 101 and three slave devices 102 to 104 are used as an example for schematic illustration.
  • the application scenarios in the embodiments of the present application may have more primary devices and secondary devices, and the embodiments of the present application do not limit the number of primary devices and secondary devices.
  • The way in which each secondary device is connected to the cloud device can also differ: the multiple secondary devices 102 to 104 may be connected to the cloud device 100 through the main device 101, or the multiple secondary devices may be directly connected to the cloud device. This is not limited here.
  • The secondary devices 102 to 104 and the main device 101, or the main device 101 and the cloud device 100, are generally connected through a wireless network, but may also be connected through a wired network. If a wireless network is used, the specific connection form can be a cellular wireless network, a WiFi network, or another type of wireless network. If a wired network is used, the typical connection form is an optical fiber network.
  • the main function of the main device 101 and the sub-devices 102 to 104 is to capture images. Further, the main device 101 and the sub-devices 102 to 104 can also be used to capture a 3D scene.
  • The positional relationship between the primary device 101 and the secondary devices 102 to 104 may be a ring deployment (for example, as shown in FIG. 2, where the ring deployment shown in FIG. 2 has 1 primary device and 5 secondary devices; the specific number of devices is just an example), a spherical deployment, a cube deployment, and so on.
  • The specific positional relationship between the primary device and the secondary devices is not limited here.
  • the angle between two adjacent devices in the primary device 101 and the secondary devices 102 to 104 is less than or equal to a certain threshold.
  • the master device 101 may control the slave devices 102 to 104 to trigger simultaneous shooting, and then the slave devices 102 to 104 transmit the acquired images at the same time to the master device 101 .
  • the main device 101 can process multiple images using algorithms to obtain data such as target images or target videos.
  • the main device 101 may also send data such as target images or target videos to the sub-devices 102 to 104 .
  • the master device 101 may control the slave devices 102 to 104 to trigger simultaneous shooting, and then the slave devices 102 to 104 transmit the acquired images at the same time to the master device 101 .
  • the main device 101 can upload multiple images to the cloud device 100, and the cloud device 100 uses an algorithm to process the multiple images to obtain data such as target images or target videos.
  • the cloud device 100 may also send data such as target images or target videos to the main device 101 .
  • The main device 101 can also send data such as target images or target videos to the sub-devices 102 to 104, thereby completing the process from acquisition to final effect presentation.
  • The primary device or the secondary device is a device with a shooting function, which may be a video camera, a camera, a mobile phone, a tablet computer (Pad), an augmented reality (AR) terminal device, a wearable terminal device, or the like.
  • The embodiments of the present application can be applied not only in the field of image processing within artificial intelligence, but also in other scenarios that require intermediate-perspective synthesis, such as movies and TV (for example, bullet time in The Matrix), live sports events (for example, Intel TrueView), or the 3D perspectives used by real estate trading platforms.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes inputs x_s and an intercept 1 as input, and the output of the operation unit can be: h_{W,b}(x) = f(W^T x) = f(sum_s(W_s * x_s) + b), where W_s is the weight of x_s and b is the bias of the neural unit.
  • f is an activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal in the neural unit into an output signal. The output signal of this activation function can be used as the input of the next convolutional layer.
  • the activation function can be a sigmoid function.
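  • As a small worked example of such a neural unit with a sigmoid activation (the use of NumPy and all names are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def neural_unit(x, W, b):
    # output = f(sum_s W_s * x_s + b), with f chosen as the sigmoid function
    return sigmoid(np.dot(W, x) + b)
```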
  • a neural network is a network formed by connecting many of the above single neural units together, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field, and the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be divided into three categories of layers: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the middle layers are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • The coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W_jk^L.
  • Note that the input layer does not have a W parameter.
  • more hidden layers allow the network to better capture the complexities of the real world.
  • a model with more parameters is more complex and has a larger "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (that is, the weight matrices formed by the weights of every layer).
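  • The following sketch illustrates a forward pass through such a fully connected network using the notation above; weights[L][j][k] plays the role of W_jk^L, and the choice of tanh as activation is an assumption for illustration only.

```python
import numpy as np

def dnn_forward(x, weights, biases, activation=np.tanh):
    a = x
    for W, b in zip(weights, biases):  # one (W, b) pair per non-input layer
        # Fully connected: every neuron of layer L-1 feeds every neuron of layer L.
        a = activation(W @ a + b)
    return a
```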
  • Convolutional neural network is a deep neural network with a convolutional structure.
  • a convolutional neural network consists of a feature extractor consisting of convolutional and subsampling layers.
  • the feature extractor can be viewed as a filter, and the convolution process can be viewed as convolution with an input image or a convolutional feature map using a trainable filter.
  • the convolutional layer refers to the neuron layer in the convolutional neural network that convolves the input signal.
  • a neuron can only be connected to some of its neighbors.
  • a convolutional layer usually contains several feature planes, and each feature plane can be composed of some neural units arranged in a rectangle.
  • Neural units in the same feature plane share weights, and the shared weights here are convolution kernels.
  • Weight sharing can be understood as meaning that the way image information is extracted is independent of location. The underlying principle is that the statistics of one part of the image are the same as those of the other parts, which means that image information learned in one part can also be used in another part. Therefore, the same learned image information can be used for all positions on the image.
  • multiple convolution kernels can be used to extract different image information. Generally, the more convolution kernels, the richer the image information reflected by the convolution operation.
  • the convolution kernel can be initialized in the form of a matrix of random size, and the convolution kernel can obtain reasonable weights by learning during the training process of the convolutional neural network.
  • the immediate benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network, while reducing the risk of overfitting.
  • The convolutional neural network can use the error back propagation (BP) algorithm to adjust the parameters in the initial super-resolution model during training, so that the reconstruction error loss of the super-resolution model becomes smaller and smaller. Specifically, forward propagation of the input signal up to the output produces an error loss, and the parameters in the initial super-resolution model are updated by back-propagating the error loss information, so that the error loss converges.
  • the back-propagation algorithm is a back-propagation motion dominated by the error loss, aiming to obtain the parameters of the optimal super-resolution model, such as the weight matrix.
  • the pixel value of the image can be a red-green-blue (RGB) color value, and the pixel value can be a long integer representing the color.
  • the pixel value is 256*Red + 100*Green + 76*Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. In each color component, the smaller the value, the lower the brightness, and the larger the value, the higher the brightness.
  • the pixel values can be grayscale values.
  • an embodiment of the present invention provides a system architecture 100 .
  • the data collection device 160 is used to collect training data
  • the training data in this embodiment of the present application includes: a first training image and a second training image.
  • the first training image may be the first image
  • the second training image may be the second image. It can also be understood that the first training image and the first image are images collected from the same viewing angle for the same scene, and the second training image and the second image are images collected from the same scene from another viewing angle.
  • the training data is stored in the database 130 , and the training device 120 obtains the target model/rule 101 through training based on the training data maintained in the database 130 .
  • the first embodiment will be used to describe in more detail how the training device 120 obtains the target model/rule 101 based on the training data.
  • The target model/rule 101 can be used to implement the data processing method provided by the embodiments of the present application: the two images of different perspectives are input into the target model/rule 101 after relevant preprocessing, and the image of the middle perspective can then be obtained.
  • The target model/rule 101 in this embodiment of the application may specifically be an optical flow computing network and/or an image inpainting network. In the embodiments provided in this application, the optical flow computing network and/or the image inpainting network are obtained by training with the first training image and the second training image.
  • the training data maintained in the database 130 may not necessarily come from the collection of the data collection device 160, and may also be received from other devices.
  • The training device 120 may not necessarily train the target model/rule 101 completely based on the training data maintained by the database 130, and may also obtain training data from the cloud or other places for model training. The above description should not be construed as a limitation on the embodiments of this application.
  • the target model/rule 101 trained according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 3 , the execution device 110 can be a terminal, such as a mobile phone terminal, a tablet computer, Notebook computer, AR/VR, vehicle terminal, etc., it can also be a server or cloud, etc.
  • the execution device 110 is configured with an I/O interface 112, which is used for data interaction with external devices.
  • The user can input data to the I/O interface 112 through the client device 140. In the embodiments of the present application, the input data may include the first image and the second image, which may be input by the user or uploaded by the user through a photographing device, and may of course also come from a database; this is not specifically limited here.
  • the preprocessing module 113 is configured to perform preprocessing according to the input data (such as the first image and the second image) received by the I/O interface 112.
  • For example, the preprocessing module 113 may be configured to perform operations such as size cropping of the data (for example, when the size specifications of the first image and the second image output by each slave device or the master device are inconsistent, the preprocessing module 113 can also be used to normalize the first image and the second image).
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculations and other related processing, the execution device 110 can call data, code, and the like in the data storage system 150 for the corresponding processing, and the data and instructions obtained by the corresponding processing may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the third image obtained as described above, to the client device 140 so as to be provided to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or tasks, and the corresponding target models/rules 101 can be used to achieve the above goals or complete The above task, thus providing the user with the desired result.
  • the user can manually specify input data, which can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the user's authorization is required to request the client device 140 to automatically send the input data, the user can set the corresponding permission in the client device 140 .
  • the user can view the result output by the execution device 110 on the client device 140, and the specific presentation form can be a specific manner such as display, sound, and action.
  • the client device 140 can also act as a data collection terminal to collect the input data of the input I/O interface 112 and the output result of the output I/O interface 112 as new sample data as shown in the figure, and store them in the database 130 .
  • Alternatively, the I/O interface 112 directly uses the input data input into the I/O interface 112 and the output result of the I/O interface 112, as shown in the figure, as new sample data and stores them in the database 130.
  • FIG. 3 is only a schematic diagram of a system architecture provided by an embodiment of the present invention, and the positional relationship between the devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • the data storage system 150 is an external memory relative to the execution device 110 , and in other cases, the data storage system 150 may also be placed in the execution device 110 .
  • a target model/rule 101 is obtained by training according to the training device 120.
  • The target model/rule 101 may be an optical flow computing network and/or an image restoration network.
  • In this embodiment of the present application, both the optical flow computing network and the image inpainting network can be convolutional neural networks.
  • The execution device 110 in FIG. 3 may be the cloud device shown in FIG. 1, and the client device 140 may be the main device or the secondary device shown in FIG. 1; that is, in this case the method provided in this application is mainly applied to the cloud device.
  • Alternatively, the execution device 110 in FIG. 3 may be the master device shown in FIG. 1, and the client device 140 may be the slave device shown in FIG. 1; that is, in this case the method provided in this application is mainly applied to the main device.
  • A convolutional neural network is a deep neural network with a convolutional structure and is a deep learning architecture, which refers to learning at multiple levels of abstraction.
  • CNN is a feed-forward artificial neural network in which individual neurons can respond to images fed into it.
  • a convolutional neural network (CNN) 100 may include an input layer 110 , a convolutional/pooling layer 120 , where the pooling layer is optional, and a neural network layer 130 .
  • As an example, the convolutional layer/pooling layer 120 may include layers 121 to 126. In one implementation, layer 121 is a convolutional layer, layer 122 is a pooling layer, layer 123 is a convolutional layer, layer 124 is a pooling layer, layer 125 is a convolutional layer, and layer 126 is a pooling layer; in another implementation, layers 121 and 122 are convolutional layers, layer 123 is a pooling layer, layers 124 and 125 are convolutional layers, and layer 126 is a pooling layer. That is, the output of a convolutional layer can be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
  • the convolution layer 121 may include many convolution operators, which are also called kernels, and their role in image processing is equivalent to a filter that extracts specific information from the input image matrix.
  • The convolution operator can essentially be a weight matrix, and this weight matrix is usually predefined. In the process of convolving an image, the weight matrix is usually moved along the horizontal direction of the input image one pixel at a time (or two pixels at a time, and so on, depending on the value of the stride), thereby completing the work of extracting specific features from the image.
  • The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension of the weight matrix is the same as the depth dimension of the input image; during the convolution operation, the weight matrix extends over the entire depth of the input image. Therefore, convolution with a single weight matrix produces a convolutional output with a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same dimension are applied.
  • the output of each weight matrix is stacked to form the depth dimension of the convolutional image.
  • Different weight matrices can be used to extract different features from the image. For example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a specific color of the image, and yet another weight matrix is used to blur unwanted noise in the image, and so on.
  • the dimensions of the multiple weight matrices are the same, and the dimension of the feature maps extracted from the weight matrices with the same dimensions are also the same, and then the multiple extracted feature maps with the same dimensions are combined to form the output of the convolution operation .
  • weight values in these weight matrices need to be obtained through a lot of training in practical applications, and each weight matrix formed by the weight values obtained by training can extract information from the input image, thereby helping the convolutional neural network 100 to make correct predictions.
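  • A rough illustration of a convolution with a shared weight matrix (convolution kernel) sliding over a single-channel image with a given stride; a real CNN layer would use many learned kernels over all input channels, so this is only a sketch.

```python
import numpy as np

def conv2d(image, kernel, stride=1):
    kh, kw = kernel.shape
    oh = (image.shape[0] - kh) // stride + 1
    ow = (image.shape[1] - kw) // stride + 1
    out = np.zeros((oh, ow))
    for i in range(oh):
        for j in range(ow):
            patch = image[i * stride:i * stride + kh, j * stride:j * stride + kw]
            out[i, j] = np.sum(patch * kernel)  # the same weights at every position
    return out
```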
  • When the convolutional neural network has multiple convolutional layers, the initial convolutional layer (for example, 121) tends to extract more general, lower-level features, while the features extracted by the later convolutional layers become more and more complex, such as high-level semantic features.
  • A pooling layer often follows a convolutional layer. That is, for the layers 121 to 126 exemplified by 120 in Figure 4, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers.
  • the pooling layer may include an average pooling operator and/or a max pooling operator for sampling the input image to obtain a smaller size image.
  • the average pooling operator can calculate the average value of the pixel values in the image within a certain range.
  • the max pooling operator can take the pixel with the largest value within a specific range as the result of max pooling. Also, just as the size of the weight matrix used in the convolutional layer should be related to the size of the image, the operators in the pooling layer should also be related to the size of the image.
  • the size of the output image after processing by the pooling layer can be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average or maximum value of the corresponding sub-region of the image input to the pooling layer.
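  • A small sketch of 2x2 max pooling and average pooling, where each output pixel summarizes the corresponding sub-region of the input so that the output image is smaller (the function name and NumPy usage are illustrative):

```python
import numpy as np

def pool2d(image, size=2, mode="max"):
    h, w = image.shape[0] // size, image.shape[1] // size
    blocks = image[:h * size, :w * size].reshape(h, size, w, size)
    return blocks.max(axis=(1, 3)) if mode == "max" else blocks.mean(axis=(1, 3))
```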
  • After being processed by the convolutional layer/pooling layer 120, the convolutional neural network 100 is not yet sufficient to output the required output information, because, as mentioned before, the convolutional layer/pooling layer 120 only extracts features and reduces the parameters brought by the input image. However, in order to generate the final output information (the required class information or other related information), the convolutional neural network 100 needs to use the neural network layer 130 to generate one output or a set of outputs of the required number of classes. Therefore, the neural network layer 130 may include multiple hidden layers (131, 132 to 13n as shown in FIG. 4) and an output layer 140. The parameters contained in the multiple hidden layers may be pre-trained based on relevant training data of a specific task type; for example, the task type can include image recognition, image classification, image super-resolution reconstruction, and so on.
  • After the multiple hidden layers in the neural network layer 130, the last layer of the entire convolutional neural network 100 is the output layer 140. The output layer 140 has a loss function similar to classification cross entropy, which is specifically used to calculate the prediction error. Once the forward propagation of the entire convolutional neural network 100 (as shown in FIG. 4, the propagation from 110 to 140 is forward propagation) is completed, back propagation (as shown in FIG. 4, the propagation from 140 to 110 is back propagation) starts to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network 100 and the error between the result output by the convolutional neural network 100 through the output layer and the ideal result.
  • the convolutional neural network 100 shown in FIG. 4 is only used as an example of a convolutional neural network.
  • In specific applications, the convolutional neural network may also exist in the form of other network models, for example, a network in which multiple convolutional layers/pooling layers are in parallel, as shown in FIG. 5, and the separately extracted features are all input to the neural network layer 130 for processing.
  • the algorithm processing of the CNN can be applied to the main device 101 or the cloud device 100 shown in FIG. 1 .
  • an embodiment of the data processing method in the embodiment of the present application includes:
  • the data processing apparatus acquires a first image and a second image.
  • the data processing device is the main device in the scene shown in FIG. 1, and the first shooting device and the second shooting device are any two sub-devices in the scene shown in FIG. 1 as an example for schematic description.
  • the data processing apparatus may also be a cloud device in the scenario shown in FIG. 1
  • the first shooting device and the second shooting device may be a primary device or a secondary device. There is no specific limitation here.
  • the first image may be an image of the subject captured by the first shooting device from a first viewing angle
  • the second image may be an image of the subject captured by the second shooting device from a second viewing angle
  • The moment at which the first photographing device captures the first image is the same as the moment at which the second photographing device captures the second image (or the time interval between the moment of capturing the first image and the moment of capturing the second image is less than or equal to a preset threshold). That is, the first image and the second image are images obtained by multiple photographing devices at the same time and from different viewing angles.
  • the above-mentioned photographing device may also be referred to as a collecting device.
  • the first image and the second image are images captured by the first photographing device and the second photographing device on the same photographed object at the same time and from different viewing angles, and the first image and the second image have overlapping content.
  • the photographing objects may refer to objects such as people, animals, objects, etc., which are not specifically limited here.
  • The existence of overlapping content between the first image and the second image can be understood as part of the picture content of the first image and the second image being the same; for example, the proportion of overlapping content (or overlapping area) between the first image and the second image is greater than or equal to a certain threshold (for example, 20%).
  • the existence of overlapping content between the first image and the second image can also be understood that the picture content of the first image and the second image have the same photographing object.
  • the orientations of the photographed objects in the first image and the second image overlap.
  • the difference between the first distance and the second distance is less than or equal to a certain preset threshold, where the first distance is the distance between the first photographing device and the reference point when the first photographing device collects the first image, and the second The distance is the distance between the second photographing device and the reference point when the second photographing device collects the second image.
  • the reference point may refer to a certain position where the person photographing the object is located, for example, the photographing object is a person, and the reference point may be the position where the person is located, such as the middle position of the stage.
  • the position at which the first photographing device collects the first image and the position at which the second photographing device collects the second image are located on the same arc, with the photographed object on the inner side of the arc.
  • The overlapping angle of the fields of view of the first image and the second image is greater than a certain threshold (for example, the overlapping angle of the first viewing angle and the second viewing angle is greater than 30 degrees); and/or the difference in rotation angle between the photographing devices that capture the two images is smaller than a preset angle.
  • the rotation angle may be an angle value rotated by a horizontal angle of the photographing device, or may be an angle value rotated by a top-down angle of the photographing device.
  • The moment when the first photographing device collects the first image is considered the same as the moment when the second photographing device collects the second image when the time interval between the two collection moments is less than or equal to a preset threshold.
  • the preset threshold is set according to actual needs, which is not specifically limited here.
  • After the first photographing device and the second photographing device collect the first image and the second image, the first image and the second image are sent to the data processing apparatus.
  • the data processing apparatus acquires the relative pose between the first image and the second image.
  • the relative pose includes a first relative pose and a second relative pose
  • the first relative pose is the pose of the first image relative to the second image, and the second relative pose is the pose of the second image relative to the first image.
  • the first pose is the pose when the first photographing device collects the first image
  • the second pose is the pose when the second photographing device collects the second image. That is, the pose of the second image refers to the pose when the second photographing device captures the second image.
  • the pose of the first image refers to the pose when the first photographing device captures the first image.
  • the relative pose between the first image and the second image refers to the relative pose between the first pose and the second pose.
  • the relative pose between the first image and the second image described in the embodiments of the present application is essentially the relative pose between the first pose and the second pose, where the first pose is the pose when the first capture device collects the first image, and the second pose is the pose when the second capture device collects the second image.
  • the relative pose in the embodiments of the present application may include parameters such as a fundamental matrix or a transformation matrix (H); it can also be understood that parameters such as a fundamental matrix or a transformation matrix may be used to describe the relative pose. That is, if a transformation matrix is used to describe the relative pose, the transformation matrix includes a first transformation matrix and a second transformation matrix, where the first transformation matrix is the matrix of the first image relative to the second image, and the second transformation matrix is the matrix of the second image relative to the first image.
  • the data processing apparatus can estimate the relative pose between the first image and the second image by means of feature point extraction and SFM.
  • SIFT scale-invariant feature transform
  • ANN approximate nearest neighbor algorithm
  • RANSAC Random Sample Consensus
  • the RANSAC algorithm can effectively eliminate the deviation caused by the error points to the model parameters, and the transformation matrix obtained by the RANSAC algorithm and the eight-point method is more accurate.
  • the data processing apparatus first obtains the SIFT feature points of the first image and the second image, and then obtains the remaining matching pairs as shown in FIG. 8 by matching with the ANN method. Then use RANSAC and the eight-point method to estimate the transformation matrix for the remaining matching pairs, so as to obtain the relative pose (ie, RT matrix) between the first photographing device and the second photographing device.
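  • The following is a minimal sketch of this feature-matching pipeline, using OpenCV as a stand-in for the SIFT extraction, ANN matching, and RANSAC-based estimation steps described above. The helper name and the threshold values are illustrative assumptions and are not taken from the patent.

```python
# Minimal sketch of the SIFT + ANN matching + RANSAC pipeline using OpenCV.
# The helper name and parameter values are illustrative assumptions.
import cv2
import numpy as np

def estimate_relative_pose(img1_gray, img2_gray):
    sift = cv2.SIFT_create()
    kp1, des1 = sift.detectAndCompute(img1_gray, None)
    kp2, des2 = sift.detectAndCompute(img2_gray, None)

    # Approximate nearest neighbor (ANN) matching via FLANN, with Lowe's ratio test.
    flann = cv2.FlannBasedMatcher(dict(algorithm=1, trees=5), dict(checks=50))
    matches = flann.knnMatch(des1, des2, k=2)
    good = [m for m, n in matches if m.distance < 0.7 * n.distance]

    pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

    # RANSAC-based fundamental matrix estimation rejects outlier matches.
    F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)
    # A 3x3 transformation matrix H can be estimated the same way if the relative
    # pose is to be expressed as a homography.
    H, _ = cv2.findHomography(pts1, pts2, cv2.RANSAC, 3.0)
    return F, H, inlier_mask
```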
  • the data processing apparatus generates a third image based on the first image, the second image, and the relative pose.
  • After the data processing device acquires the first image, the second image and the relative pose, it can input the first image and the second image into the trained optical flow computation network to perform optical flow computation and obtain an initial optical flow map (for example, the initial optical flow map shown in FIG. 9).
  • the initial optical flow map may be used to describe the displacement process of the pixel points, and the initial optical flow map is consistent with the size of the first image and the second image.
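  • The patent computes the initial optical flow map with a trained CNN; as a hedged illustration only, the sketch below uses a classical dense optical flow method (Farneback) as a stand-in to show what the initial optical flow map contains: a per-pixel displacement of the same size as the input images.

```python
# Stand-in for the trained optical flow network: classical dense optical flow.
import cv2

def initial_optical_flow(img1_gray, img2_gray):
    flow = cv2.calcOpticalFlowFarneback(
        img1_gray, img2_gray, None,
        pyr_scale=0.5, levels=3, winsize=15,
        iterations=3, poly_n=5, poly_sigma=1.2, flags=0)
    return flow  # shape (H, W, 2): x and y displacement (tx, ty) per pixel
```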
  • the data processing apparatus may process the first image and the initial optical flow map through a forward warping method to obtain a first target optical flow image (for example, I1 shown in FIG. 9).
  • the second image and the initial optical flow map are processed by the forward warping method to obtain the second target optical flow image (for example, I2 shown in FIG. 9). It can be understood that, through the initial optical flow map, the moving direction and distance of each pixel in the first image between the first viewing angle of the first image and the second viewing angle of the second image are obtained. Therefore, more optical flow information of the pixels can be provided, so that the pixels in the subsequently generated third image are smoother.
  • one or more target optical flow images can be generated by forward warping; the above-mentioned first target optical flow image and second target optical flow image are only examples, and the specific number of target optical flow images is not limited here.
  • the first target optical flow image can be obtained by the following first conversion formula: (x2, y2) = (x1 + tx, y1 + ty).
  • x1 and y1 represent the coordinates of a certain pixel point P in the first image (also called the old coordinates of point P), and tx and ty represent the distances that point P at its old coordinates (x1, y1) is moved by the optical flow in the X-axis direction and the Y-axis direction, respectively. Because the size of the first image is the same as the size of the initial optical flow map, (x1, y1) and (tx, ty) are in one-to-one correspondence.
  • the operation on each pixel in the first image is similar to the above operation on pixel P. After the first image is mapped to the first target optical flow image, the pixel points are assigned values, and the value of each pixel in the first target optical flow image is determined by interpolation (such as nearest neighbor interpolation, bilinear interpolation, or bicubic interpolation) during the assignment process, thereby generating the first target optical flow image.
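  • The following is a minimal sketch of the forward warping step described by the first conversion formula: each pixel is moved by the displacement taken from the initial optical flow map, and the holes left behind are filled by interpolation. The hole-filling strategy used here (nearest filled neighbor) is one of the interpolation options mentioned above and is an illustrative choice.

```python
# Minimal forward warping sketch: scatter pixels by the optical flow, then fill holes.
import numpy as np
from scipy.ndimage import distance_transform_edt

def forward_warp(image, flow):
    h, w = image.shape[:2]
    ys, xs = np.mgrid[0:h, 0:w]
    new_x = np.round(xs + flow[..., 0]).astype(int)   # x2 = x1 + tx
    new_y = np.round(ys + flow[..., 1]).astype(int)   # y2 = y1 + ty

    valid = (new_x >= 0) & (new_x < w) & (new_y >= 0) & (new_y < h)
    warped = np.zeros_like(image)
    filled = np.zeros((h, w), dtype=bool)
    warped[new_y[valid], new_x[valid]] = image[ys[valid], xs[valid]]
    filled[new_y[valid], new_x[valid]] = True

    # Fill remaining holes with the value of the nearest filled pixel.
    if not filled.all():
        _, (iy, ix) = distance_transform_edt(~filled, return_indices=True)
        warped = warped[iy, ix]
    return warped
```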
  • the data processing apparatus may further process the first image and the first relative pose by using an image warping method to obtain a first warped image (for example, I0 shown in FIG. 9 ).
  • the second image and the second relative pose are processed by the image warping method to obtain a second warped image (for example, I3 shown in FIG. 9 ). It can be understood that since the first distorted image and the second distorted image obtained by using the relative pose can provide more image texture information for the subsequent image inpainting network, it is convenient for the image inpainting network to deal with more defects.
  • one or more warped images can be generated according to image warping, the above-mentioned first warped image and second warped image are just examples, and the specific number of warped images is not limited here.
  • the first relative pose is a first transformation matrix (i.e., H is a 3×3 matrix), and the first warped image can be obtained by the following second transformation formula: x' = H·x, where x and x' are pixel coordinates in homogeneous form.
  • x in the above-mentioned second transformation formula is the old coordinate of a certain pixel point Q in the first image
  • H is the transformation matrix obtained before (which can be used to describe the relative pose)
  • x' is the corresponding coordinate of the pixel point in the first warped image.
  • the last element h33 of the H matrix in the above second transformation formula is always 1.
  • the operation on each pixel in the first image is similar to the above operation on pixel Q. After the first image is mapped to the first warped image, a value is assigned to each pixel point, and the value of each pixel in the first warped image is determined by interpolation (such as nearest neighbor interpolation, bilinear interpolation, or bicubic interpolation) during the assignment process, thereby generating the first warped image.
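  • The following is a minimal sketch of the image warping step: applying the 3×3 transformation matrix H of the second transformation formula to the whole first image. OpenCV's warpPerspective performs the per-pixel mapping and the bilinear interpolation mentioned above; this is a stand-in rather than the patent's own implementation.

```python
# Apply the 3x3 transformation matrix H (x' = H.x in homogeneous coordinates)
# to the whole image, with bilinear interpolation.
import cv2

def warp_with_pose(image, H):
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h), flags=cv2.INTER_LINEAR)
```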
  • the data processing device inputs the first target optical flow image, the first warped image, the second target optical flow image and the second warped image into the trained image inpainting network to perform image inpainting, so as to obtain a third image corresponding to an intermediate viewing angle (i.e., the third viewing angle) between the first viewing angle and the second viewing angle.
  • both the optical flow estimation network and the image inpainting network use a CNN based on the U-Net structure.
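  • The patent only states that the two networks are CNNs based on the U-Net structure; the sketch below is an illustrative miniature U-Net in PyTorch whose depth and channel sizes are assumptions, shown only to make the structure of the inpainting step concrete.

```python
# Illustrative miniature U-Net; layer sizes are assumptions, not the patent's networks.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=12, out_ch=3):  # 4 RGB reference images -> 12 input channels
        super().__init__()
        def block(ci, co):
            return nn.Sequential(nn.Conv2d(ci, co, 3, padding=1), nn.ReLU(inplace=True),
                                 nn.Conv2d(co, co, 3, padding=1), nn.ReLU(inplace=True))
        self.enc1 = block(in_ch, 32)
        self.enc2 = block(32, 64)
        self.pool = nn.MaxPool2d(2)
        self.up = nn.Upsample(scale_factor=2, mode='bilinear', align_corners=False)
        self.dec1 = block(64 + 32, 32)   # skip connection from enc1
        self.out = nn.Conv2d(32, out_ch, 1)

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(self.pool(e1))
        d1 = self.dec1(torch.cat([self.up(e2), e1], dim=1))
        return self.out(d1)

# Usage: the two warped images and the two target optical flow images are concatenated
# along the channel dimension and fed to the inpainting network:
# third_image = TinyUNet()(torch.cat([i0, i1, i2, i3], dim=1))
```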
  • the intermediate viewing angle can be understood as follows: after the first plane normal vector of the first image and the second plane normal vector of the second image are translated to a common origin, the image corresponding to any ray between the two plane normal vectors can be called a third image corresponding to a third viewing angle between the first viewing angle and the second viewing angle.
  • since the first warped image and the second warped image are obtained from the relative pose, their features are comprehensively covered, while the first target optical flow image and the second target optical flow image are obtained from the optical flow information, so their detailed features (i.e., optical flow features) are relatively distinct. Combining the first warped image and the second warped image, whose features are relatively complete, with the first target optical flow image and the second target optical flow image, whose detailed features are obvious, allows the information in these images to complement each other, which helps the intermediate-view image generated by the subsequent image inpainting network to have more features and details.
  • the number of third images is set according to actual needs, which is not specifically limited here.
  • a third image corresponding to the third viewing angle shown in FIG. 9 is obtained through step 603 .
  • the above-mentioned trained optical flow computation network and trained image inpainting network are obtained by jointly training the optical flow computation network and the image inpainting network, using the first training image and the second training image as the input of the optical flow computation network and taking a loss function value smaller than the second threshold as the training objective.
  • the loss function is used to indicate the difference between the image output by the image inpainting network and the third target image.
  • the third target image is an image collected at a viewing angle between the first target viewing angle corresponding to the first target image and the second target viewing angle corresponding to the second target image.
  • the joint training of the optical flow computation network and the image inpainting network means training the two networks as a whole network; it can also be understood that, compared with the two intermediate target optical flow images, joint training pays more attention to the quality of the third image output by the overall network.
  • the optical flow computation network and the image inpainting network are trained end to end as a whole. The training data set mainly consists of groups of three images: a left image (i.e., the first training image), a right image (the second training image) and a middle image (the third target image). The left and right images are used as input, and the middle image is used to supervise the end-to-end learning of the entire network.
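  • The following is a minimal sketch of this end-to-end joint training loop. The dataset is assumed to yield (left, right, middle) triples; flow_net, inpaint_net and warp_refs are illustrative placeholders for the optical flow network, the image inpainting network and the intermediate warping steps, not components defined by the patent.

```python
# Minimal joint (end-to-end) training sketch with illustrative placeholders.
import torch
import torch.nn as nn

def train_jointly(flow_net, inpaint_net, warp_refs, loader, epochs=10, threshold=0.01):
    params = list(flow_net.parameters()) + list(inpaint_net.parameters())
    optimizer = torch.optim.Adam(params, lr=1e-4)
    l1 = nn.L1Loss()  # measures the difference to the middle (target) image

    for _ in range(epochs):
        for left, right, middle in loader:
            flow = flow_net(left, right)           # initial optical flow map
            refs = warp_refs(left, right, flow)    # warped images + target optical flow images
            pred_middle = inpaint_net(refs)        # predicted third image
            loss = l1(pred_middle, middle)

            optimizer.zero_grad()
            loss.backward()                        # gradients flow through both networks
            optimizer.step()
            if loss.item() < threshold:            # loss value smaller than the second threshold
                return
```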
  • in a specific implementation, one third image may be obtained, or multiple third images may be obtained. In practical applications, whether one or multiple third images are generated can be adjusted according to the number of inputs and outputs during training, which is not specifically limited here.
  • the third image may be sent to the first photographing device and/or the second photographing device, so that a user using the first photographing device and/or the second photographing device may view the third image.
  • a third image is generated based on the first image, the second image, and the relative pose between the first image and the second image, and the viewing angle of the third image is between the first viewing angle and the second viewing angle. Images of other viewing angles are synthesized from the existing view images and the relative pose, improving the fineness of the output effect.
  • information complementarity between the first warped image and the second warped image on the one hand, and the first target optical flow image and the second target optical flow image on the other, helps the intermediate-view image generated by the subsequent image inpainting network to have more features and details, makes it easier for the image inpainting network to deal with more defects, and makes the resulting third image smoother.
  • Panoramic highlight moments are a special effect that uses computer vision technology to achieve enhanced slow motion and time pause. This technology is used in fields such as film and television (e.g., bullet time in The Matrix) and live sports events (e.g., Intel TrueView).
  • at present, the way to obtain panoramic highlight moments is to select a venue (such as a basketball court) in advance, set up multiple high-definition cameras at fixed positions around the venue, and use a large number of expensive high-definition cameras to focus on one scene synchronously. A 3D modeling method is then used to recreate a full-size 3D character image (such as a basketball player). The scene and the 3D character image are rendered to obtain the panoramic highlight moment, which gives the audience a sense of shock and immersion that traditional live broadcasting cannot provide.
  • the high-definition camera in the above method needs to set a fixed position in advance. If the panoramic video needs to be obtained in other scenarios, the position of the high-definition camera needs to be reset, which makes the application scenario of the above method inflexible.
  • Embodiments of the present application provide a data processing method and related equipment. Can be used to generate intermediate view images.
  • the present application also provides a data processing method, which can acquire panoramic wonderful videos through a mobile device (eg, a mobile phone).
  • another embodiment of the data processing method in the embodiment of the present application includes:
  • a data processing apparatus acquires a first image and a second image.
  • the data processing apparatus in this embodiment of the present application may be the main device 101 or the cloud device 100 in the scenario shown in FIG. 1 , which is not specifically limited here.
  • the first image and the second image may be acquired directly by a photographing device, or may be obtained by processing images collected by the photographing device. That is, the first image is obtained by processing the images collected by the first photographing device under the first viewing angle, and the second image is obtained by processing the images collected by the second photographing device at the second viewing angle.
  • the first image and the second image are images captured by the first photographing device and the second photographing device for the same photographing object at the same time and from different viewing angles, and the first image and the second image have overlapping content.
  • the data processing apparatus in this embodiment of the present application may be the first shooting device, the second shooting device, a target shooting device connected to the first shooting device and the second shooting device (that is, the main device 101 in the scenario shown in FIG. 1), or a cloud device, which is not limited here.
  • the first photographing device captures a first original image from a first perspective, where the first original image includes a target person and a first background other than the target person.
  • the second photographing device captures a second original image from a second perspective, where the second original image includes a target person and a second background other than the target person.
  • the target person is equivalent to the subject in front.
  • the data processing device obtains the first image and the second image in a variety of ways, which are described below:
  • the data processing apparatus extracts the first image and the second image from the first original image and the second original image.
  • the data processing apparatus acquires the first original image collected by the first photographing device and the second original image collected by the second photographing device. And extract the first person image in the first original image and the second person image in the second original image, where both the first person image and the second person image include the target person.
  • the data processing device determines that the first person image is the first image, and the second person image is the second image.
  • the data processing apparatus may segment the first original image to obtain a first person image and a first background image.
  • the data processing device may divide the second original image to obtain a second person image and a second background image. And determine the first person image as the first image, and determine the second person image as the second image.
  • the data processing apparatus can also directly extract the first person image from the first original image, and the method used for extracting the first person image is not specifically limited here.
  • the data processing device may also first use a CNN-based portrait segmentation algorithm to segment the first original image and the second original image to obtain a first binary segmentation map and a second binary segmentation map, respectively, where the pixel value of the foreground region (the region of the target person) of each segmentation map is 1, and the pixel value of the background region (the region other than the target person) is 0.
  • a first person image is obtained according to the first image and the first binary segmentation map
  • a second person image is obtained according to the second image and the second binary segmentation map.
  • the data processing device further determines that the first person image is the first image, and the second person image is the second image.
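  • The following is a minimal sketch of how a binary segmentation map (1 for the target person, 0 for the background) can be turned into a person image and a background image with a person-shaped hole; the CNN-based portrait segmentation itself is assumed to be given and is not implemented here.

```python
# Split an original image into a person image and a hole image using a binary mask.
import numpy as np

def split_person_and_background(original, binary_mask):
    mask3 = binary_mask[..., None].astype(original.dtype)   # (H, W, 1), values 0 or 1
    person_image = original * mask3                         # keeps only the target person
    hole_image = original * (1 - mask3)                     # background with a person-shaped hole
    return person_image, hole_image
```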
  • the data processing apparatus acquires the first image and the second image from the first photographing device and the second photographing device.
  • the first photographing device extracts the first person image from the first original image and sends the first person image to the data processing apparatus; similarly, the second photographing device may extract the second person image from the second original image and then send the second person image to the data processing apparatus.
  • the data processing device determines that the first person image is the first image, and the second person image is the second image.
  • the data processing apparatus can acquire the first image and the second image in various manners, and the above two are just examples, and are not specifically limited here.
  • the data processing apparatus acquires the relative pose between the first image and the second image.
  • the relative pose between the first image and the second image can be estimated by means of feature point extraction and SFM.
  • Step 1002 in this embodiment is similar to the aforementioned step 602 in FIG. 6 , and details are not repeated here.
  • the data processing apparatus generates a third image based on the first image, the second image, and the relative pose.
  • the method for generating the third image in step 1003 in this embodiment is similar to the method for generating the third image in step 603 in the aforementioned FIG. 6. The following describes the method flow with reference to the accompanying drawings; for the specific principles and implementation, refer to the aforementioned step 603 in FIG. 6, which will not be repeated here.
  • the data processing device may process the first image and the initial optical flow map through the forward warping method to obtain the first target optical flow image (for example, I5 shown in FIG. 13).
  • the second image and the initial optical flow map are processed by the forward warping method to obtain the second target optical flow image (for example, I6 shown in FIG. 13). It can be understood that, through the initial optical flow map, the moving direction and distance of each pixel in the first image between the first viewing angle of the first image and the second viewing angle of the second image are obtained. Therefore, more optical flow information of the pixels can be provided, so that the pixels in the subsequently generated third image are smoother.
  • the data processing apparatus may also process the first image and the first relative pose through the image warping method to obtain a first warped image (for example, I4 shown in FIG. 13 ).
  • the second image and the second relative pose are processed by the image warping method to obtain a second warped image (for example, I7 shown in FIG. 13 ). It can be understood that since the first distorted image and the second distorted image obtained by using the relative pose can provide more image texture information for the subsequent image inpainting network, it is convenient for the image inpainting network to deal with more defects.
  • the data processing device inputs the first target optical flow image, the first warped image, the second target optical flow image and the second warped image into the trained image inpainting network to perform image inpainting, so as to obtain a third image corresponding to an intermediate viewing angle (i.e., the third viewing angle) between the first viewing angle and the second viewing angle.
  • the number of third images may be one or more (for example, as shown in FIG. 14, the number of third images is two), which is not specifically limited here.
  • the data processing apparatus splices the first background image and the second background image to obtain a target background image.
  • the data processing apparatus obtains a first background image after extracting the first person image from the first original image, and obtains a second background image after extracting the second person image from the second original image.
  • the first person image and the second person image have overlapping content, for example, the first person image and the second person image both have the same person.
  • the first original image in the above may also be understood as the first image
  • the second original image may be understood as the second image
  • the data processing apparatus may also directly extract the first background image from the first original image, and directly extract the second background image from the second original image.
  • the data processing apparatus can also segment the first original image to obtain a first hole image, and then fill the first hole image according to the first original image to obtain the first background image.
  • similarly, the data processing device may segment the second original image to obtain a second hole image, and then fill the second hole image according to the second original image to obtain the second background image.
  • the first hole image can also be understood as an image obtained after the region of the photographed object is removed or separated from the first original image.
  • when the data processing device fills the first hole image according to the first original image, the specific process of obtaining the first background image may also use a CNN to perform the background hole filling, which is not limited here.
  • the data processing device can directly stitch the first background image and the second background image to obtain the target background image.
  • specifically, SIFT feature points are extracted from the first background image and the second background image, feature point matching is then performed, and special processing (such as smoothing) is applied to the overlapping boundary of the first background image and the second background image, so that the first background image and the second background image are stitched into a target background image (as shown in FIG. 16).
  • the target background image may be obtained by splicing the first background image and the second background image with reference to the relative pose.
  • the spliced target background images are shown in FIG. 17 .
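  • As an illustration of the stitching step, the sketch below uses OpenCV's high-level Stitcher as a stand-in for the described SIFT-matching-plus-boundary-smoothing pipeline; the patent's own pipeline may differ in detail.

```python
# Stitch the two background images into one target background image (stand-in approach).
import cv2

def stitch_backgrounds(bg1, bg2):
    stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
    status, target_background = stitcher.stitch([bg1, bg2])
    if status != cv2.Stitcher_OK:
        raise RuntimeError(f"stitching failed with status {status}")
    return target_background
```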
  • the data processing apparatus fuses the third image and the target background image to obtain the target image.
  • after obtaining the target background image and the third image, the data processing device fuses the target background image and the third image to obtain the target image.
  • the data processing apparatus fuses the third image and the target background image to obtain the target image.
  • the data processing device fuses the third image into a certain area (for example, a central area) of the target background image by using Poisson Blending to obtain the target image, thereby achieving a more natural fusion effect, and the target An image is a frame in the output video.
  • the fusion uses Poisson fusion technology, that is, an object or an area in the third image is embedded into the target background image according to the gradient information of the third image and the boundary information of the target background image to generate a new image, namely the target image.
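  • The following is a minimal sketch of the Poisson fusion step using OpenCV's seamlessClone, which blends the third image into the target background image based on gradient and boundary information; the central placement is an illustrative choice matching the "central area" example above.

```python
# Poisson blending of the third image into the target background image.
import cv2
import numpy as np

def fuse_into_background(third_image, target_background):
    mask = 255 * np.ones(third_image.shape[:2], dtype=np.uint8)
    h, w = target_background.shape[:2]
    center = (w // 2, h // 2)  # e.g. the central area of the target background image
    return cv2.seamlessClone(third_image, target_background, mask, center, cv2.NORMAL_CLONE)
```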
  • the data processing device can also use Poisson fusion to fuse the first image and the target background image to generate the first target image, use Poisson fusion to fuse the second image and the target background image to generate the second target image, and then compress the first target image, the target image and the second target image to generate the target video.
  • the generated target video may be as shown in FIG. 19 .
  • the first frame of the target video is the first target image
  • the second frame of the target video is the target image
  • the third frame of the target video is the second target image.
  • the generated target video may be as shown in FIG. 20 .
  • the first frame of the target video is the first target image
  • the second and third frames of the target video are the target image
  • the fourth frame of the target video is the second target image.
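  • The following is a minimal sketch of composing the target video from the fused frames; the frame rate and codec are illustrative choices, and all frames are assumed to share the same resolution.

```python
# Compose the target video from the fused frames.
import cv2

def write_target_video(frames, path="target_video.mp4", fps=25):
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:   # e.g. [first_target_image, target_image, second_target_image]
        writer.write(frame)
    writer.release()
```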
  • the data processing apparatus may send the target video to the first shooting device and/or the second shooting device, so that users using the first shooting device and/or the second shooting device can watch the target video.
  • in this way, the reference images can complement each other, which makes it easier for the image inpainting network to handle more defects, making the resulting third image smoother.
  • a target video (or a panoramic highlight video) can be generated according to the first image, the second image, the third image and the target background image.
  • since the first photographing device, the second photographing device and the data processing device can be mobile phones, a moving mobile phone can be used to generate panoramic highlight moments (that is, target videos). Compared with panoramic highlight moments produced with fixed camera positions, this method is more flexible.
  • the embodiments of the present application further provide corresponding apparatuses, including corresponding modules for executing the foregoing embodiments.
  • the modules may be software, hardware, or a combination of software and hardware.
  • the data processing apparatus may be a local device (eg, a mobile phone, a camera, etc.) or a cloud device.
  • the data processing device includes:
  • the acquisition unit 2101 is used to acquire a first image and a second image, where the first image is an image collected from a first viewing angle, the second image is an image collected from a second viewing angle, and the collection moment of the first image is the same as the collection moment of the second image.
  • the first image and the second image are images collected by the first capture device and the second capture device for the same photographing object at the same time and from different viewing angles, and the first image and the second image have first overlapping content, the first overlapping content includes the photographing object;
  • the acquiring unit 2101 is further configured to acquire the relative pose between the first image and the second image.
  • the generating unit 2102 is configured to generate a third image based on the first image, the second image and the relative pose, and the perspective of the third image is between the first perspective and the second perspective.
  • the above-mentioned relative pose can be understood as the relative pose between the first pose and the second pose, wherein the first pose is the pose when the first acquisition device collects the first image; The two poses are the poses when the second acquisition device collects the second image.
  • the third image includes part or all of the first overlapping content, and the third image includes the photographed object.
  • the orientations of the photographed objects in the first image and the photographed objects in the second image overlap.
  • each unit in the data processing apparatus is similar to those described in the foregoing embodiments shown in FIG. 6 to FIG. 20 , and details are not repeated here.
  • the generating unit 2102 generates a third image based on the first image, the second image, and the relative pose between the first image and the second image, and the viewing angle of the third image is between the first viewing angle and the second viewing angle. Other view images are synthesized from the existing view images and the relative pose, improving the fineness of the output effect.
  • the data processing apparatus may be a local device (eg, a mobile phone, a camera, etc.) or a cloud device.
  • the data processing device includes:
  • the acquisition unit 2201 is used to acquire a first image and a second image, where the first image is an image collected from a first viewing angle, the second image is an image collected from a second viewing angle, and the collection moment of the first image is the same as the collection moment of the second image.
  • the acquiring unit 2201 is further configured to acquire the relative pose between the first image and the second image.
  • the generating unit 2202 is configured to generate a third image based on the first image, the second image and the relative pose, where the perspective of the third image is between the first perspective and the second perspective.
  • the above-mentioned generating unit 2202 also includes:
  • the optical flow calculation subunit 22021 is used to input the first image and the second image into the trained optical flow calculation network for optical flow calculation, and obtain the initial optical flow map.
  • the first warping subunit 22022 is used to process the first image and the initial optical flow map through the forward warping method to obtain the first target optical flow image.
  • the first warping subunit 22022 is further configured to process the second image and the initial optical flow map through the forward warping method to obtain the second target optical flow image.
  • the second warping subunit 22023 is configured to process the first image and the first relative pose through the image warping method to obtain the first warped image.
  • the second warping subunit 22023 is configured to process the second image and the second relative pose through the image warping method to obtain a second warped image.
  • the repairing subunit 22024 is configured to input the first target optical flow image, the first distorted image, the second target optical flow image, and the second distorted image into the trained image inpainting network for image repairing to obtain a third image.
  • the splicing unit 2203 is used for splicing the first background image and the second background image to obtain the target background image.
  • the fusion unit 2204 is configured to fuse the third image and the target background image to obtain the target image.
  • the fusion unit 2204 is further configured to fuse the first image and the target background image to obtain the first target image.
  • the fusion unit 2204 is further configured to fuse the second image and the target background image to obtain the second target image.
  • the compression unit 2205 is configured to compress the first target image, the target image and the second target image to obtain the target video.
  • the sending unit 2206 is configured to send the target video to a first photographing device, where the first photographing device is a device that captures the first image.
  • the first background image is the background other than the photographed object in the first image
  • the second background image is the background other than the photographed object in the second image
  • the compression unit 2205 is configured to generate a target video based on the first target image, the target image and the second target image.
  • each unit in the data processing apparatus is similar to those described in the foregoing embodiments shown in FIG. 6 to FIG. 20 , and details are not repeated here.
  • the generating unit 2102 can use the reference images I4 and I7 obtained from the relative pose together with the reference images I5 and I6 obtained from the optical flow information, so that the reference images complement each other, which makes it easier for the image inpainting network to handle more defects, resulting in a smoother third image.
  • the compression unit 2205 may generate a target video (which may also be a panoramic highlight video) according to the first image, the second image, the third image and the target background image.
  • since the first photographing device, the second photographing device and the data processing device can be mobile phones, a moving mobile phone can be used to generate panoramic highlight moments (that is, target videos). Compared with panoramic highlight moments produced with fixed camera positions, this method is more flexible.
  • the embodiment of the present application provides another data processing apparatus.
  • the data processing device can be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS), a vehicle-mounted computer, etc.
  • the data processing device is a mobile phone as an example:
  • FIG. 23 is a block diagram showing a partial structure of a mobile phone provided by an embodiment of the present application.
  • the mobile phone includes components such as a radio frequency (RF) circuit 2310, a memory 2320, an input unit 2330, a display unit 2340, a sensor 2350, an audio circuit 2360, a wireless fidelity (WiFi) module 2370, a processor 2380, and a camera 2390.
  • the RF circuit 2310 can be used for receiving and sending signals during the sending and receiving of information or during a call. In particular, after downlink information from the base station is received, it is passed to the processor 2380 for processing; in addition, designed uplink data is sent to the base station.
  • the RF circuit 2310 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 2310 can also communicate with networks and other devices via wireless communication.
  • the above-mentioned wireless communication can use any communication standard or protocol, including but not limited to the global system of mobile communication (global system of mobile communication, GSM), general packet radio service (general packet radio service, GPRS), code division multiple access (code division multiple access) multiple access, CDMA), wideband code division multiple access (wideband code division multiple access, WCDMA), long term evolution (long term evolution, LTE), email, short message service (short messaging service, SMS) and so on.
  • the memory 2320 can be used to store software programs and modules, and the processor 2380 executes various functional applications and data processing of the mobile phone by running the software programs and modules stored in the memory 2320 .
  • the memory 2320 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system and application programs required by at least one function (such as a sound playback function and an image playback function), and the data storage area may store data created according to the use of the mobile phone (such as audio data and a phone book).
  • the memory 2320 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the input unit 2330 can be used for receiving inputted numerical or character information, and generating key signal input related to user setting and function control of the mobile phone.
  • the input unit 2330 may include a touch panel 2331 and other input devices 2332 .
  • the touch panel 2331, also referred to as a touch screen, can collect the user's touch operations on or near it (such as operations performed by the user on or near the touch panel 2331 with a finger, a stylus, or any suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 2331 may include two parts, a touch detection device and a touch controller.
  • the touch detection device detects the user's touch orientation and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends the coordinates to the processor 2380, and can receive and execute commands sent by the processor 2380.
  • the touch panel 2331 can be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave types.
  • the input unit 2330 may further include other input devices 2332 .
  • other input devices 2332 may include, but are not limited to, one or more of physical keyboards, function keys (such as volume control keys, switch keys, etc.), trackballs, mice, joysticks, and the like.
  • the display unit 2340 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 2340 may include a display panel 2341.
  • the display panel 2341 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like.
  • the touch panel 2331 can cover the display panel 2341. When the touch panel 2331 detects a touch operation on or near it, it transmits the operation to the processor 2380 to determine the type of the touch event, and then the processor 2380 provides a corresponding visual output on the display panel 2341 according to the type of the touch event.
  • although the touch panel 2331 and the display panel 2341 are used as two independent components to implement the input and output functions of the mobile phone, in some embodiments, the touch panel 2331 and the display panel 2341 can be integrated to implement the input and output functions of the mobile phone.
  • the cell phone may also include at least one sensor 2350, such as light sensors, motion sensors, and other sensors.
  • the light sensor may include an ambient light sensor and a proximity sensor, where the ambient light sensor may adjust the brightness of the display panel 2341 according to the brightness of the ambient light, and the proximity sensor may turn off the display panel 2341 and/or the backlight when the mobile phone is moved to the ear.
  • as a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in all directions (usually three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that recognize the posture of the mobile phone (such as switching between landscape and portrait modes, related games, and magnetometer attitude calibration) and for vibration-recognition-related functions (such as a pedometer and tapping); as for other sensors such as a gyroscope, a barometer, a hygrometer, a thermometer and an infrared sensor that can also be configured on the mobile phone, details are not repeated here.
  • the audio circuit 2360, the speaker 2361, and the microphone 2362 can provide the audio interface between the user and the mobile phone.
  • the audio circuit 2360 can transmit the electrical signal converted from the received audio data to the speaker 2361, and the speaker 2361 converts it into a sound signal for output; on the other hand, the microphone 2362 converts the collected sound signal into an electrical signal, which is received by the audio circuit 2360 and converted into audio data; the audio data is then output to the processor 2380 for processing and sent, for example, to another mobile phone through the RF circuit 2310, or output to the memory 2320 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the mobile phone can help users to send and receive emails, browse web pages and access streaming media through the WiFi module 2370. It provides users with wireless broadband Internet access.
  • although FIG. 23 shows the WiFi module 2370, it can be understood that it is not a necessary component of the mobile phone.
  • the processor 2380 is the control center of the mobile phone, using various interfaces and lines to connect the various parts of the entire mobile phone; by running or executing the software programs and/or modules stored in the memory 2320 and calling the data stored in the memory 2320, it executes the various functions of the mobile phone and processes data, thereby monitoring the mobile phone as a whole.
  • the processor 2380 may include one or more processing units; preferably, the processor 2380 may integrate an application processor and a modem processor, wherein the application processor mainly processes the operating system, user interface, and application programs, etc. , the modem processor mainly deals with wireless communication. It can be understood that, the above-mentioned modulation and demodulation processor may not be integrated into the processor 2380.
  • the mobile phone also includes a camera 2390.
  • the camera 2390 can capture the first image and/or the second image, and is logically connected to the processor 2380, so that the first image and/or the second image can be processed by the processor 2380.
  • for the specific processing flow, refer to the steps in the embodiments shown in the foregoing FIG. 6 to FIG. 20.
  • the mobile phone may also include a power source (such as a battery), a Bluetooth module, and the like, which will not be repeated here.
  • the power supply can be logically connected to the processor 2380 through a power management system, so that functions such as managing charging, discharging, and power consumption are implemented through the power management system.
  • the processor 2380 included in the data processing apparatus may perform the functions in the foregoing embodiments shown in FIG. 6 to FIG. 20 , and details are not described herein again.
  • FIG. 24 is a hardware structure of a chip provided by an embodiment of the present invention, where the chip includes a neural network processor 240 .
  • the chip can be set in the execution device 110 as shown in FIG. 3 to complete the calculation work of the calculation module 111 .
  • the chip can also be set in the training device 120 as shown in FIG. 3 to complete the training work of the training device 120 and output the target model/rule 101 .
  • the algorithms of each layer in the convolutional neural network shown in Figure 4 or Figure 5 can be implemented in the chip shown in Figure 24.
  • the neural network processor (NPU) 240 is mounted on the host CPU (Host CPU) as a coprocessor, and tasks are assigned by the Host CPU.
  • the core part of the NPU is the operation circuit 2403, and the controller 2404 controls the operation circuit 2403 to extract the data in the memory (weight memory or input memory) and perform operations.
  • the arithmetic circuit 2403 includes multiple processing units (process engines, PEs). In some implementations, the arithmetic circuit 2403 is a two-dimensional systolic array. The arithmetic circuit 2403 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 2403 is a general-purpose matrix processor.
  • the operation circuit fetches the data corresponding to the matrix B from the weight memory 2402 and buffers it on each PE in the operation circuit.
  • the operation circuit takes the data of matrix A from the input memory 2401, performs a matrix operation with matrix B, and stores the partial result or final result of the matrix in the accumulator 2408.
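  • As an illustration only, the NumPy sketch below mirrors the accumulate-as-you-go matrix multiplication performed by the operation circuit: matrix B is held as if buffered from the weight memory, and partial products over tiles of matrix A are accumulated, playing the role of the accumulator.

```python
# Illustrative sketch of accumulating partial matrix products, as the accumulator does.
import numpy as np

def accumulate_matmul(A, B, tile=16):
    result = np.zeros((A.shape[0], B.shape[1]), dtype=A.dtype)  # plays the accumulator's role
    for k in range(0, A.shape[1], tile):
        result += A[:, k:k + tile] @ B[k:k + tile, :]           # partial results accumulated
    return result
```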
  • the vector calculation unit 2407 can further process the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 2407 can be used for network calculation of non-convolutional/non-FC layers in the neural network, such as pooling (Pooling), batch normalization (batch normalization), local response normalization (local response normalization), etc. .
  • the vector calculation unit 2407 stores the processed output vectors to the unified memory 2406.
  • the vector calculation unit 2407 may apply a nonlinear function to the output of the arithmetic circuit 2403, such as a vector of accumulated values, to generate activation values.
  • the vector computation unit 2407 generates normalized values, merged values, or both.
  • the vector of processed outputs can be used as activation input to the arithmetic circuit 2403, eg, for use in subsequent layers in a neural network.
  • Unified memory 2406 is used to store input data and output data.
  • the storage unit access controller 2405 (direct memory access controller, DMAC) transfers the input data in the external memory to the input memory 2401 and/or the unified memory 2406, stores the weight data in the external memory into the weight memory 2402, and stores the data in the unified memory 2406 into the external memory.
  • the bus interface unit (bus interface unit, BIU) 2410 is used to realize the interaction between the main CPU, the DMAC and the instruction fetch memory 2409 through the bus.
  • An instruction fetch buffer 2409 connected to the controller 2404 is used to store the instructions used by the controller 2404.
  • the controller 2404 is used for invoking the instructions cached in the memory 2409 to realize and control the working process of the operation accelerator.
  • the unified memory 2406, the input memory 2401, the weight memory 2402 and the instruction fetch memory 2409 are all on-chip memories, and the external memory is memory outside the NPU. The external memory can be a double data rate synchronous dynamic random access memory (DDR SDRAM), a high bandwidth memory (HBM), or other readable and writable memory.
  • each layer in the convolutional neural network shown in FIG. 4 or FIG. 5 may be performed by the operation circuit 2403 or the vector calculation unit 2407 .
  • the disclosed system, apparatus and method may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other division methods in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • the technical solutions of the present application can be embodied in the form of software products in essence, or the parts that contribute to the prior art, or all or part of the technical solutions, and the computer software products are stored in a storage medium , including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk or an optical disc.


Abstract

A data processing method and related device, relating to the field of artificial intelligence, and in particular to the field of computer vision. The method includes: acquiring a first image and a second image; acquiring a relative pose between the first image and the second image; and generating a third image based on the first image, the second image and the relative pose, where the viewing angle of the third image is between the first viewing angle and the second viewing angle. The method can synthesize an intermediate-view image from two existing view images and their relative pose, improving the fineness of the output effect.

Description

A data processing method and related device

This application claims priority to Chinese Patent Application No. 202011148726.1, filed with the China National Intellectual Property Administration on October 23, 2020 and entitled "A data processing method and related device", which is incorporated herein by reference in its entirety.

Technical Field

This application relates to the field of communications, and in particular to a data processing method and related device.

Background

Panoramic highlight moments are a special effect that uses computer vision technology to achieve enhanced slow motion and time pause. The technology is applied in fields such as film and television (e.g., bullet time in The Matrix) and live sports broadcasting (e.g., Intel TrueView).

At present, panoramic highlight moments are obtained as follows: a venue (e.g., a basketball court) is selected in advance, multiple high-definition cameras are set up at fixed positions around the venue, and a large number of expensive high-definition cameras are used to focus on one scene synchronously. A 3D modeling method is then used to recreate a full-size 3D character image (e.g., a basketball player). The scene and the 3D character image are rendered to obtain the panoramic highlight moment, giving the audience a sense of shock and immersion that traditional live broadcasting cannot provide.

However, the above approach still has a problem: when the audience experiences a panoramic highlight moment, how to make the transitions between images feel smooth.
Summary

Embodiments of this application provide a data processing method and related device, which can be used to generate images of intermediate viewing angles.

A first aspect of the embodiments of this application provides a data processing method. The method may be performed by a data processing apparatus, or by a component of the data processing apparatus (for example, a processor, a chip, or a chip system), where the data processing apparatus may be a local device (for example, a mobile phone or a camera) or a cloud device. The method may also be performed jointly by a local device and a cloud device. The method includes: acquiring a first image and a second image, where the first image is an image captured from a first viewing angle and the second image is an image captured from a second viewing angle; acquiring a relative pose between the first image and the second image; and generating a third image based on the first image, the second image and the relative pose, where the viewing angle of the third image is between the first viewing angle and the second viewing angle.

Optionally, in a possible implementation of the first aspect, the first image and the second image are images captured by a first capture device and a second capture device of the same photographed object at the same moment from different viewing angles, and the first image and the second image have first overlapping content, where the first overlapping content includes the photographed object. The relative pose can be understood as the relative pose between a first pose and a second pose, where the first pose is the pose of the first capture device when capturing the first image, and the second pose is the pose of the second capture device when capturing the second image. The third image includes part or all of the first overlapping content, and the third image includes the photographed object. Optionally, the orientations of the photographed object in the first image and in the second image overlap.

In the embodiments of this application, a third image is generated based on the first image, the second image and the relative pose between them, and the viewing angle of the third image is between the first viewing angle and the second viewing angle. Images of other viewing angles are synthesized from the existing view images and the relative pose, improving the fineness of the output effect.

Optionally, in a possible implementation of the first aspect, the relative pose includes a first relative pose and a second relative pose, where the first relative pose is the pose of the first image relative to the second image, and the second relative pose is the pose of the second image relative to the first image. Generating the third image based on the first image, the second image and the relative pose includes: inputting the first image and the second image into a trained optical flow computation network to perform optical flow computation and obtain an initial optical flow map; processing the first image and the initial optical flow map by a forward warping method to obtain a first target optical flow image; processing the second image and the initial optical flow map by the forward warping method to obtain a second target optical flow image; processing the first image and the first relative pose by an image warping method to obtain a first warped image; processing the second image and the second relative pose by the image warping method to obtain a second warped image; and inputting the first target optical flow image, the first warped image, the second target optical flow image and the second warped image into a trained image inpainting network to perform image inpainting and obtain the third image.

Optionally, in a possible implementation of the first aspect, the first relative pose is the pose of the first capture device at the first capture moment relative to the second capture device at the second capture moment, and the second relative pose is the pose of the second capture device at the second capture moment relative to the first capture device at the first capture moment.

In this possible implementation, combining the first warped image and the second warped image, whose features are relatively complete, with the first target optical flow image and the second target optical flow image, whose detailed features are relatively distinct, allows the information in these images to complement each other, providing more reference for the subsequent image inpainting network to generate the third image and making the generated third image smoother.

Optionally, in a possible implementation of the first aspect, the trained optical flow computation network and the trained image inpainting network are obtained by jointly training the optical flow computation network and the image inpainting network, with the first training image and the second training image as the input of the optical flow computation network and with the goal that the value of the loss function is less than a second threshold. The loss function indicates the difference between the image output by the image inpainting network and a third target image, where the third target image is an image captured at a viewing angle between the first target viewing angle corresponding to the first target image and the second target viewing angle corresponding to the second target image.

In this possible implementation, the training of the optical flow computation network and the image inpainting network is carried out with the first training image, the second training image and the third target image, providing more optimized networks for subsequent use and improving the fineness of the output image (that is, the third image).

Optionally, in a possible implementation of the first aspect, the relative pose includes a transformation matrix used to describe the correspondence between pixels of the first image and the second image; the first relative pose includes a first transformation matrix, which is the matrix of the first image relative to the second image, and the second relative pose includes a second transformation matrix, which is the matrix of the second image relative to the first image.

Optionally, in a possible implementation of the first aspect, that the relative pose includes a transformation matrix can be understood as the relative pose being expressed in the form of a transformation matrix.

In this possible implementation, the relative pose is described by a transformation matrix, and image warping can directly process the transformation matrix, the first image and the second image to obtain the first warped image and the second warped image, which is highly general.

Optionally, in a possible implementation of the first aspect, the method further includes: obtaining a target image based on the background of the first image, the background of the second image and the third image, where the target image includes the photographed object in the third image, and the target image further includes part or all of the background of the first image (which may also be called the first background image) and part or all of the background of the second image (which may also be called the second background image).

Optionally, in a possible implementation of the first aspect, the first image includes a target person and the first background image, and the second image includes the target person and the second background image.

Optionally, in a possible implementation of the first aspect, the target person corresponds to the aforementioned photographed object; the first background image can be understood as the background of the first image other than the photographed object, and the second background image can be understood as the background of the second image other than the photographed object.

In this possible implementation, person images in multiple original images can be synthesized into images corresponding to intermediate viewing angles, meeting the requirements of special effects such as slow motion or time pause.

Optionally, in a possible implementation of the first aspect, the method further includes: stitching the first background image and the second background image to obtain a target background image, and fusing the third image and the target background image to obtain the target image. In this possible implementation, through operations such as extraction, stitching and fusion, the synthesized intermediate-view image can be fused with a wide-angle background image, achieving a seamless connection between foreground and background and ensuring that the output panoramic highlight video carries background information. This can be applied to slow-motion footage of sports events and the like, making images more accurate and the transitions between images smoother.

Optionally, in a possible implementation of the first aspect, obtaining the target image based on the background of the first image, the background of the second image and the third image includes: separating the photographed object from the first image to obtain a first hole image; filling the first hole image based on the first image to obtain the background of the first image; separating the photographed object from the second image to obtain a second hole image; filling the second hole image based on the second image to obtain the background of the second image; and stitching the background of the first image, the background of the second image and the third image to generate the target image. Optionally, in a possible implementation of the first aspect, the method further includes: fusing the first image and the target background image to obtain a first target image; fusing the second image and the target background image to obtain a second target image; and compressing the first target image, the target image and the second target image to obtain a target video.

In this possible implementation, the method can be applied to the generation of panoramic highlight moments in a flexible way: a moving mobile phone can be used to generate panoramic highlight moments (that is, the target video). Compared with panoramic highlight moments produced with fixed camera positions, this method is more flexible.

Optionally, in a possible implementation of the first aspect, the method further includes: sending the target video to the first photographing device.

Optionally, in a possible implementation of the first aspect, the method further includes: sending the target video to the second photographing device. In this possible implementation, after the video is generated, it can be fed back to the first photographing device, so that the user can watch the panoramic highlight video (that is, the target video) on the first photographing device, adding functionality and playability to the user-side device.

Optionally, in a possible implementation of the first aspect, the first photographing device may also be called the first capture device, and the second photographing device may also be called the second capture device.
本申请实施例第二方面提供一种数据处理装置,该数据处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该数据处理装置包括:
获取单元,用于获取第一图像以及第二图像,第一图像为第一视角下采集的图像,第二图像为第二视角下采集的图像,第一图像的采集时刻与第二图像的采集时刻相同;
获取单元,还用于获取第一图像与第二图像之间的相对位姿;
生成单元,用于基于第一图像、第二图像以及相对位姿,生成第三图像,第三图像的视角在第一视角与第二视角之间。
可选地,在第二方面的一种可能的实现方式中,上述的第一图像与第二图像为第一采集设备与第二采集设备针对同一拍摄对象在同一时刻不同视角下采集的图像,且第一图像与第二图像存在第一交叠内容,该第一交叠内容包括该拍摄对象;可选地,第三图像包括第一交叠内容的部分或全部,且第三图像包括该拍摄对象。可选地,第一图像中的拍摄对象与第二图像中的拍摄对象的朝向有重叠。
可选地,在第二方面的一种可能的实现方式中,上述的相对位姿可以理解为是第一位姿与第二位姿之间的相对位姿,其中,第一位姿为第一采集设备采集第一图像时的位姿;第二位姿为第二采集设备采集第二图像时的位姿。可选地,在第二方面的一种可能的实现方式中,相对位姿包括第一相对位姿以及第二相对位姿,第一相对位姿为第一图像相对于第二图像的位姿,第二相对位姿为第二图像相对于第一图像的位姿;
可选地,在第二方面的一种可能的实现方式中,上述的第一相对位姿为第一采集时刻的第一采集设备相对于所述第二采集时刻的第二采集设备的位姿,第二相对位姿为第二采集时刻的第二采集设备相对于所述第一采集时刻的第一采集设备的位姿。
生成单元包括:
光流计算子单元,用于将第一图像以及第二图像输入训练好的光流计算网络进行光流计算,得到初始光流图;
第一扭曲子单元,用于通过前置图像扭曲forwad warping方法处理第一图像以及初始光流图得到第一目标光流图像;
第一扭曲子单元,还用于通过forwad warping方法处理第二图像以及初始光流图得到第二目标光流图像;
第二扭曲子单元,用于通过图片图像扭曲image warping方法处理第一图像以及第一相对位姿得到第一扭曲图像;
第二扭曲子单元,用于通过image warping方法处理第二图像以及第二相对位姿得到第二扭曲图像;
修复子单元,用于将第一目标光流图像、第一扭曲图像、第二目标光流图像、第二扭曲图像输入训练好的图像修复网络进行图像修复,得到第三图像。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置中训练好的光流计算网络以及训练好的图像修复网络是通过以第一训练图像以及第二训练图像作为光流计算网络的输入,以损失函数的值小于第二阈值为目标对光流计算网络以及图像修复网络进行联合训练得到;
损失函数用于指示图像修复网络输出的图像与第三目标图像之间的差异,第三目标图像为在第一目标图像对应第一目标视角和第二目标图像对应第二目标视角之间的视角下采集的图像。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置中相对位姿包括变换矩阵,变换矩阵用于描述第一图像与第二图像之间像素点的关联关系,第一相对位姿包括第一变换矩阵,第一变换矩阵为第一图像相对于第二图像的矩阵,第二相对位姿包括第二变换矩阵,第二变换矩阵为第二图像相对于第一图像的矩阵。
可选地,在第二方面的一种可能的实现方式中,上述相对位姿的表达形式为变换矩阵。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置还包括:拼接单元,用于基于第一图像的背景、第二图像的背景以及第三图像拼接得到目标图像,目标图像包括第三图像中的拍摄对象。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置中第一图像包括目标人物与第一背景图像,第二图像包括目标人物与第二背景图像。
可选地,在第二方面的一种可能的实现方式中,上述的目标人物相当于前面的拍摄对象,第一背景图像可以理解为是第一图像中除了拍摄对象以外的背景,第二背景图像可以理解为是第二图像中除了拍摄对象以外的背景。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置还包括:
拼接单元,用于拼接第一背景图像以及第二背景图像,得到目标背景图像,
融合单元,用于融合第三图像以及目标背景图像,得到目标图像。可选地,在第二方面的一种可能的实现方式中,上述目标图像即基于第一图像的背景、第二图像的背景以及第三图像拼接得到,该目标图像包括第三图像中的拍摄对象。
可选地,在第二方面的一种可能的实现方式中,上述的拼接单元,具体用于分离第一图像中的拍摄对象得到第一空洞图像;拼接单元,具体用于基于第一图像填充第一空洞图像得到第一图像的背景;分离第二图像中的拍摄对象得到第二空洞图像;拼接单元,具体用于基于第二图像填充第二空洞图像得到第二图像的背景;拼接单元,具体用于拼接第一图像的背景、第二图像的背景以及第三图像生成所述目标图像。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置中的融合单元,还用于融合第一图像以及目标背景图像,得到第一目标图像;融合单元,还用于融合第二图像以及目标背景图像,得到第二目标图像;
上述数据处理装置还包括:
压缩单元,用于压缩第一目标图像、目标图像以及第二目标图像得到目标视频。
可选地,在第二方面的一种可能的实现方式中,上述数据处理装置还包括:
发送单元,用于向第一拍摄设备发送目标视频。
本申请实施例第三方面提供了一种数据处理装置,该数据处理装置可以是手机或摄像机,也可以是云端设备(例如服务器等),该数据处理装置执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第四方面提供了一种芯片,该芯片包括处理器和通信接口,通信接口和处理器耦合,处理器用于运行计算机程序或指令,使得该芯片实现上述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第五方面提供了一种计算机可读存储介质,该计算机可读存储介质中存储有指令,该指令在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第六方面提供了一种计算机程序产品,该计算机程序产品在计算机上执行时,使得计算机执行前述第一方面或第一方面的任意可能的实现方式中的方法。
本申请实施例第七方面提供了一种数据处理装置,包括:处理器,处理器与存储器耦合,存储器用于存储程序或指令,当程序或指令被处理器执行时,使得该数据处理装置实现上述 第一方面或第一方面的任意可能的实现方式中的方法。
其中,第二、第三、第四、第五、第六、第七方面或者其中任一种可能实现方式所带来的技术效果可参见第一方面或第一方面不同可能实现方式所带来的技术效果,此处不再赘述。
从以上技术方案可以看出,本申请实施例具有以下优点:基于第一图像、第二图像以及第一图像与第二图像之间的相对位姿,生成第三图像,该第三图像的视角在第一视角与第二视角之间。本申请能够通过已有的两个视角图像以及相对位姿合成中间视角的图像,提升输出效果的精细程度。
附图说明
图1为本申请实施例提供的一种应用场景示意图;
图2为本申请实施例中主设备与副设备一种位置关系示意图;
图3为本申请实施例提供的系统架构的结构示意图;
图4为本发明实施例提供的一种卷积神经网络结构示意图;
图5为本发明实施例提供的另一种卷积神经网络结构示意图;
图6为本申请实施例提供的数据处理方法一个流程示意图;
图7为本申请实施例提供的第一图像以及第二图像中特征点的一种示意图;
图8为本申请实施例提供的第一图像以及第二图像之间匹配对的一种示意图;
图9为本申请实施例提供的获取第三图像的一种示意图;
图10为本申请实施例提供的数据处理方法另一流程示意图;
图11为本申请实施例提供的第一原始图像与第一人物图像的一种示意图;
图12为本申请实施例提供的第一原始图像以及第二原始图像之间匹配对的一种示意图;
图13为本申请实施例提供的获取一个第三图像的另一示意图;
图14为本申请实施例提供的获取两个第三图像的另一示意图;
图15为本申请实施例提供的原始图像与背景图像的一种示意图;
图16为本申请实施例提供的目标背景图像的一种示意图;
图17为本申请实施例提供的目标背景图像的另一示意图;
图18为本申请实施例提供的目标图像的另一示意图;
图19为本申请实施例提供的目标视频的一种示意图;
图20为本申请实施例提供的目标视频的另一示意图;
图21为本申请实施例提供的数据处理装置一个结构示意图;
图22为本申请实施例提供的数据处理装置另一结构示意图;
图23为本申请实施例提供的数据处理装置另一结构示意图;
图24为本申请实施例提供的一种芯片硬件结构示意图。
具体实施方式
下面结合本发明实施例中的附图对本发明实施例进行描述。本发明的实施方式部分使用的术语仅用于对本发明的具体实施例进行解释,而非旨在限定本发明。
下面结合附图,对本申请的实施例进行描述。本领域普通技术人员可知,随着技术的发 展和新场景的出现,本申请实施例提供的技术方案对于类似的技术问题,同样适用。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的术语在适当情况下可以互换,这仅仅是描述本申请的实施例中对相同属性的对象在描述时所采用的区分方式。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,以便包含一系列单元的过程、方法、系统、产品或设备不必限于那些单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它单元。
下面将结合各个附图对本申请技术方案的实现原理、具体实施方式及其对应能够达到的有益效果进行详细的阐述。
图1给出了一种应用场景示意图,可以应用于人工智能领域的图像处理领域中。该应用场景可以包括云端设备100、主设备101以及与主设备101进行通信的副设备102至104。
图1中,仅以一个主设备101以及三个副设备102至104为例进行示意性说明。在实际应用中,本申请实施例中的应用场景可以有更多的主设备以及副设备,本申请实施例对主设备以及副设备的数目不进行限定。
各副设备接入云端设备的方式也可以有所不同,可以是多个副设备102至104通过主设备101接入云端设备100,也可以是多个副设备直接与云端设备连接,具体此处不做限定。
副设备102至104与主设备101之间,或主设备101与云端设备100之间一般通过无线网络连接,也可以通过有线网络连接,如果是通过无线网络连接,具体的连接形式可以为蜂窝状无线网络,或者是WiFi网络,或者是其他类型的无线网络。如果是通过有线网络连接,一般的连接形式为光纤网络。
主设备101以及副设备102至104主要的功能是拍摄图像。进一步的,主设备101以及副设备102至104还可以用于采集一个3D场景。主设备101与副设备102至104之间的位置关系可以是环形部署(例如图2所示,其中,图2所示环形部署中的主设备的数量为1个,副设备的数量为5个,具体的设备数量只是举例)、球形部署,正方体部署等,具体主设备与副设备之间的位置关系此处不做限定。
可选地,主设备101、副设备102至104中相邻两设备之间的角度小于或等于某一阈值。
在一种可能的设计中,主设备101可以控制副设备102至104触发同时拍摄,然后副设备102至104将获取到的相同时刻多个图像传送到主设备101。主设备101可以利用算法处理多个图像,得到目标图像或目标视频等数据。主设备101还可以向副设备102至104发送目标图像或目标视频等数据。
在另一种可能的设计中,主设备101可以控制副设备102至104触发同时拍摄,然后副设备102至104将获取到的相同时刻多个图像传送到主设备101。主设备101可以将多个图像上传至云端设备100,云端设备100利用算法处理多个图像,得到目标图像或目标视频等数据。云端设备100还可以向主设备101发送目标图像或目标视频等数据。使得主设备101还可以向副设备102至104发送目标图像或目标视频等数据。从而完成从采集到最终效果呈现的结果。
本申请实施例中,主设备或副设备是一种具有拍摄功能的设备,可以是摄像机、照相机、手机(mobile phone)、平板电脑(Pad)、增强现实(augmented reality,AR)终端设备或 穿戴终端设备等。
作为另一示例,本申请实施例除了可以应用于人工智能领域的图像处理领域中,还可以应用于其他需要进行中间视角合成的场景中,例如:电影电视(如:黑客帝国中的子弹时间)、体育赛事直播(如:Intel TrueView)或房产交易平台所应用的3D视角等场景。此处不再对其他场景进行一一列举。
由于本申请实施例涉及神经网络的应用,为了便于理解,下面先对本申请实施例主要涉及的神经网络的相关术语和概念进行介绍。
(1)神经网络
神经网络可以是由神经单元组成的,神经单元可以是指以$X_s$和截距1为输入的运算单元,该运算单元的输出可以为:
$$h_{W,b}(X)=f(W^{T}X)=f\left(\sum_{s=1}^{n}W_{s}X_{s}+b\right)$$
其中,s=1、2、……n,n为大于1的自然数,$W_s$为$X_s$的权重,b为神经单元的偏置。f为神经单元的激活函数(activation functions),用于将非线性特性引入神经网络中,来将神经单元中的输入信号转换为输出信号。该激活函数的输出信号可以作为下一层卷积层的输入。激活函数可以是sigmoid函数。神经网络是将许多个上述单一的神经单元联结在一起形成的网络,即一个神经单元的输出可以是另一个神经单元的输入。每个神经单元的输入可以与前一层的局部接受域相连,来提取局部接受域的特征,局部接受域可以是由若干个神经单元组成的区域。
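为便于理解上述神经单元的运算过程,下面给出一个示意性的Python(NumPy)代码草图,其中以sigmoid作为激活函数f,输入$X_s$、权重$W_s$与偏置b均为假设的示例数值,并非本申请限定的实现:

```python
import numpy as np

def sigmoid(z):
    # 激活函数f:将非线性特性引入,把输入信号转换为输出信号
    return 1.0 / (1.0 + np.exp(-z))

def neuron_output(x, w, b):
    # 神经单元输出: f(sum_s(W_s * X_s) + b)
    return sigmoid(np.dot(w, x) + b)

# 假设的示例输入X_s、权重W_s与偏置b
x = np.array([0.5, -1.2, 3.0])
w = np.array([0.8, 0.1, -0.4])
b = 0.2
print(neuron_output(x, w, b))
```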
(2)深度神经网络
深度神经网络(deep neural network,DNN),也称多层神经网络,可以理解为具有很多层隐含层的神经网络,这里的“很多”并没有特别的度量标准。从DNN按不同层的位置划分,DNN内部的神经网络可以分为三类:输入层,隐含层,输出层。一般来说第一层是输入层,最后一层是输出层,中间的层数都是隐含层。层与层之间是全连接的,也就是说,第i层的任意一个神经元一定与第i+1层的任意一个神经元相连。虽然DNN看起来很复杂,但是就每一层的工作来说,其实并不复杂,简单来说就是如下线性关系表达式:
$$\vec{y}=\alpha(W\vec{x}+\vec{b})$$
其中,$\vec{x}$是输入向量,$\vec{y}$是输出向量,$\vec{b}$是偏移向量,W是权重矩阵(也称系数),α()是激活函数。每一层仅仅是对输入向量经过如此简单的操作得到输出向量。由于DNN层数多,则系数W和偏移向量$\vec{b}$的数量也就很多了。这些参数在DNN中的定义如下所述:以系数W为例:假设在一个三层的DNN中,第二层的第4个神经元到第三层的第2个神经元的线性系数定义为$W_{24}^{3}$。上标3代表系数所在的层数,而下标对应的是输出的第三层索引2和输入的第二层索引4。总结就是:第L-1层的第k个神经元到第L层的第j个神经元的系数定义为$W_{jk}^{L}$。需要注意的是,输入层是没有W参数的。在深度神经网络中,更多的隐含层让网络更能够刻画现实世界中的复杂情形。理论上而言,参数越多的模型复杂度越高,“容量”也就越大,也就意味着它能完成更复杂的学习任务。训练深度神经网络的也就是学习权重矩阵的过程,其最终目的是得到训练好的深度神经网络的所有层的权重矩阵(由很多层的向量W形成的权重矩阵)。
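下面给出逐层套用上述线性关系表达式$\vec{y}=\alpha(W\vec{x}+\vec{b})$的一个示意性Python(NumPy)草图,各层的权重矩阵W与偏移向量b均为随机初始化的假设数值,仅用于说明DNN前向计算的基本过程:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def dnn_forward(x, weights, biases):
    # 逐层计算 y = alpha(W x + b),上一层的输出作为下一层的输入
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# 假设一个输入4维、隐含5维、输出2维的小型DNN
rng = np.random.default_rng(0)
weights = [rng.standard_normal((5, 4)), rng.standard_normal((2, 5))]
biases = [rng.standard_normal(5), rng.standard_normal(2)]
print(dnn_forward(rng.standard_normal(4), weights, biases))
```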
(3)卷积神经网络
卷积神经网络(convolutional neuron network,CNN)是一种带有卷积结构的深度神经网络。卷积神经网络包含了一个由卷积层和子采样层构成的特征抽取器。该特征抽取器可以看作是滤波器,卷积过程可以看作是使用一个可训练的滤波器与一个输入的图像或者卷积特征平面(feature map)做卷积。卷积层是指卷积神经网络中对输入信号进行卷积处理的神经元层。在卷积神经网络的卷积层中,一个神经元可以只与部分邻层神经元连接。一个卷积层中,通常包含若干个特征平面,每个特征平面可以由一些矩形排列的神经单元组成。同一特征平面的神经单元共享权重,这里共享的权重就是卷积核。共享权重可以理解为提取图像信息的方式与位置无关。这其中隐含的原理是:图像的某一部分的统计信息与其他部分是一样的。即意味着在某一部分学习的图像信息也能用在另一部分上。所以对于图像上的所有位置,都能使用同样的学习得到的图像信息。在同一卷积层中,可以使用多个卷积核来提取不同的图像信息,一般地,卷积核数量越多,卷积操作反映的图像信息越丰富。
卷积核可以以随机大小的矩阵的形式初始化,在卷积神经网络的训练过程中卷积核可以通过学习得到合理的权重。另外,共享权重带来的直接好处是减少卷积神经网络各层之间的连接,同时又降低了过拟合的风险。
(4)损失函数
在训练深度神经网络的过程中,因为希望深度神经网络的输出尽可能的接近真正想要预测的值,所以可以通过比较当前网络的预测值和真正想要的目标值,再根据两者之间的差异情况来更新每一层神经网络的权重向量(当然,在第一次更新之前通常会有初始化的过程,即为深度神经网络中的各层预先配置参数),比如,如果网络的预测值高了,就调整权重向量让它预测低一些,不断的调整,直到深度神经网络能够预测出真正想要的目标值或与真正想要的目标值非常接近的值。因此,就需要预先定义“如何比较预测值和目标值之间的差异”,这便是损失函数(loss function)或目标函数(objective function),它们是用于衡量预测值和目标值的差异的重要方程。其中,以损失函数举例,损失函数的输出值(loss)越高表示差异越大,那么深度神经网络的训练就变成了尽可能缩小这个loss的过程。
(5)反向传播算法
卷积神经网络可以采用误差反向传播(back propagation,BP)算法在训练过程中修正初始的超分辨率模型中参数的大小,使得超分辨率模型的重建误差损失越来越小。具体地,前向传递输入信号直至输出会产生误差损失,通过反向传播误差损失信息来更新初始的超分辨率模型中参数,从而使误差损失收敛。反向传播算法是以误差损失为主导的反向传播运动,旨在得到最优的超分辨率模型的参数,例如权重矩阵。
(6)像素值
图像的像素值可以是一个红绿蓝(RGB)颜色值,像素值可以是表示颜色的长整数。例如,像素值为256*Red+100*Green+76*Blue,其中,Blue代表蓝色分量,Green代表绿色分量,Red代表红色分量。各个颜色分量中,数值越小,亮度越低,数值越大,亮度越高。对于灰度图像来说,像素值可以是灰度值。
参见附图3,本发明实施例提供了一种系统架构100。如所述系统架构100所示,数据采 集设备160用于采集训练数据,本申请实施例中训练数据包括:第一训练图像以及第二训练图像。其中,第一训练图像可以为第一图像,第二训练图像可以为第二图像。也可以理解为第一训练图像与第一图像是针对同一场景相同的一个视角下采集的图像,第二训练图像与第二图像是针对该同一场景相同的另一个视角下采集的图像。并将训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。下面将以实施例一更详细地描述训练设备120如何基于训练数据得到目标模型/规则101,该目标模型/规则101能够用于实现本申请实施例提供的数据处理方法,即,将相同时刻不同视角的两个图像通过相关预处理后输入该目标模型/规则101,即可得到中间视角的图像。本申请实施例中的目标模型/规则101具体可以为光流计算网络和/或图像修复网络,在本申请提供的实施例中,该光流计算网络和/或图像修复网络是通过训练第一训练图像以及第二训练图像得到的。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图3所示的执行设备110,所述执行设备110可以是终端,如手机终端,平板电脑,笔记本电脑,AR/VR,车载终端等,还可以是服务器或者云端等。在附图3中,执行设备110配置有I/O接口112,用于与外部设备进行数据交互,用户可以通过客户设备140向I/O接口112输入数据,所述输入数据在本申请实施例中可以包括:第一图像以及第二图像,可以是用户输入的,也可以是用户通过拍摄设备上传的,当然还可以来自数据库,具体此处不做限定。
预处理模块113用于根据I/O接口112接收到的输入数据(如第一图像以及第二图像)进行预处理,在本申请实施例中,预处理模块113可以用于将输入的多个数据进行尺寸修剪等操作(例如当各副设备或主设备输出的第一图像以及第二图像大小规格等不一致时,预处理模块113还可以用于将第一图像以及第二图像进行归一化处理)。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的第三图像返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或称不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在附图3中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集 如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,附图3仅是本发明实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在附图3中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图3所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是光流计算网络和/或图像修复网络,具体的,在本申请实施例提供的网络中,光流计算网络和图像修复网络都可以是卷积神经网络。
一种可能实现的方式中,图3中的执行设备110可以是前述图1所示的云端设备,客户设备140可以是前述图1所示的主设备或副设备,即本申请提供的方法主要应用于云端设备。
另一种可能实现的方式中,图3中的执行设备110可以是前述图1所示的主设备,客户设备140可以是前述图1所示的副设备,即本申请提供的方法主要应用于主设备。
如前文的基础概念介绍所述,卷积神经网络是一种带有卷积结构的深度神经网络,是一种深度学习(deep learning)架构,深度学习架构是指通过机器学习的算法,在不同的抽象层级上进行多个层次的学习。作为一种深度学习架构,CNN是一种前馈(feed-forward)人工神经网络,该前馈人工神经网络中的各个神经元可以对输入其中的图像作出响应。
如图4所示,卷积神经网络(CNN)100可以包括输入层110,卷积层/池化层120,其中池化层为可选的,以及神经网络层130。
卷积层/池化层120:
卷积层:
如图4所示卷积层/池化层120可以包括如示例121-126层,在一种实现中,121层为卷积层,122层为池化层,123层为卷积层,124层为池化层,125为卷积层,126为池化层;在另一种实现方式中,121、122为卷积层,123为池化层,124、125为卷积层,126为池化层。即卷积层的输出可以作为随后的池化层的输入,也可以作为另一个卷积层的输入以继续进行卷积操作。
以卷积层121为例,卷积层121可以包括很多个卷积算子,卷积算子也称为核,其在图像处理中的作用相当于一个从输入图像矩阵中提取特定信息的过滤器,卷积算子本质上可以是一个权重矩阵,这个权重矩阵通常被预先定义,在对图像进行卷积操作的过程中,权重矩阵通常在输入图像上沿着水平方向一个像素接着一个像素(或两个像素接着两个像素……这取决于步长stride的取值)的进行处理,从而完成从图像中提取特定特征的工作。该权重矩阵的大小应该与图像的大小相关,需要注意的是,权重矩阵的纵深维度(depth dimension)和输入图像的纵深维度是相同的,在进行卷积运算的过程中,权重矩阵会延伸到输入图像的整个深度。因此,和一个单一的权重矩阵进行卷积会产生一个单一纵深维度的卷积化输出,但是大多数情况下不使用单一权重矩阵,而是应用维度相同的多个权重矩阵。每个权重矩阵的输出被堆叠起来形成卷积图像的纵深维度。不同的权重矩阵可以用来提取图像中不同的特 征,例如一个权重矩阵用来提取图像边缘信息,另一个权重矩阵用来提取图像的特定颜色,又一个权重矩阵用来对图像中不需要的噪点进行模糊化……该多个权重矩阵维度相同,经过该多个维度相同的权重矩阵提取后的特征图维度也相同,再将提取到的多个维度相同的特征图合并形成卷积运算的输出。
这些权重矩阵中的权重值在实际应用中需要经过大量的训练得到,通过训练得到的权重值形成的各个权重矩阵可以从输入图像中提取信息,从而帮助卷积神经网络100进行正确的预测。
当卷积神经网络100有多个卷积层的时候,初始的卷积层(例如121)往往提取较多的一般特征,该一般特征也可以称之为低级别的特征;随着卷积神经网络100深度的加深,越往后的卷积层(例如126)提取到的特征越来越复杂,比如高级别的语义之类的特征,语义越高的特征越适用于待解决的问题。
池化层:
由于常常需要减少训练参数的数量,因此卷积层之后常常需要周期性的引入池化层,即如图4中120所示例的121-126各层,可以是一层卷积层后面跟一层池化层,也可以是多层卷积层后面接一层或多层池化层。在图像处理过程中,池化层的唯一目的就是减少图像的空间大小。池化层可以包括平均池化算子和/或最大池化算子,以用于对输入图像进行采样得到较小尺寸的图像。平均池化算子可以在特定范围内对图像中的像素值进行计算产生平均值。最大池化算子可以在特定范围内取该范围内值最大的像素作为最大池化的结果。另外,就像卷积层中用权重矩阵的大小应该与图像大小相关一样,池化层中的运算符也应该与图像的大小相关。通过池化层处理后输出的图像尺寸可以小于输入池化层的图像的尺寸,池化层输出的图像中每个像素点表示输入池化层的图像的对应子区域的平均值或最大值。
神经网络层130:
在经过卷积层/池化层120的处理后,卷积神经网络100还不足以输出所需要的输出信息。因为如前所述,卷积层/池化层120只会提取特征,并减少输入图像带来的参数。然而为了生成最终的输出信息(所需要的类信息或别的相关信息),卷积神经网络100需要利用神经网络层130来生成一个或者一组所需要的类的数量的输出。因此,在神经网络层130中可以包括多层隐含层(如图4所示的131、132至13n)以及输出层140,该多层隐含层中所包含的参数可以根据具体的任务类型的相关训练数据进行预先训练得到,例如该任务类型可以包括图像识别,图像分类,图像超分辨率重建等等……
在神经网络层130中的多层隐含层之后,也就是整个卷积神经网络100的最后层为输出层140,该输出层140具有类似分类交叉熵的损失函数,具体用于计算预测误差,一旦整个卷积神经网络100的前向传播(如图4由110至140的传播为前向传播)完成,反向传播(如图4由140至110的传播为反向传播)就会开始更新前面提到的各层的权重值以及偏差,以减少卷积神经网络100的损失及卷积神经网络100通过输出层输出的结果和理想结果之间的误差。
需要说明的是,如图4所示的卷积神经网络100仅作为一种卷积神经网络的示例,在具体的应用中,卷积神经网络还可以以其他网络模型的形式存在,例如,如图5所示的多个卷积层/池化层并行,将分别提取的特征均输入给全神经网络层130进行处理。
CNN的算法处理可以应用于图1所示的主设备101或云端设备100中。
下面结合图1的应用场景,对本申请实施例中的数据处理方法进行描述:
请参阅图6,本申请实施例中数据处理方法一个实施例包括:
601、数据处理装置获取第一图像以及第二图像。
下面仅以数据处理装置是图1所示场景中的主设备、第一拍摄设备与第二拍摄设备是图1所示场景中的任意两个副设备为例进行示意性说明,可以理解的是,数据处理装置还可以是图1所示场景中的云端设备,第一拍摄设备与第二拍摄设备可以是主设备或副设备。具体此处不做限定。
本实施例中,第一图像可以是第一拍摄设备在第一视角下采集拍摄对象的图像,第二图像可以是第二拍摄设备在第二视角下采集拍摄对象的图像,且第一拍摄设备采集第一图像的时刻与第二拍摄设备采集第二图像的时刻相同(或者第一图像的采集时刻与第二图像的采集时刻之间的时间间隔小于或等于预设阈值)。即第一图像与第二图像为多个拍摄设备在同时刻且不同视角下获得的图像。
其中,上述的拍摄设备也可以称为采集设备。
可选地,第一图像与第二图像是第一拍摄设备与第二拍摄设备对同一拍摄对象在同一时刻不同视角下采集的图像,且第一图像与第二图像存在交叠内容。其中,拍摄对象可以是指人物、动物、物品等对象,具体此处不做限定。
本申请实施例中,第一图像与第二图像存在交叠内容可以理解为第一图像与第二图像中的画面内容部分相同,例如:第一图像与第二图像重叠的内容(或区域、面积)大于或等于某一阈值(例如20%)。第一图像与第二图像存在交叠内容还可以理解为第一图像与第二图像的画面内容具有同一拍摄对象。可选地,第一图像与第二图像中拍摄对象的朝向有交叠。
可选地,第一距离与第二距离的差值小于或等于某个预设阈值,其中,第一距离为第一拍摄设备采集第一图像时第一拍摄设备与参考点的距离,第二距离为第二拍摄设备采集第二图像时第二拍摄设备与参考点的距离。参考点可以是指人物拍摄对象所在的某一个位置,例如拍摄对象是人物,参考点可以是该人物所在的位置,例如舞台的中间位置。或者,可以理解为第一拍摄设备采集第一图像时的位置与第二拍摄设备采集第二图像时的位置共处于以拍摄对象为内侧的弧线上。
可选地,第一图像与第二图像的视场角的重叠角度大于某一阈值(例如:第一视角与第二视角的重叠角度大于30度);和/或采集两个图像的拍摄设备旋转角度的差异小于预设角度。其中,旋转角度可以是拍摄设备水平角旋转的角度值,也可以是拍摄设备俯视角旋转的角度值。
可以理解的是,第一拍摄设备采集第一图像的时刻与第二拍摄设备采集第二图像的时刻相同,也可以认为是第一图像的采集时刻与第二图像的采集时刻之间的时间间隔小于或等于预设阈值,该预设阈值根据实际需要设置,具体此处不做限定。
第一拍摄设备与第二拍摄设备采集到第一图像以及第二图像之后,向数据处理装置发送第一图像以及第二图像。
602、数据处理装置获取第一图像与第二图像之间的相对位姿。
本申请实施例中,相对位姿包括第一相对位姿以及第二相对位姿,第一相对位姿为第一 图像相对于第二图像的位姿,第二相对位姿为第二图像相对于第一图像的位姿。
可选地,第一相对位姿是第一拍摄设备采集第一图像时的位姿,第二相对位姿为第二拍摄设备采集第二图像时的位姿。即第二图像的位姿是指第二拍摄设备采集第二图像时的位姿。第一图像的位姿是指第一拍摄设备采集第一图像时的位姿。第一图像与第二图像之间的相对位姿是指第一位姿与第二位姿之间的相对位姿。
其中,本申请实施例中所描述的第一图像与第二图像之间的相对位姿实质上是第一位姿与第二位姿的相对位姿,第一位姿为第一采集设备采集第一图像时的位姿;第二位姿为第二采集设备采集第二图像时的位姿。
本申请实施例中的相对位姿可以包括基础矩阵或变换矩阵(H)等参数,也可以理解为基础矩阵或变换矩阵等参数可以用来描述相对位姿。即若用变换矩阵描述相对位姿,变换矩阵包括第一变换矩阵以及第二变换矩阵,第一变换矩阵为第一图像相对于第二图像的矩阵,第二变换矩阵为第二图像相对于第一图像的矩阵。
本申请实施例中,数据处理装置获取相对位姿的方式有很多。下面仅以采用运动推断结构(structure from motion,SFM)算法为例进行示意性说明。
数据处理装置可以通过特征点提取以及SFM的方式估计第一图像与第二图像之间的相对位姿。
使用尺度不变特征变换(scale-invariant feature transform,SIFT)特征检测器提取第一图像以及第二图像的特征点,并计算该特征点对应的描述子(descriptor),使用近似最近邻算法(approximate nearest neighbor,ANN)方法进行匹配,得到匹配对。然后,将低于预设值的匹配对删除。对保留下来的匹配使用随机抽样一致算法(RANdom Sample Consensus,RANSAC)进行过滤误匹配,得到目标匹配对。通过八点法得到变换矩阵,具体得到变换矩阵的方式有多种,下面以两种方式为例进行示意性说明:
1、选取第一图像中的任意四个点坐标(其中至少三个点不在同一直线上),并且在第二图像上指定四个点(与第一图像的四个点对应),通过八个点求出变换矩阵。
2、通过八点法先得到基础矩阵(fundamental matrix),再利用基础矩阵变换得到变换矩阵。
RANSAC算法能有效地消除错误点给模型参数带来的偏差,通过RANSAC算法以及八点法获得的变换矩阵更加准确。
示例性的,如图7所示,数据处理装置先获取第一图像以及第二图像的SIFT特征点,在通过ANN方法匹配得到如图8所示保留下来的匹配对。再对保留下来的匹配对使用RANSAC和八点法估计变换矩阵,从而得到第一拍摄设备与第二拍摄设备之间的相对位姿(即RT矩阵)。
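作为参考,下面给出上述“SIFT特征提取、近似最近邻匹配、RANSAC过滤误匹配、八点法/单应估计”流程的一个示意性Python(OpenCV)代码草图,其中文件名、比值阈值0.75、RANSAC阈值等均为假设值,完整的SFM位姿恢复在实际实现中可能更复杂:

```python
import cv2
import numpy as np

img1 = cv2.imread("first.jpg", cv2.IMREAD_GRAYSCALE)   # 第一图像(假设的文件名)
img2 = cv2.imread("second.jpg", cv2.IMREAD_GRAYSCALE)  # 第二图像

# 1. 提取SIFT特征点并计算描述子
sift = cv2.SIFT_create()
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# 2. 近似最近邻(FLANN)匹配,并删除低于预设值的匹配对(比值测试)
flann = cv2.FlannBasedMatcher({"algorithm": 1, "trees": 5}, {"checks": 50})
matches = flann.knnMatch(des1, des2, k=2)
good = [m[0] for m in matches
        if len(m) == 2 and m[0].distance < 0.75 * m[1].distance]

pts1 = np.float32([kp1[m.queryIdx].pt for m in good])
pts2 = np.float32([kp2[m.trainIdx].pt for m in good])

# 3. RANSAC过滤误匹配,结合八点法估计基础矩阵
F, inlier_mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 3.0, 0.99)

# 4. 利用保留下来的目标匹配对估计变换矩阵H(h33归一化为1)
inliers = inlier_mask.ravel() == 1
H, _ = cv2.findHomography(pts1[inliers], pts2[inliers], cv2.RANSAC, 5.0)
print(H)
```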
603、数据处理装置基于第一图像、第二图像以及相对位姿,生成第三图像。
数据处理装置在获取第一图像、第二图像以及相对位姿之后,数据处理装置可以将第一图像以及第二图像输入训练好的光流计算网络进行光流计算,得到初始光流图(例如图9所示的初始光流图)。其中,初始光流图可以用于描述像素点的位移过程,初始光流图与第一图像以及第二图像的尺寸一致。
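作为示意,下面用OpenCV的Farneback稠密光流代替训练好的光流计算网络来得到“初始光流图”的草图(仅为便于理解的替代做法,本申请实施例中初始光流图由训练好的光流计算网络输出):

```python
import cv2

img1 = cv2.imread("first.jpg")   # 第一图像(假设的文件名)
img2 = cv2.imread("second.jpg")  # 第二图像

gray1 = cv2.cvtColor(img1, cv2.COLOR_BGR2GRAY)
gray2 = cv2.cvtColor(img2, cv2.COLOR_BGR2GRAY)

# 计算稠密光流,flow与输入图像尺寸一致,
# flow[y, x] = (t_x, t_y) 描述该像素点在X轴、Y轴方向的位移
flow = cv2.calcOpticalFlowFarneback(gray1, gray2, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)
print(flow.shape)  # (H, W, 2)
```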
数据处理装置可以通过前置图像扭曲(forward warping)方法处理第一图像以及初始光流图,得到第一目标光流图像(例如图9所示的I1)。通过forward warping方法处理第二图像以及初始光流图,得到第二目标光流图像(例如图9所示的I2)。可以理解为,通过初始光流图,得到第一图像的第一视角到第二图像的第二视角之间,第一图像中各像素点的运动方向以及距离。因此可以提供更多像素点的光流信息,使得后续生成的第三图像中像素点之间的过渡更加平缓。
可以理解的是,根据forward warping可以生成一个或多个目标光流图像,上述的第一目标光流图像和第二目标光流图像只是举例,目标光流图像的具体数量此处不做限定。
下面简单介绍一下使用forward warping处理第一图像以及初始光流图,得到第一目标光流图像的基本原理。第一目标光流图像可以通过下述的第一转换公式得到。
$$\begin{pmatrix} x_{2} \\ y_{2} \end{pmatrix}=\begin{pmatrix} x_{1}+t_{x} \\ y_{1}+t_{y} \end{pmatrix}$$
其中,$x_{1}$与$y_{1}$表示第一图像中某一个像素点P的坐标(也可以称为P点的旧坐标),$t_{x}$与$t_{y}$表示P点的旧坐标$(x_{1},y_{1})$下的光流在X轴方向以及Y轴方向移动距离的大小。因为第一图像的尺寸与初始光流图的尺寸是一样的,所以$(x_{1},y_{1})$与$(t_{x},t_{y})$可以一一对应。
通过上述第一转换公式可以看出,第一目标光流图像中P点的新坐标就是$(x_{2},y_{2})$,即$x_{2}=x_{1}+t_{x}$,$y_{2}=y_{1}+t_{y}$。
同理,第一图像中的每个像素点类似于上述像素点P的操作,根据第一图像与第一目标光流图像中各像素点的坐标变换关系,从第一图像映射到第一目标光流图像,对像素点进行赋值,赋值的过程中通过插值运算(例如最近邻差值法、双线性插值法、双三次插值法等)确定第一目标光流图像中各像素点的值,进而生成第一目标光流图像。可以理解的是,第二目标光流图像的生成方式与第一目标光流图像的生成方式类似,此处不再赘述。
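按照上述坐标变换关系,forward warping可以理解为把第一图像的每个像素“散射”到新坐标上并赋值,下面给出一个最简化的Python(NumPy)草图,其中用最近邻取整代替插值运算,且未处理多个像素落入同一位置的冲突;参数alpha(将光流按比例缩放到中间视角)为便于说明而假设的参数:

```python
import numpy as np

def forward_warp(image, flow, alpha=1.0):
    # image: (H, W, 3); flow: (H, W, 2),flow[y, x] = (t_x, t_y)
    h, w = image.shape[:2]
    out = np.zeros_like(image)
    ys, xs = np.mgrid[0:h, 0:w]
    x2 = np.rint(xs + alpha * flow[..., 0]).astype(int)  # x2 = x1 + t_x
    y2 = np.rint(ys + alpha * flow[..., 1]).astype(int)  # y2 = y1 + t_y
    valid = (x2 >= 0) & (x2 < w) & (y2 >= 0) & (y2 < h)
    out[y2[valid], x2[valid]] = image[ys[valid], xs[valid]]
    return out
```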
数据处理装置还可以通过图片图像扭曲(image warping)方法处理第一图像以及第一相对位姿得到第一扭曲图像(例如图9所示的I0)。通过image warping方法处理第二图像以及第二相对位姿得到第二扭曲图像(例如图9所示的I3)。可以理解为,由于利用相对位姿获取的第一扭曲图像以及第二扭曲图像,可以为后续图像修复网络提供更多的图像纹理信息,便于图像修复网络处理掉更多瑕疵。
可以理解的是,根据image warping可以生成一个或多个扭曲图像,上述的第一扭曲图像和第二扭曲图像只是举例,扭曲图像的具体数量此处不做限定。
下面简单介绍一下使用image warping处理第一图像以及第一相对位姿,得到第一扭曲图像的基本原理。可选地,第一相对位姿为第一变换矩阵(即H为3×3的矩阵),第一扭曲图像可以通过下述的第二转换公式得到。
$$x'=Hx,\quad H=\begin{pmatrix} h_{11} & h_{12} & h_{13} \\ h_{21} & h_{22} & h_{23} \\ h_{31} & h_{32} & h_{33} \end{pmatrix}$$
其中,上述第二转换公式中的x为第一图像中某一个像素点Q的旧坐标,H为前面求得的变换矩阵(可以用于描述相对位姿),x'为第一扭曲图像中像素点Q的新坐标点。另外,上述第二转换公式中H矩阵的最后一个元素$h_{33}$始终为1。
公式运算这里不做过多解释,利用第二转换公式,可以求解第一扭曲图像中像素点Q的新坐标点。同理,第一图像中的每个像素点类似于上述像素点Q的操作,根据第一图像与第一扭曲图像中各像素点的坐标变换关系,从第一图像映射到第一扭曲图像,对像素点进行赋值,赋值的过程中通过插值运算(例如最近邻差值法、双线性插值法、双三次插值法等)确定第一扭曲图像中各像素点的值,进而生成第一扭曲图像。可以理解的是,第二扭曲图像的生成方式与第一扭曲图像的生成方式类似,此处不再赘述。
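对于以3×3变换矩阵H表达的相对位姿,image warping可以直接用透视变换实现,下面是一个示意性草图,假设H1为前述估计得到的第一变换矩阵,插值方式采用双线性插值:

```python
import cv2

def image_warp(image, H):
    # 按 x' = Hx 将图像映射到目标视角,
    # warpPerspective内部完成坐标变换与双线性插值赋值
    h, w = image.shape[:2]
    return cv2.warpPerspective(image, H, (w, h), flags=cv2.INTER_LINEAR)

# 示例:通过第一相对位姿H1处理第一图像得到第一扭曲图像
# warped1 = image_warp(img1, H1)
```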
数据处理装置将第一目标光流图像、第一扭曲图像、第二目标光流图像以及第二扭曲图像输入训练好的图像修复网络进行图像修复,得到第一视角与第二视角之间的中间视角(即第三视角)对应的第三图像(例如图9所示,将I0、I1、I2、I3输入训练好的图像修复网络中,得到第三图像)。其中,光流计算网络和图像修复网络都采用了基于Unet结构的CNN网络。
其中,中间视角可以理解为,第一图像的第一平面法向量与第二图像的第二平面法向量经过平移后,两个平面法向量之间的任意一条射线对应的图像都可以称为是第一视角与第二视角之间第三视角对应的第三图像。
由于第一扭曲图像和第二扭曲图像是根据第一图像、第二图像以及相对位姿生成的,所以第一扭曲图像和第二扭曲图像的特征覆盖比较全面,且第一目标光流图像和第二目标光流图像是根据光流信息得到的,第一目标光流图像和第二目标光流图像的细节特征(即光流特征)比较全面,所以结合特征比较全的第一扭曲图像和第二扭曲图像以及细节特征比较明显的第一目标光流图像和第二目标光流图像,可以实现第一扭曲图像和第二扭曲图像以及第一目标光流图像和第二目标光流图像之间的信息互补,有助于后续图像修复网络生成的中间视角图像具备更多特征以及细节的特征。
本申请实施例中,第三图像的数量根据实际需要设置,具体此处不做限定。
示例性的,如图9所示,延续图7与图8的示例,通过步骤603得到图9所示的第三视角对应的第三图像。
上述训练好的光流计算网络以及训练好的图像修复网络是通过以第一训练图像以及第二训练图像作为光流计算网络的输入,以损失函数的值小于第二阈值为目标对光流计算网络以及图像修复网络进行联合训练得到,该损失函数用于指示图像修复网络输出的图像与第三目标图像之间的差异,第三目标图像为在第一目标图像对应第一目标视角和第二目标图像对应第二目标视角之间的视角下采集的图像。
其中,光流计算网络以及图像修复网络进行联合训练是指:将光流计算网络与图像修复网络作为一个整体网络进行训练,也可以理解为,相较于中间生成的两个目标光流图像,联合训练更看重整体网络输出的第三图像的效果。
进一步的,光流计算网络以及图像修复网络采用端到端整体训练的方式完成,首先利用标定好的多相机系统采集大量的训练数据集,训练数据集主要由多组3张图像构成,一幅左图像(即第一训练图像),一幅右图像(第二训练图像)和一幅中间图像(第三目标图像)。在训练阶段,将左图像和右图像作为输入,中间图像作为输出来监督整个网络的端到端学习。当然,训练时若是输入多张,输出一张,则具体实现得到的第三图像为一张。若训练时输入多张,输出多张,则具体实现得到的第三图像为多张。实际应用中,第三图像是一张还是多张可以根据训练时的输入输出数量进行相应调整,具体此处不做限定。
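下面给出光流计算网络与图像修复网络端到端联合训练的一个示意性PyTorch草图,其中FlowNet、InpaintNet仅为占位的假设网络结构(实际可为基于Unet结构的CNN),make_references为假设的外部函数(根据光流及相对位姿生成I0~I3四个参考图像),损失函数以输出图像与中间图像的L1差异为例,学习率等均为假设值:

```python
import torch
import torch.nn as nn

class FlowNet(nn.Module):            # 假设的光流计算网络占位
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(6, 2, 3, padding=1)
    def forward(self, left, right):
        return self.net(torch.cat([left, right], dim=1))

class InpaintNet(nn.Module):         # 假设的图像修复网络占位
    def __init__(self):
        super().__init__()
        self.net = nn.Conv2d(12, 3, 3, padding=1)
    def forward(self, i0, i1, i2, i3):
        return self.net(torch.cat([i0, i1, i2, i3], dim=1))

flow_net, inpaint_net = FlowNet(), InpaintNet()
optimizer = torch.optim.Adam(list(flow_net.parameters()) +
                             list(inpaint_net.parameters()), lr=1e-4)
criterion = nn.L1Loss()

def train_step(left, right, middle_gt, make_references):
    # 联合训练:光流计算网络与图像修复网络作为一个整体,由中间图像监督
    flow = flow_net(left, right)
    i0, i1, i2, i3 = make_references(left, right, flow)
    pred = inpaint_net(i0, i1, i2, i3)
    loss = criterion(pred, middle_gt)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```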
数据处理装置生成第三图像之后,可以向第一拍摄设备和/或第二拍摄设备发送该第三图像,使得使用第一拍摄设备和/或第二拍摄设备的用户可以查看第三图像。
本申请实施例中,基于第一图像、第二图像以及第一图像与第二图像之间的相对位姿,生成第三图像,该第三图像的视角在第一视角与第二视角之间。通过已有的视角图像以及相对位姿合成其他视角图像,提升输出效果的精细程度。
进一步的,结合特征比较全的第一扭曲图像和第二扭曲图像以及细节特征比较明显的第一目标光流图像和第二目标光流图像,可以实现第一扭曲图像和第二扭曲图像以及第一目标光流图像和第二目标光流图像之间的信息互补,有助于后续图像修复网络生成的中间视角图像具备更多特征以及细节的特征,便于图像修复网络处理掉更多瑕疵,使得生成的第三图像更加平缓。
全景精彩瞬间是一种利用计算机视觉技术来实现强化的慢镜头、时间暂停的特效,该技术被应用于电影电视(如:黑客帝国中的子弹时间)、体育赛事直播(如:Intel TrueView)等领域。
目前,获取全景精彩瞬间的方式是:提前挑选好场地(如:篮球场),并在场地周围的固定位置设置多个高清摄像机,利用大量昂贵的多个高清摄像机来同步聚焦一个场景。再利用3D建模的方法重新创建等体积的3D人物图像(如:篮球运动员)。对场景以及3D人物图像进行渲染,得到全景精彩瞬间。使得观众体验到了传统直播无法带来的震撼及身临其境的感觉。
然而,上述方式中的高清摄像机需要提前设置好固定位置,如果需要在其他场景下获取全景精彩视频,需要重新设置高清摄像头的位置,使得上述方式的应用场景不够灵活。
本申请实施例提供了一种数据处理方法及相关设备。可以用于生成中间视角的图像。
针对上述问题,本申请还提供一种数据处理方法,可以通过移动的设备(例如手机)获取全景精彩视频。
请参阅图10,本申请实施例中数据处理方法另一实施例包括:
1001、数据处理装置获取第一图像以及第二图像。
本申请实施例中的数据处理装置可以是图1所示场景中的主设备101或云端设备100,具体此处不做限定。
本实施例中第一图像与第二图像可以是直接通过拍摄设备采集得到的,也可以是通过拍 摄设备采集的图像处理得到。即第一图像由第一视角下第一拍摄设备采集的图像处理得到,第二图像由第二视角下第二拍摄设备采集的图像处理得到。或者第一图像与第二图像为第一拍摄设备与第二拍摄设备针对同一拍摄对象在同一时刻不同视角下采集的图像,且所述第一图像与所述第二图像存在交叠内容。
下面以第一图像与第二图像是通过拍摄设备采集的图像处理得到为例具体描述第一图像与第二图像的获取过程。另外,本申请实施例中的数据处理装置可以是第一拍摄设备、第二拍摄设备、与第一拍摄设备以及第二拍摄设备连接的目标拍摄设备(即图1所示场景中的主设备101)或云端设备,具体此处不做限定。
第一拍摄设备在第一视角下采集第一原始图像,该第一原始图像包括目标人物以及除了目标人物以外的第一背景。第二拍摄设备在第二视角下采集第二原始图像,该第二原始图像包括目标人物以及除了目标人物以外的第二背景。该目标人物相当于前面的拍摄对象。
本申请实施例中数据处理装置获取第一图像以及第二图像的方式有多种,下面分别描述:
1、数据处理装置从第一原始图像以及第二原始图像中提取第一图像以及第二图像。
数据处理装置获取第一拍摄设备采集的第一原始图像以及第二拍摄设备采集的第二原始图像。并提取第一原始图像中的第一人物图像以及第二原始图像中的第二人物图像,该第一人物图像以及第二人物图像都包括目标人物。数据处理装置确定第一人物图像为第一图像,确定第二人物图像为第二图像。
示例性的,如图11所示,数据处理装置可以分割第一原始图像,得到第一人物图像以及第一背景图像。数据处理装置可以分割第二原始图像,得到第二人物图像以及第二背景图像。并确定第一人物图像为第一图像,确定第二人物图像为第二图像。
当然,数据处理装置也可以直接从第一原始图像中提取出第一人物图像,提取第一人物图像所采用的方式有多种,具体此处不做限定。数据处理装置还可以先采用基于CNN的人像分割算法分割第一原始图像以及第二原始图像,分别得到第一二值分割图以及第二二值分割图,两个分割图的前景区域的像素值为1(目标人物的区域),背景区域的像素值为0(除了目标人物以外的背景区域)。并根据第一原始图像以及第一二值分割图得到第一人物图像,根据第二原始图像以及第二二值分割图得到第二人物图像。数据处理装置再确定第一人物图像为第一图像,确定第二人物图像为第二图像。
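下面给出“利用二值分割图从原始图像中分离人物与背景”这一步的示意性Python(NumPy)草图,其中segment_person为假设的人像分割函数占位(例如可由基于CNN的人像分割算法实现),其返回前景为1、背景为0的二值分割图:

```python
import numpy as np

def segment_person(image):
    # 假设的占位函数:返回与image同尺寸的二值分割图(前景1/背景0)
    raise NotImplementedError

def split_person_and_background(image):
    mask = segment_person(image).astype(np.uint8)
    person = image * mask[..., None]             # 人物图像(背景区域置0)
    hole_image = image * (1 - mask[..., None])   # 空洞图像(人物区域置0)
    return person, hole_image, mask
```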
2、数据处理装置从第一拍摄设备与第二拍摄设备获取第一图像以及第二图像。
第一拍摄设备从第一原始图像提取出第一人物图像,在向数据处理装置发送第一人物图像,也可以是第二拍摄设备从第二原始图像提取出第二人物图像,在向数据处理装置发送第二人物图像。数据处理装置确定第一人物图像为第一图像,确定第二人物图像为第二图像。
本申请实施例中数据处理装置获取第一图像以及第二图像的方式有多种,上述两种只是举例,具体此处不做限定。
1002、数据处理装置获取第一图像与第二图像之间的相对位姿。
本申请实施例中,数据处理装置获取相对位姿的方式有很多。下面仅以SFM算法为例进行示意性说明。
可以通过特征点提取以及SFM的方式估计第一图像与第二图像之间的相对位姿。
使用SIFT特征检测器提取第一图像以及第二图像的特征点,并计算该特征点对应的描述 子(descriptor),使用ANN方法进行匹配,得到匹配对(如图12所示)。然后,将低于预设值的匹配对删除。对保留下来的匹配使用RANSAC和八点法估计基本矩阵,从而得到第一拍摄设备与第二拍摄设备之间的相对位姿。
本实施例中的步骤1002与前述图6对应步骤602类似,此处不再赘述。
1003、数据处理装置基于第一图像、第二图像以及相对位姿,生成第三图像。
本实施例中的步骤1003中生成第三图像的方法与前述图6对应步骤603生成第三图像的方法类似,下面结合附图描述下方法流程,具体原理以及实现方式可以参考前述图6对应步骤603,此处不再过多赘述。
数据处理装置可以通过forward warping方法处理第一图像以及初始光流图,得到第一目标光流图像(例如图13所示的I5)。通过forward warping方法处理第二图像以及初始光流图,得到第二目标光流图像(例如图13所示的I6)。可以理解为,通过初始光流图,得到第一图像的第一视角到第二图像的第二视角之间,第一图像中各像素点的运动方向以及距离。因此可以提供更多像素点的光流信息,使得后续生成的第三图像中像素点之间的过渡更加平缓。
数据处理装置还可以通过image warping方法处理第一图像以及第一相对位姿得到第一扭曲图像(例如图13所示的I4)。通过image warping方法处理第二图像以及第二相对位姿得到第二扭曲图像(例如图13所示的I7)。可以理解为,由于利用相对位姿获取的第一扭曲图像以及第二扭曲图像,可以为后续图像修复网络提供更多的图像纹理信息,便于图像修复网络处理掉更多瑕疵。
数据处理装置将第一目标光流图像、第一扭曲图像、第二目标光流图像以及第二扭曲图像输入训练好的图像修复网络进行图像修复,得到第一视角与第二视角之间的中间视角(即第三视角)对应的第三图像(例如图13所示,将I4、I5、I6、I7输入训练好的图像修复网络中,得到第三图像)。
本申请实施例中第三图像的数量可以是一个或更多(例如图14所示,第三图像的数量为2个),具体此处不做限定。
1004、数据处理装置拼接第一背景图像以及第二背景图像,得到目标背景图像。
示例性的,如图11所示,数据处理装置从第一原始图像提取出第一人物图像之后,得到第一背景图像,从第二原始图像提取出第二人物图像之后,得到第二背景图像。
可选地,第一人物图像与第二人物图像存在重叠内容,例如第一人物图像与第二人物图像都存在同一人物。
可选地,对于第一图像与第二图像是拍摄设备采集得到的情况,上述中的第一原始图像也可以理解是第一图像,第二原始图像可以理解为是第二图像。
数据处理装置也可以直接从第一原始图像中提取第一背景图像,直接从第二原始图像中提取第二背景图像。
可以理解的是,如图15所示,数据处理装置还可以简单分割第一原始图像得到第一空洞图像,再根据第一原始图像填充第一空洞图像,得到第一背景图像。数据处理装置还可以简单分割第二原始图像得到第二空洞图像,再根据第二原始图像填充第二空洞图像,得到第二背景图像。
可选地,第一空洞图像也可以理解为是从第一原始图像中扣除或者分离拍摄对象的区域 后得到的图像。
数据处理装置根据第一原始图像填充第一空洞图像,得到第一背景图像的具体过程也可以采用CNN来实现背景空洞的填充工作,具体此处不做限定。
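填充空洞得到背景图像可以采用CNN实现,作为示意,下面给出用OpenCV传统图像修复算法代替的Python草图(仅为便于理解的替代做法,mask为上文得到的人物区域二值分割图):

```python
import cv2
import numpy as np

def fill_background_hole(hole_image, mask):
    # hole_image: 扣除人物后的空洞图像; mask: 人物区域为1的二值分割图
    inpaint_mask = (mask * 255).astype(np.uint8)
    # 利用空洞周围的背景信息填充人物留下的空洞,得到完整的背景图像
    return cv2.inpaint(hole_image, inpaint_mask, 3, cv2.INPAINT_TELEA)
```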
数据处理装置获取第一背景图像以及第二背景图像之后,可以直接拼接第一背景图像以及第二背景图像,得到目标背景图像。
可选地,对第一背景图像以及第二背景图像进行SIFT特征点提取,然后进行特征点匹配,再将第一背景图像以及第二背景图像的重叠边界进行特殊处理(例如平滑处理),使得第一背景图像以及第二背景图像拼接成目标背景图像(如图16所示)。
可选地,为了第一背景图像以及第二背景图像拼接准确,可以参考相对位姿拼接第一背景图像以及第二背景图像得到目标背景图像。
示例性的,当背景图像为三张时,拼接成的目标背景图像如图17所示。
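拼接第一背景图像以及第二背景图像(特征点匹配加重叠边界平滑)的过程,可以借助OpenCV的图像拼接器来示意,下面是一个假设性的草图;如上文所述,实际实现也可以参考相对位姿进行拼接:

```python
import cv2

def stitch_backgrounds(backgrounds):
    # backgrounds: 背景图像列表,例如[第一背景图像, 第二背景图像]
    stitcher = cv2.Stitcher_create(cv2.Stitcher_PANORAMA)
    status, target_background = stitcher.stitch(backgrounds)
    if status != cv2.Stitcher_OK:
        raise RuntimeError("stitch failed, status=%d" % status)
    return target_background  # 目标背景图像
```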
1005、数据处理装置融合第三图像以及目标背景图像,得到目标图像。
数据处理装置得到目标背景图像以及第三图像之后,融合目标背景图像以及第三图像之后得到目标图像。
示例性的,如图18所示,数据处理装置融合第三图像以及目标背景图像,得到目标图像。
可选地,数据处理装置利用泊松融合(Poisson Blending)将第三图像融合到目标背景图像的某一区域(例如,中心区域),得到目标图像,从而实现更自然的融合效果,而且该目标图像即为输出的视频中的一帧。其中的融合利用了泊松融合技术,该技术是根据第三图像的梯度信息以及目标背景图像的边界信息,将第三图像中一个物体或者一个区域嵌入到目标背景图像生成一个新的图像,即目标图像。
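泊松融合可以直接调用OpenCV的seamlessClone实现,下面给出将第三图像融合到目标背景图像某一区域的示意性草图,其中融合位置取目标背景图像的中心、mask默认取第三图像全图,均为假设的示例取值:

```python
import cv2
import numpy as np

def poisson_blend(third_image, target_background, mask=None):
    # mask: 第三图像中需要融合的区域(8位单通道),默认取全图
    if mask is None:
        mask = 255 * np.ones(third_image.shape[:2], dtype=np.uint8)
    h, w = target_background.shape[:2]
    center = (w // 2, h // 2)   # 假设融合到目标背景图像的中心区域
    return cv2.seamlessClone(third_image, target_background,
                             mask, center, cv2.NORMAL_CLONE)
```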
数据处理装置还可以利用泊松融合将第一图像以及目标背景图像融合生成第一目标图像,还可以利用泊松融合将第二图像以及目标背景图像融合生成第二目标图像,并压缩第一目标图像、目标图像以及第二目标图像生成目标视频。
示例性的,若第三图像如图13所示,第三图像为一张,则生成的目标视频可以如图19所示。其中,目标视频的第一帧为第一目标图像,目标视频的第二帧为目标图像,目标视频的第三帧为第二目标图像。
示例性的,若第三图像如图14所示,第三图像为两张,则生成的目标视频可以如图20所示。其中,目标视频的第一帧为第一目标图像,目标视频的第二帧以及第三帧为目标图像,目标视频的第四帧为第二目标图像。
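将第一目标图像、目标图像与第二目标图像按播放顺序压缩(编码)为目标视频,可以用OpenCV的VideoWriter来示意,其中帧率、编码格式与输出文件名均为假设值:

```python
import cv2

def frames_to_video(frames, path="target_video.mp4", fps=25):
    # frames: 按顺序排列的帧列表,例如[第一目标图像, 目标图像, 第二目标图像]
    h, w = frames[0].shape[:2]
    writer = cv2.VideoWriter(path, cv2.VideoWriter_fourcc(*"mp4v"), fps, (w, h))
    for frame in frames:
        writer.write(cv2.resize(frame, (w, h)))  # 保证各帧尺寸一致
    writer.release()
    return path
```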
数据处理装置得到目标视频后,可以向第一拍摄设备和/或第二拍摄设备发送目标视频,使得使用第一拍摄设备和/或第二拍摄设备的用户可以观看目标视频。本申请实施例中,一方面:通过利用相对位姿获取到的参考图像I4和I7以及通过光流信息获取到的参考图像I5和I6,可以使得参考图像之间有互补的地方,便于图像修复网络处理掉更多瑕疵,使得生成的第三图像更加平缓。另一方面:可以根据第一图像、第二图像、第三图像以及目标背景图像生成目标视频(也可以是全景精彩视频)。另外,由于第一拍摄设备、第二拍摄设备以及数据处理装置可以是手机,可以使用移动的手机来生成全景精彩瞬间(即目标视频),相较于固定机位的全景精彩瞬间,本方法更具有灵活性。
相应于上述方法实施例给出的方法,本申请实施例还提供了相应的装置,包括用于执行 上述实施例相应的模块。所述模块可以是软件,也可以是硬件,或者是软件和硬件结合。
请参阅图21,本申请实施例中数据处理装置的一个实施例,该数据处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该数据处理装置包括:
获取单元2101,用于获取第一图像以及第二图像,第一图像为第一视角下采集的图像,第二图像为第二视角下采集的图像,第一图像的采集时刻与第二图像的采集时刻相同。
可选地,第一图像与第二图像为第一采集设备与第二采集设备针对同一拍摄对象在同一时刻不同视角下采集的图像,且第一图像与第二图像存在第一交叠内容,该第一交叠内容包括该拍摄对象;
获取单元2101,还用于获取第一图像与第二图像之间的相对位姿。
生成单元2102,用于基于第一图像、第二图像以及相对位姿,生成第三图像,第三图像的视角在第一视角与第二视角之间。
可选地,上述的相对位姿可以理解为是第一位姿与第二位姿之间的相对位姿,其中,第一位姿为第一采集设备采集第一图像时的位姿;第二位姿为第二采集设备采集第二图像时的位姿。
可选地,第三图像包括第一交叠内容的部分或全部,且第三图像包括该拍摄对象。可选地,第一图像中的拍摄对象与第二图像中的拍摄对象的朝向有重叠。
本实施例中,数据处理装置中各单元所执行的操作与前述图6至图20所示实施例中描述的类似,此处不再赘述。
本实施例中,生成单元2102基于第一图像、第二图像以及第一图像与第二图像之间的相对位姿,生成第三图像,该第三图像的视角在第一视角与第二视角之间。通过已有的视角图像以及相对位姿合成其他视角图像,提升输出效果的精细程度。
请参阅图22,本申请实施例中数据处理装置的另一实施例,该数据处理装置可以是本地设备(例如,手机、摄像机等)或云端设备。该数据处理装置包括:
获取单元2201,用于获取第一图像以及第二图像,第一图像为第一视角下采集的图像,第二图像为第二视角下采集的图像,第一图像的采集时刻与第二图像的采集时刻相同。
获取单元2201,还用于获取第一图像与第二图像之间的相对位姿。
生成单元2202,用于基于第一图像、第二图像以及相对位姿,生成第三图像,第三图像的视角在第一视角与第二视角之间。
上述生成单元2202还包括:
光流计算子单元22021,用于将第一图像以及第二图像输入训练好的光流计算网络进行光流计算,得到初始光流图。
第一扭曲子单元22022,用于通过前置图像扭曲forward warping方法处理第一图像以及初始光流图得到第一目标光流图像。
第一扭曲子单元22022,还用于通过forward warping方法处理第二图像以及初始光流图得到第二目标光流图像。
第二扭曲子单元22023,用于通过图片图像扭曲image warping方法处理第一图像以及第一相对位姿得到第一扭曲图像。
第二扭曲子单元22023,用于通过image warping方法处理第二图像以及第二相对位姿 得到第二扭曲图像。
修复子单元22024,用于将第一目标光流图像、第一扭曲图像、第二目标光流图像、第二扭曲图像输入训练好的图像修复网络进行图像修复,得到第三图像。
本实施例中的数据处理装置还包括:
拼接单元2203,用于拼接第一背景图像以及第二背景图像,得到目标背景图像。
融合单元2204,用于融合第三图像以及目标背景图像,得到目标图像。
融合单元2204,还用于融合第一图像以及目标背景图像,得到第一目标图像。
融合单元2204,还用于融合第二图像以及目标背景图像,得到第二目标图像。
压缩单元2205,用于压缩第一目标图像、目标图像以及第二目标图像得到目标视频。
发送单元2206,用于向第一拍摄设备发送目标视频,第一拍摄设备为采集第一图像的设备。
可选地,第一背景图像是第一图像中除了拍摄对象以外的背景,第二背景图像是第二图像中除了拍摄对象以外的背景。
可选地,压缩单元2205,用于基于第一目标图像、目标图像以及第二目标图像生成目标视频。
本实施例中,数据处理装置中各单元所执行的操作与前述图6至图20所示实施例中描述的类似,此处不再赘述。
本实施例中,一方面:生成单元2202通过利用相对位姿获取到的参考图像I4和I7以及通过光流信息获取到的参考图像I5和I6,可以使得参考图像之间有互补的地方,便于图像修复网络处理掉更多瑕疵,使得生成的第三图像更加平缓。另一方面:压缩单元2205可以根据第一图像、第二图像、第三图像以及目标背景图像生成目标视频(也可以是全景精彩视频)。另外,由于第一拍摄设备、第二拍摄设备以及数据处理装置可以是手机,可以使用移动的手机来生成全景精彩瞬间(即目标视频),相较于固定机位的全景精彩瞬间,本方法更具有灵活性。
请参阅图23,本申请实施例提供了另一种数据处理装置,为了便于说明,仅示出了与本申请实施例相关的部分,具体技术细节未揭示的,请参照本申请实施例方法部分。该数据处理装置可以为包括手机、平板电脑、个人数字助理(personal digital assistant,PDA)、销售终端设备(point of sales,POS)、车载电脑等任意终端设备,以数据处理装置为手机为例:
图23示出的是与本申请实施例提供的手机的部分结构的框图。参考图23,手机包括:射频(radio frequency,RF)电路2310、存储器2320、输入单元2330、显示单元2340、传感器2350、音频电路2360、无线保真(wireless fidelity,WiFi)模块2370、处理器2380、以及摄像头2390等部件。本领域技术人员可以理解,图23中示出的手机结构并不构成对手机的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图23对手机的各个构成部件进行具体的介绍:
RF电路2310可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给处理器2380处理;另外,将设计上行的数据发送给基站。通常,RF电路2310包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(low noise  amplifier,LNA)、双工器等。此外,RF电路2310还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(global system of mobile communication,GSM)、通用分组无线服务(general packet radio service,GPRS)、码分多址(code division multiple access,CDMA)、宽带码分多址(wideband code division multiple access,WCDMA)、长期演进(long term evolution,LTE)、电子邮件、短消息服务(short messaging service,SMS)等。
存储器2320可用于存储软件程序以及模块,处理器2380通过运行存储在存储器2320的软件程序以及模块,从而执行手机的各种功能应用以及数据处理。存储器2320可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据手机的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器2320可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元2330可用于接收输入的数字或字符信息,以及产生与手机的用户设置以及功能控制有关的键信号输入。具体地,输入单元2330可包括触控面板2331以及其他输入设备2332。触控面板2331,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板2331上或在触控面板2331附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板2331可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器2380,并能接收处理器2380发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板2331。除了触控面板2331,输入单元2330还可以包括其他输入设备2332。具体地,其他输入设备2332可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元2340可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元2340可包括显示面板2341,可选的,可以采用液晶显示器(liquid crystal display,LCD)、有机发光二极管(organic light-emitting diode,OLED)等形式来配置显示面板2341。进一步的,触控面板2331可覆盖显示面板2341,当触控面板2331检测到在其上或附近的触摸操作后,传送给处理器2380以确定触摸事件的类型,随后处理器2380根据触摸事件的类型在显示面板2341上提供相应的视觉输出。虽然在图23中,触控面板2331与显示面板2341是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板2331与显示面板2341集成而实现手机的输入和输出功能。
手机还可包括至少一种传感器2350,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板2341的亮度,接近传感器可在手机移动到耳边时,关闭显示面板2341和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别手机姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配 置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路2360、扬声器2361,传声器2362可提供用户与手机之间的音频接口。音频电路2360可将接收到的音频数据转换后的电信号,传输到扬声器2361,由扬声器2361转换为声音信号输出;另一方面,传声器2362将收集的声音信号转换为电信号,由音频电路2360接收后转换为音频数据,再将音频数据输出处理器2380处理后,经RF电路2310以发送给比如另一手机,或者将音频数据输出至存储器2320以便进一步处理。
WiFi属于短距离无线传输技术,手机通过WiFi模块2370可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图23示出了WiFi模块2370,但是可以理解的是,其并不属于手机的必须构成。
处理器2380是手机的控制中心,利用各种接口和线路连接整个手机的各个部分,通过运行或执行存储在存储器2320内的软件程序和/或模块,以及调用存储在存储器2320内的数据,执行手机的各种功能和处理数据,从而对手机进行整体监控。可选的,处理器2380可包括一个或多个处理单元;优选的,处理器2380可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到处理器2380中。
手机还包括摄像头2390,优选的,摄像头2390可以采集第一图像和/或第二图像,并与处理器2380逻辑相连,从而通过处理器2380实现对第一图像以及第二图像的处理,具体处理流程可参照前述图6至图20所示实施例中的步骤。
尽管未示出,手机还可以包括电源(比如电池)、蓝牙模块等,在此不再赘述。优选的,电源可以通过电源管理系统与处理器2380逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
在本申请实施例中,该数据处理装置所包括的处理器2380可以执行前述图6至图20所示实施例中的功能,此处不再赘述。
下面介绍本申请实施例提供的一种芯片硬件结构。
图24为本发明实施例提供的一种芯片硬件结构,该芯片包括神经网络处理器240。该芯片可以被设置在如图3所示的执行设备110中,用以完成计算模块111的计算工作。该芯片也可以被设置在如图3所示的训练设备120中,用以完成训练设备120的训练工作并输出目标模型/规则101。如图4或图5所示的卷积神经网络中各层的算法均可在如图24所示的芯片中得以实现。
神经网络处理器NPU 240作为协处理器挂载到主CPU(Host CPU)上,由Host CPU分配任务。NPU的核心部分为运算电路2403,控制器2404控制运算电路2403提取存储器(权重存储器或输入存储器)中的数据并进行运算。
在一些实现中,运算电路2403内部包括多个处理单元(process engine,PE)。在一些实现中,运算电路2403是二维脉动阵列。运算电路2403还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路2403是通用的矩阵处理器。
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器2402中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器2401中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器(accumulator)2408中。
向量计算单元2407可以对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。例如,向量计算单元2407可以用于神经网络中非卷积/非FC层的网络计算,如池化(Pooling),批归一化(batch normalization),局部响应归一化(local response normalization)等。
在一些实现中,向量计算单元2407将经处理的输出的向量存储到统一存储器2406。例如,向量计算单元2407可以将非线性函数应用到运算电路2403的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元2407生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路2403的激活输入,例如用于在神经网络中的后续层中的使用。
统一存储器2406用于存放输入数据以及输出数据。
存储单元访问控制器2405(direct memory access controller,DMAC)用于将外部存储器中的输入数据搬运到输入存储器2401和/或统一存储器2406、将外部存储器中的权重数据存入权重存储器2402,以及将统一存储器2406中的数据存入外部存储器。
总线接口单元(bus interface unit,BIU)2410,用于通过总线实现主CPU、DMAC和取指存储器2409之间进行交互。
与控制器2404连接的取指存储器(instruction fetch buffer)2409,用于存储控制器2404使用的指令。
控制器2404,用于调用取指存储器2409中缓存的指令,实现控制该运算加速器的工作过程。
一般地,统一存储器2406,输入存储器2401,权重存储器2402以及取指存储器2409均为片上(On-Chip)存储器,外部存储器为该NPU外部的存储器,该外部存储器可以为双倍数据率同步动态随机存储器(double data rate synchronous dynamic random access memory,简称DDR SDRAM)、高带宽存储器(high bandwidth memory,HBM)或其他可读可写的存储器。
其中,图4或图5所示的卷积神经网络中各层的运算可以由运算电路2403或向量计算单元2407执行。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个 单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。

Claims (18)

  1. 一种数据处理方法,其特征在于,包括:
    获取第一图像以及第二图像,所述第一图像与所述第二图像为第一采集设备与第二采集设备针对同一拍摄对象在同一时刻不同视角下采集的图像,且所述第一图像与所述第二图像存在第一交叠内容,所述第一交叠内容包括所述拍摄对象;
    获取第一位姿与第二位姿之间的相对位姿;所述第一位姿为所述第一采集设备采集所述第一图像时的位姿;所述第二位姿为所述第二采集设备采集所述第二图像时的位姿;
    基于所述第一图像、所述第二图像以及所述相对位姿,生成第三图像,所述第三图像包括所述第一交叠内容的部分或全部,且所述第三图像包括所述拍摄对象。
  2. 根据权利要求1所述的方法,其特征在于,所述方法还包括:
    基于所述第一图像的背景、所述第二图像的背景以及所述第三图像得到目标图像,所述目标图像包括所述第三图像中的所述拍摄对象,且所述目标图像还包括所述第一图像的背景中的部分或全部与所述第二图像的背景中的部分或全部,所述第一图像的背景为所述第一图像中除了所述拍摄对象以外的背景,所述第二图像的背景为所述第二图像中除了所述拍摄对象以外的背景。
  3. 根据权利要求2所述的方法,其特征在于,所述基于所述第一图像的背景、所述第二图像的背景以及所述第三图像得到目标图像,包括:
    分离所述第一图像中的所述拍摄对象得到第一空洞图像;
    基于所述第一图像填充所述第一空洞图像得到所述第一图像的背景;
    分离所述第二图像中的所述拍摄对象得到第二空洞图像;
    基于所述第二图像填充所述第二空洞图像得到所述第二图像的背景;
    拼接所述第一图像的背景、所述第二图像的背景以及所述第三图像得到所述目标图像。
  4. 根据权利要求3所述的方法,其特征在于,所述方法还包括:
    融合所述第一图像、所述第一图像的背景以及所述第二图像的背景,得到第一目标图像;
    融合所述第二图像、所述第一图像的背景以及所述第二图像的背景,得到第二目标图像;
    基于所述第一目标图像、所述目标图像以及所述第二目标图像生成目标视频。
  5. 根据权利要求4所述的方法,其特征在于,所述方法还包括:
    向所述第一采集设备和/或所述第二采集设备发送所述目标视频。
  6. 根据权利要求1至5中任一项所述的方法,其特征在于,所述相对位姿包括第一相对位姿与第二相对位姿,所述第一相对位姿为所述第一采集时刻的所述第一采集设备相对于所述第二采集时刻的所述第二采集设备的位姿,所述第二相对位姿为所述第二采集时刻的所述第二采集设备相对于所述第一采集时刻的所述第一采集设备的位姿;
    所述基于所述第一图像、所述第二图像以及所述相对位姿,生成第三图像,包括:
    将所述第一图像以及所述第二图像输入训练好的光流计算网络进行光流计算,得到初始光流图;
    通过前置图像扭曲forward warping方法处理所述第一图像以及所述初始光流图得到第一目标光流图像;
    通过forward warping方法处理所述第二图像以及所述初始光流图得到第二目标光流图像;
    通过图片图像扭曲image warping方法处理所述第一图像以及所述第一相对位姿得到第一扭曲图像;
    通过image warping方法处理所述第二图像以及所述第二相对位姿得到第二扭曲图像;
    将所述第一目标光流图像、所述第一扭曲图像、所述第二目标光流图像、所述第二扭曲图像输入训练好的图像修复网络进行图像修复,得到所述第三图像。
  7. 根据权利要求6所述的方法,其特征在于,所述训练好的光流计算网络以及所述训练好的图像修复网络是通过以第一训练图像以及第二训练图像作为所述光流计算网络的输入,以损失函数的值小于第二阈值为目标对光流计算网络以及图像修复网络进行联合训练得到;
    所述损失函数用于指示图像修复网络输出的图像与第三目标图像之间的差异,所述第三目标图像为在第一目标图像对应第一目标视角和第二目标图像对应第二目标视角之间的视角下采集的图像。
  8. 根据权利要求1至7中任一项所述的方法,所述相对位姿的表达形式为变换矩阵,所述变换矩阵用于描述所述第一图像与所述第二图像之间像素点的关联关系。
  9. 一种数据处理装置,其特征在于,包括:
    获取单元,用于获取第一图像以及第二图像,所述第一图像与所述第二图像为第一采集设备与第二采集设备针对同一拍摄对象在同一时刻不同视角下采集的图像,且所述第一图像与所述第二图像存在第一交叠内容,所述第一交叠内容包括所述拍摄对象;
    所述获取单元,还用于获取第一位姿与第二位姿之间的相对位姿;所述第一位姿为所述第一采集设备采集所述第一图像时的位姿;所述第二位姿为所述第二采集设备采集所述第二图像时的位姿;
    生成单元,用于基于所述第一图像、所述第二图像以及所述相对位姿,生成第三图像,所述第三图像包括所述第一交叠内容的部分或全部,且所述第三图像包括所述拍摄对象。
  10. 根据权利要求9所述的数据处理装置,其特征在于,所述数据处理装置还包括:
    拼接单元,用于基于所述第一图像的背景、所述第二图像的背景以及所述第三图像得到目标图像,所述目标图像包括所述第三图像中的所述拍摄对象,且所述目标图像还包括所述第一图像的背景中的部分或全部与所述第二图像的背景中的部分或全部,所述第一图像的背景为所述第一图像中除了所述拍摄对象以外的背景,所述第二图像的背景为所述第二图像中除了所述拍摄对象以外的背景。
  11. 根据权利要求10所述的数据处理装置,其特征在于,所述拼接单元,具体用于分离所述第一图像中的所述拍摄对象得到第一空洞图像;
    所述拼接单元,具体用于基于所述第一图像填充所述第一空洞图像得到所述第一图像的背景;
    所述拼接单元,具体用于分离所述第二图像中的所述拍摄对象得到第二空洞图像;
    所述拼接单元,具体用于基于所述第二图像填充所述第二空洞图像得到所述第二图像的背景;
    所述拼接单元,具体用于拼接所述第一图像的背景、所述第二图像的背景以及所述第三图像得到所述目标图像。
  12. 根据权利要求11所述的数据处理装置,其特征在于,所述数据处理装置还包括:
    融合单元,用于融合所述第一图像、所述第一图像的背景以及所述第二图像的背景,得到第一目标图像;
    所述融合单元,还用于融合所述第二图像、所述第一图像的背景以及所述第二图像的背景,得到第二目标图像;
    压缩单元,用于基于所述第一目标图像、所述目标图像以及所述第二目标图像生成目标视频。
  13. 根据权利要求12所述的数据处理装置,其特征在于,所述数据处理装置还包括:
    发送单元,用于向所述第一采集设备和/或所述第二采集设备发送所述目标视频。
  14. 根据权利要求9至13中任一项所述的数据处理装置,其特征在于,所述相对位姿包括第一相对位姿与第二相对位姿,所述第一相对位姿为所述第一采集时刻的所述第一采集设备相对于所述第二采集时刻的所述第二采集设备的位姿,所述第二相对位姿为所述第二采集时刻的所述第二采集设备相对于所述第一采集时刻的所述第一采集设备的位姿;
    所述生成单元包括:
    光流子单元,用于将所述第一图像以及所述第二图像输入训练好的光流计算网络进行光流计算,得到初始光流图;
    第一扭曲单元,用于通过前置图像扭曲forward warping方法处理所述第一图像以及所述初始光流图得到第一目标光流图像;
    所述第一扭曲单元,还用于通过forward warping方法处理所述第二图像以及所述初始光流图得到第二目标光流图像;
    第二扭曲单元,用于通过图片图像扭曲image warping方法处理所述第一图像以及所述第一相对位姿得到第一扭曲图像;
    所述第二扭曲单元,用于通过image warping方法处理所述第二图像以及所述第二相对位姿得到第二扭曲图像;
    修复子单元,用于将所述第一目标光流图像、所述第一扭曲图像、所述第二目标光流图像、所述第二扭曲图像输入训练好的图像修复网络进行图像修复,得到所述第三图像。
  15. 根据权利要求14所述的数据处理装置,其特征在于,所述训练好的光流计算网络以及所述训练好的图像修复网络是通过以第一训练图像以及第二训练图像作为所述光流计算网络的输入,以损失函数的值小于第二阈值为目标对光流计算网络以及图像修复网络进行联合训练得到;
    所述损失函数用于指示图像修复网络输出的图像与第三目标图像之间的差异,所述第三目标图像为在第一目标图像对应第一目标视角和第二目标图像对应第二目标视角之间的视角下采集的图像。
  16. 根据权利要求9至15中任一项所述的数据处理装置,其特征在于,所述相对位姿的表达形式为变换矩阵,所述变换矩阵用于描述所述第一图像与所述第二图像之间像素点的关联关系。
  17. 一种数据处理装置,其特征在于,包括:处理器,所述处理器与存储器耦合,所述存储器用于存储程序或指令,当所述程序或指令被所述处理器执行时,使得所述数据处理装置执行如权利要求1至8中任一项所述的方法。
  18. 一种计算机可读存储介质,其特征在于,所述计算机可读存储介质中存储有指令,所述指令在计算机上执行时,使得所述计算机执行如权利要求1至8中任一项所述的方法。
PCT/CN2021/095141 2020-10-23 2021-05-21 一种数据处理方法及相关设备 WO2022083118A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011148726.1 2020-10-23
CN202011148726.1A CN114511596A (zh) 2020-10-23 2020-10-23 一种数据处理方法及相关设备

Publications (1)

Publication Number Publication Date
WO2022083118A1 true WO2022083118A1 (zh) 2022-04-28

Family

ID=81291489

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/095141 WO2022083118A1 (zh) 2020-10-23 2021-05-21 一种数据处理方法及相关设备

Country Status (2)

Country Link
CN (1) CN114511596A (zh)
WO (1) WO2022083118A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115359097A (zh) * 2022-10-20 2022-11-18 湖北芯擎科技有限公司 稠密光流生成方法、装置、电子设备及可读存储介质

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10013763B1 (en) * 2015-09-28 2018-07-03 Amazon Technologies, Inc. Increasing field of view using multiple devices
CN107317998A (zh) * 2016-04-27 2017-11-03 成都理想境界科技有限公司 全景视频图像融合方法及装置
CN109584340A (zh) * 2018-12-11 2019-04-05 苏州中科广视文化科技有限公司 基于深度卷积神经网络的新视角合成方法
US20200279398A1 (en) * 2019-02-28 2020-09-03 Stats Llc System and method for calibrating moving camera capturing broadcast video
CN110798673A (zh) * 2019-11-13 2020-02-14 南京大学 基于深度卷积神经网络的自由视点视频生成及交互方法
CN111275750A (zh) * 2020-01-19 2020-06-12 武汉大学 基于多传感器融合的室内空间全景图像生成方法

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115293971A (zh) * 2022-09-16 2022-11-04 荣耀终端有限公司 图像拼接方法及装置
CN115293971B (zh) * 2022-09-16 2023-02-28 荣耀终端有限公司 图像拼接方法及装置

Also Published As

Publication number Publication date
CN114511596A (zh) 2022-05-17

Similar Documents

Publication Publication Date Title
US20220076000A1 (en) Image Processing Method And Apparatus
WO2020192483A1 (zh) 图像显示方法和设备
WO2019128508A1 (zh) 图像处理方法、装置、存储介质及电子设备
WO2019007258A1 (zh) 相机姿态信息的确定方法、装置、设备及存储介质
US20200302154A1 (en) Image processing method, apparatus, storage medium, and electronic device
WO2019134516A1 (zh) 全景图像生成方法、装置、存储介质及电子设备
WO2019238114A1 (zh) 动态模型三维重建方法、装置、设备和存储介质
KR20210111833A (ko) 타겟의 위치들을 취득하기 위한 방법 및 장치와, 컴퓨터 디바이스 및 저장 매체
CN108712603B (zh) 一种图像处理方法及移动终端
WO2022083118A1 (zh) 一种数据处理方法及相关设备
CN107948505B (zh) 一种全景拍摄方法及移动终端
CN113205560B (zh) 多深度相机的标定方法、装置、设备及存储介质
CN114339054B (zh) 拍照模式的生成方法、装置和计算机可读存储介质
CN108776822B (zh) 目标区域检测方法、装置、终端及存储介质
CN112927362A (zh) 地图重建方法及装置、计算机可读介质和电子设备
WO2021147921A1 (zh) 图像处理方法、电子设备及计算机可读存储介质
WO2022052782A1 (zh) 图像的处理方法及相关设备
WO2022100419A1 (zh) 一种图像处理方法及相关设备
CN112308977B (zh) 视频处理方法、视频处理装置和存储介质
WO2022165722A1 (zh) 单目深度估计方法、装置及设备
WO2023151511A1 (zh) 模型训练方法、图像去摩尔纹方法、装置及电子设备
CN110807769B (zh) 图像显示控制方法及装置
CN113284055A (zh) 一种图像处理的方法以及装置
CN110086998B (zh) 一种拍摄方法及终端
CN110135329B (zh) 从视频中提取姿势的方法、装置、设备及存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21881542

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21881542

Country of ref document: EP

Kind code of ref document: A1