CN116823691A - Light field image processing method and device - Google Patents

Light field image processing method and device

Info

Publication number
CN116823691A
CN116823691A (Application No. CN202310796486.3A)
Authority
CN
China
Prior art keywords
image
target
processed
pixels
boundary
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310796486.3A
Other languages
Chinese (zh)
Inventor
李治富
李文宇
苗京花
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BOE Technology Group Co Ltd
Beijing BOE Display Technology Co Ltd
Beijing BOE Technology Development Co Ltd
Original Assignee
BOE Technology Group Co Ltd
Beijing BOE Display Technology Co Ltd
Beijing BOE Technology Development Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by BOE Technology Group Co Ltd, Beijing BOE Display Technology Co Ltd, Beijing BOE Technology Development Co Ltd filed Critical BOE Technology Group Co Ltd
Priority to CN202310796486.3A priority Critical patent/CN116823691A/en
Publication of CN116823691A publication Critical patent/CN116823691A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00Image enhancement or restoration
    • G06T5/50Image enhancement or restoration using two or more images, e.g. averaging or subtraction
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T3/00Geometric image transformations in the plane of the image
    • G06T3/40Scaling of whole images or parts thereof, e.g. expanding or contracting
    • G06T3/403Edge-driven scaling; Edge-based scaling
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/10Image acquisition modality
    • G06T2207/10052Images from lightfield camera
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06TIMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T2207/00Indexing scheme for image analysis or image enhancement
    • G06T2207/20Special algorithmic details
    • G06T2207/20212Image combination
    • G06T2207/20221Image fusion; Image merging

Landscapes

  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Image Processing (AREA)

Abstract

The disclosure relates to the technical field of image processing, and in particular provides a light field image processing method and device. The light field image processing method comprises: acquiring a plurality of images to be processed respectively captured by a plurality of cameras arranged on an acquisition device; for each image to be processed, determining a boundary range of a target object on the image to be processed; cropping each image to be processed according to the boundary range and a preset image scale to obtain a target image corresponding to each image to be processed; and performing viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image. In the embodiments of the disclosure, the images to be processed are cropped based on the boundary range of the target object, which reduces the risk of the target object lying at the image edge or even being cut off, improves the quality of the cropped images, provides a better data basis for the subsequent viewpoint fusion processing, and thereby improves the quality and efficiency of light field video communication.

Description

Light field image processing method and device
Technical Field
The disclosure relates to the technical field of image processing, in particular to a light field image processing method and device.
Background
A Light Field records light data of higher dimensionality, and therefore yields three-dimensional information with higher precision than traditional two-dimensional imaging and than traditional three-dimensional imaging represented by binocular stereoscopic vision; light field video can accurately perceive a dynamic environment, giving the user an immersive viewing experience.
With the development of Machine Learning technology, performing three-dimensional reconstruction on input multi-view images with a Multi-view Stereo matching network (MVS Net) based on a deep neural network provides good precision and efficiency.
However, in a real-time light field video scene the position of the person is not fixed, so the conventional scheme for cropping the multi-view input images of the MVS Net can hardly guarantee a good result for the person, and the network output quality suffers.
Disclosure of Invention
In order to improve the cropping of multi-viewpoint images in a light field video scene, and thereby improve the processing effect and efficiency of light field video, embodiments of the present disclosure provide a light field image processing method, an apparatus, an electronic device, a video communication system and a storage medium.
In a first aspect, embodiments of the present disclosure provide a light field image processing method, including:
Acquiring a plurality of images to be processed respectively acquired by a plurality of cameras arranged on acquisition equipment; the plurality of images to be processed are acquired images with different visual angles of the target object;
for each image to be processed, determining the boundary range of the target object on the image to be processed;
cutting the images to be processed according to the boundary range and the preset image scale to obtain target images corresponding to each image to be processed;
performing viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information; the target viewpoint information represents position information of eyes of an observer at the display device side.
In some embodiments, the determining, for each image to be processed, a boundary range of the target object on the image to be processed includes:
for each image to be processed, carrying out binarization processing on the image to be processed to obtain a binary image of the target object;
based on the pixel values on the binary image, sequentially carrying out boundary search on the target object row by row and column by column to obtain a horizontal boundary and a vertical boundary of the target object on the binary image;
determining the boundary range based on the horizontal boundary and the vertical boundary.
In some embodiments, the performing, based on the pixel values on the binary image, the boundary search on the target object sequentially row by row and column by column, to obtain a horizontal boundary and a vertical boundary of the target object on the binary image, includes at least one of the following:
detecting the number of black pixels of the column of pixels from left to right in sequence on the basis of the pixel values on the binary image, and determining coordinate information corresponding to the first column of pixels as a left boundary of the horizontal boundary in response to the number of black pixels of the first column of pixels and the following continuous preset number of columns of pixels being larger than a first preset threshold;
detecting the number of black pixels of the column of pixels from right to left sequentially column by column, and determining coordinate information corresponding to the second column of pixels as a right boundary of the horizontal boundary in response to the number of black pixels of the second column of pixels and the following continuous preset number of columns of pixels being larger than a second preset threshold;
detecting the number of black pixels of the row of pixels line by line in sequence from top to bottom, and determining coordinate information corresponding to the first row of pixels as an upper boundary of the vertical boundary in response to the number of black pixels of the first row of pixels and the continuous preset number of rows of pixels behind the first row of pixels being larger than a third preset threshold;
detecting the number of black pixels of the row line by line in sequence from bottom to top, and determining coordinate information corresponding to the pixels of the second row as the lower boundary of the vertical boundary in response to the number of the black pixels of the second row and the continuous preset number of the pixels of the row after the second row being larger than a fourth preset threshold.
In some embodiments, for each image to be processed, performing binarization processing on the image to be processed to obtain a binary image of the target object, where the binarization processing includes:
carrying out matting processing on each image to be processed to obtain a foreground image which corresponds to each image to be processed and comprises the target object;
and carrying out binarization processing on each foreground image to obtain a binary image of the target object.
In some embodiments, before the boundary searching is sequentially performed on the target object row by row and column by column based on the pixel values on the binary image, to obtain a horizontal boundary and a vertical boundary of the target object on the binary image, the method further includes:
searching on the binary image according to a preset step length by using a sliding window with a preset scale based on the pixel value on the binary image;
in each sliding window, denoising the pixels included in the sliding window based on the sum of pixel values of the pixels included in the sliding window.
In some embodiments, the cutting the to-be-processed image according to the boundary range and the preset image scale to obtain a target image corresponding to each to-be-processed image includes:
determining the center point coordinates of the target object according to the boundary range;
and determining the center point coordinate of the target object as the center point coordinate of the target image, and cutting the image to be processed according to the preset image scale to obtain the target image.
In some embodiments, performing viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information, including:
inputting at least two images to be processed in the plurality of images to be processed into a pre-trained depth network model to obtain a depth map of the target object output by the depth network model;
performing viewpoint fusion processing on the depth map based on the target viewpoint information to obtain a target viewpoint depth map under a viewpoint corresponding to the target viewpoint information;
and inputting the target image, the target viewpoint depth map and the target viewpoint information into a pre-trained viewpoint fusion model to obtain the target light field image output by the viewpoint fusion model.
In some embodiments, the method is applied to the acquisition device; after performing viewpoint fusion processing on each target image according to the target viewpoint information to obtain the target light field image corresponding to the target viewpoint information, the method further comprises:
and sending the target light field image to the display device so that the display device renders and displays the target light field image.
In a second aspect, the present disclosure provides a light field image processing apparatus comprising:
an image acquisition module configured to acquire a plurality of images to be processed respectively acquired by a plurality of cameras provided on an acquisition device; the plurality of images to be processed are acquired images with different visual angles of the target object;
the boundary searching module is configured to determine the boundary range of the target object on each image to be processed;
the cutting processing module is configured to cut the image to be processed according to the boundary range and a preset image scale to obtain a target image corresponding to each image to be processed;
the viewpoint fusion module is configured to perform viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information; the target viewpoint information represents position information of eyes of an observer at the display device side.
In some embodiments, the boundary search module is configured to:
for each image to be processed, carrying out binarization processing on the image to be processed to obtain a binary image of the target object;
based on the pixel values on the binary image, sequentially carrying out boundary search on the target object row by row and column by column to obtain a horizontal boundary and a vertical boundary of the target object on the binary image;
the boundary range is determined based on the horizontal boundary and the vertical boundary.
In some embodiments, the boundary search module is configured to:
detecting the number of black pixels of the column of pixels from left to right in sequence on the basis of the pixel values on the binary image, and determining coordinate information corresponding to the first column of pixels as a left boundary of the horizontal boundary in response to the number of black pixels of the first column of pixels and the following continuous preset number of columns of pixels being larger than a first preset threshold;
detecting the number of black pixels of the column of pixels from right to left sequentially column by column, and determining coordinate information corresponding to the second column of pixels as a right boundary of the horizontal boundary in response to the number of black pixels of the second column of pixels and the following continuous preset number of columns of pixels being larger than a second preset threshold;
detecting the number of black pixels of the row of pixels line by line in sequence from top to bottom, and determining coordinate information corresponding to the first row of pixels as an upper boundary of the vertical boundary in response to the number of black pixels of the first row of pixels and the continuous preset number of rows of pixels behind the first row of pixels being larger than a third preset threshold;
detecting the number of black pixels of the row line by line in sequence from bottom to top, and determining coordinate information corresponding to the pixels of the second row as the lower boundary of the vertical boundary in response to the number of the black pixels of the second row and the continuous preset number of the pixels of the row after the second row being larger than a fourth preset threshold.
In some embodiments, the boundary search module is configured to:
carrying out matting processing on each image to be processed to obtain a foreground image which corresponds to each image to be processed and comprises the target object;
and carrying out binarization processing on each foreground image to obtain a binary image of the target object.
In some embodiments, the boundary search module is configured to:
searching on the binary image according to a preset step length by using a sliding window with a preset scale based on the pixel value on the binary image;
in each sliding window, denoising the pixels included in the sliding window based on the sum of pixel values of the pixels included in the sliding window.
In some embodiments, the trimming processing module is configured to:
determining the center point coordinates of the target object according to the boundary range;
and determining the center point coordinate of the target object as the center point coordinate of the target image, and cutting the image to be processed according to the preset image scale to obtain the target image.
In some embodiments, the view fusion module is configured to:
inputting at least two images to be processed in the plurality of images to be processed into a pre-trained depth network model to obtain a depth map of the target object output by the depth network model;
performing viewpoint fusion processing on the depth map based on the target viewpoint information to obtain a target viewpoint depth map under a viewpoint corresponding to the target viewpoint information;
and inputting the target image, the target viewpoint depth map and the target viewpoint information into a pre-trained viewpoint fusion model to obtain the target light field image output by the viewpoint fusion model.
In some embodiments, the apparatus of the present disclosure is applied to the acquisition device and further comprises a sending module configured to:
send the target light field image to the display device so that the display device renders and displays the target light field image.
In a third aspect, embodiments of the present disclosure provide an electronic device, including:
a processor; and
a memory storing computer instructions for causing the processor to perform the method according to any implementation of the first aspect.
In a fourth aspect, embodiments of the present disclosure provide a video communication system, including:
the display device comprises an image acquisition device and a first controller;
an acquisition device comprising a plurality of cameras and a second controller, at least one of the first controller and the second controller being for performing the method according to any embodiment of the first aspect.
In a fifth aspect, the disclosure provides a storage medium storing computer instructions for causing a computer to perform the method according to any of the embodiments of the first aspect.
The light field image processing method of the present disclosure comprises: acquiring a plurality of images to be processed respectively captured by a plurality of cameras arranged on an acquisition device; for each image to be processed, determining the boundary range of the target object on the image to be processed; cropping each image to be processed according to the boundary range and the preset image scale to obtain a target image corresponding to each image to be processed; and performing viewpoint fusion processing on each target image according to the target viewpoint information to obtain the target light field image. In the embodiments of the present disclosure, the images to be processed are cropped based on the boundary range of the target object, which reduces the risk of the target object lying at the image edge or even being cut off, improves the quality of the cropped images, provides a better data basis for the subsequent viewpoint fusion processing, and thereby improves the quality and efficiency of light field video communication.
Drawings
In order to more clearly illustrate the embodiments of the present disclosure or the prior art, the drawings required in the detailed description are briefly introduced below. It is apparent that the drawings in the following description show only some embodiments of the present disclosure, and that other drawings may be obtained from them by a person of ordinary skill in the art without inventive effort.
Fig. 1 is an architecture diagram of a video communication system in accordance with some embodiments of the present disclosure.
Fig. 2 is a schematic structural diagram of an electronic device according to some embodiments of the present disclosure.
Fig. 3 is a flow chart of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 4 is a schematic diagram of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 5 is a flow chart of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 6 is a flow chart of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 7 is a schematic diagram of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 8 is a schematic diagram of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 9 is a schematic diagram of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 10 is a flow chart of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 11 is a comparison of effects of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 12 is a flow chart of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 13 is a schematic diagram of a light field image processing method in accordance with some embodiments of the present disclosure.
Fig. 14 is a block diagram of a light field image processing apparatus in accordance with some embodiments of the present disclosure.
Fig. 15 is a block diagram of an electronic device in accordance with some embodiments of the present disclosure.
Detailed Description
The following describes the embodiments of the present disclosure clearly and completely with reference to the accompanying drawings. It is evident that the described embodiments are some, but not all, of the embodiments of the present disclosure. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present disclosure without inventive effort fall within the scope of the present disclosure. In addition, the technical features involved in the different embodiments described below may be combined with each other as long as they do not conflict with each other.
A Light Field is defined as the amount of light passing through every point in every direction. A light field image records light ray data of a higher dimension than a conventional two-dimensional image, and can therefore present three-dimensional information of higher accuracy than conventional two-dimensional imaging and conventional three-dimensional imaging represented by binocular stereoscopic vision.
Light field video can accurately perceive a dynamic environment. Combined with eye tracking technology, when the viewpoint of the user changes, the video picture follows the viewpoint change in real time, presenting the user with an immersive naked-eye 3D viewing experience of the scene.
Data acquisition for light field video requires a camera array, which comprises several or even tens of cameras with different viewing angles; each camera is responsible for acquiring an image at one viewing angle, and the acquired image data are then view-fused using a Multi-view Stereo matching and reconstruction (MVS) algorithm in combination with a new viewpoint to obtain a light field image at the new viewpoint.
With the development of Machine Learning technology, three-dimensional reconstruction of input multi-view images with a multi-view stereo matching model (MVS Net) based on a deep neural network has greatly improved the operating efficiency and the presentation effect of real-time light field video.
The input to the MVS Net is the multi-view images captured by the camera array. In the related art, in order to fit the input image scale of the model on the one hand, and to remove part of the irrelevant background information and improve the computational efficiency and effect of the model on the other hand, the relatively high-resolution image captured by the camera array (for example, 4096 pixels by 3000 pixels) often needs to be cropped into a small-scale input image (for example, 2560 pixels by 2560 pixels) matching the model input before being fed into the MVS Net.
In the related art, the cropping of the input image is generally performed at a preset fixed position and scale. In practice, however, in a real-time light field video call scene both communication parties mainly focus on the human figure of the other party, but the position of the person on the actually captured image is not necessarily located at the center of the cropping range; the cropping algorithm may therefore leave the person at the edge of the input image or even cut off part of the person's body, which makes model matching reconstruction or network training more difficult and degrades the quality of the finally output light field image.
Based on the defects of the related art, the embodiment of the disclosure provides a light field image processing method, a device, an electronic device, a video communication system and a storage medium, which aim to improve the cutting effect on multi-viewpoint images in a light field video scene and further improve the processing effect and efficiency of the light field video.
Fig. 1 shows an architecture diagram of a video communication system in some embodiments of the present disclosure, and an application scenario of the embodiments of the present disclosure is described below with reference to fig. 1.
As shown in fig. 1, in some embodiments, the video communication system includes a capture device 100 and a display device 200, the capture device 100 and the display device 200 establishing a communicable connection through a wired or wireless network.
In one exemplary unidirectional video communication scenario, the capture device 100 may capture image data of the scene in which user a is located and transmit the image data to the display device 200. The display device 200 obtains the current viewing point position of the user B by tracking the eye position of the user B, performs viewpoint image synthesis by combining the viewpoint position and the light field image data sent by the acquisition device 100, and renders and displays the synthesized light field image on the display device 200.
Or, the display device 200 tracks the eye position of the user B to obtain the current viewing point position of the user B, and sends the viewing point position to the acquisition device 100, so that the acquisition device 100 combines the viewing point position and the image data of the scene where the user a is located, which is acquired by itself, to perform view image synthesis to obtain a light field image, and sends the light field image to the display device 200, so that the display device 200 renders and displays according to the received light field image.
It will of course be understood that the above example only illustrates unidirectional video communication, but the present disclosure is not limited to unidirectional video communication scenes. For a bidirectional video communication scene, the display device 200 may also collect light field image data of the scene where user B is located and send it to the acquisition device 100; the acquisition device 100 may likewise track the eye position of user A to obtain the current viewing viewpoint position of user A, perform viewpoint image synthesis by combining that viewpoint position with the light field image data sent by the display device 200, and render and display the synthesized light field image on the acquisition device 100. This can be understood by those skilled in the art with reference to the foregoing and is not repeated in this disclosure.
Taking a two-way video communication scenario as an example, fig. 2 shows a schematic structural diagram of an electronic device in some embodiments of the present disclosure, where the electronic device may be either the acquisition device 100 or the display device 200, and the present disclosure is not limited thereto.
As shown in fig. 2, the electronic apparatus includes a display screen 110, camera arrays C1 to C4, and an image pickup device C5.
The display screen 110 is used to display the Light field image, and the display screen 110 may be any suitable screen component, such as an LCD (Liquid Crystal Display) screen, an OLED (Organic Light-Emitting Diode) screen, or the like, which is not limited in this disclosure.
The camera array includes a plurality of cameras, and the cameras are disposed in an array on the electronic device, for example, in the example of fig. 2, the camera array includes 4 cameras, respectively, from C1 to C4, and the 4 cameras are disposed on 4 opposite corners of the electronic device, respectively. Because the positions and shooting visual angles of the cameras are different, a plurality of scene images with different visual angles can be acquired by using the camera array. Of course, the number of cameras and the manner of deployment included in the camera array is not limited to the example of fig. 2, but may be any other suitable manner, which is not limited by the present disclosure.
The image capturing device C5 is a camera for implementing eye tracking of the user, which may be, for example, a high-precision RGB camera, that is, the image capturing device C5 determines current viewpoint information of the user, that is, position information representing eyes of the user, by capturing a current scene image and performing image detection on the scene image. In the example of fig. 2, the image pickup device C5 is provided at the upper center of the display screen 110 of the electronic apparatus, but it is understood that the present disclosure is not limited to the specific position of the image pickup device C5.
On the basis of the video communication systems shown in fig. 1 and 2 described above, a light field image processing method according to an embodiment of the present disclosure will be described below.
It should be noted that, for ease of understanding, in the following embodiments of the present disclosure, an unidirectional video communication scene will be taken as an example, that is, the acquisition device 100 is used as a light field data acquisition end, the display device 200 is used as a light field video display end, and the principle of the bidirectional video communication scene is exactly the same as that, which is not repeated in the present disclosure.
In addition, in the embodiments of the present disclosure, one or more steps of the light field image processing method described below may be performed by the acquisition device 100, by the display device 200, or interactively by both the acquisition device 100 and the display device 200. The executing entity of each method step is discussed in detail later in this disclosure; where the executing entity of a step is not explicitly stated below, this means that it is not limited.
As shown in fig. 3, in some embodiments, a light field image processing method of an example of the present disclosure includes:
S310, acquiring a plurality of images to be processed respectively acquired by a plurality of cameras arranged on the acquisition device.
In combination with the video communication scene, a plurality of cameras included in the camera array at the acquisition equipment end can acquire one acquisition image at the same time, and the acquisition image is the image to be processed in the disclosure. The image to be processed comprises a target object, namely a human body, and the plurality of images to be processed acquired by the camera array are images comprising different visual angles of the target object due to different positions and visual angles of the cameras.
Referring to fig. 1 and fig. 2, when a user a is located at the end of the acquisition device 100, the cameras C1 to C4 on the acquisition device 100 acquire one acquisition image including the user a at the same time, so that 4 acquisition images including different viewing angles of the user a can be obtained, and the 4 acquisition images are the 4 images to be processed in the disclosure.
S320, for each image to be processed, determining the boundary range of the target object on the image to be processed.
In combination with the foregoing, after the image to be processed is obtained, the image to be processed cannot be directly input into the MVS network model, but needs to be cut.
There are many reasons for the cropping of the image to be processed, including but not limited to the following:
1) For video communication scenes, the two communication parties mainly pay attention to the body of the other person and pay little attention to the image background. However, because the field of view of the camera is large, the captured scene range is wide, so the proportion of the image to be processed occupied by the person is not high, and the person region needs to be cropped out to highlight the person.
2) More background information on the image to be processed can cause interference to the calculation and training of the MVS network, and extra calculation cost and difficulty are increased, so that the view point fusion effect is poor. Therefore, it is desirable to crop irrelevant background information to reduce computational overhead and difficulty.
3) The MVS network model has requirements on the size of the input image, which is generally a square image; for example, the input image scale of a typical MVS network is 2560 pixels by 2560 pixels. The image acquired by the camera has a larger scale and is rarely square; for example, the scale of the image acquired by the camera is 4096 pixels by 3000 pixels. Therefore, the acquired image needs to be cropped into an image that conforms to the input image size of the network model.
Conventional cropping schemes typically crop the image to be processed into an image that conforms to the size of the MVS input image based on a fixed location. For example, in the example of fig. 4, as shown in (a) of fig. 4, the outer solid line frame is the scale of the image to be processed, and the inner dashed line frame is the preset cutting position and range for cutting the image to be processed.
For scenes such as light field video communication, the person is not necessarily located at the center of the image to be processed. For example, as shown in fig. 4 (b), the person may appear at a position other than the center of the image to be processed; if cropping is performed at a fixed position, the person is shifted to the edge of the image, and the body of the person may even be cropped off. This in turn makes the MVS matching reconstruction or network training process more difficult, resulting in a poor final output light field image.
Therefore, in the embodiment of the present disclosure, after each image to be processed is obtained, the image to be processed is not cut based on the fixed position, but the boundary range of the target object on the image to be processed is first determined, and then the image to be processed is cut based on the boundary range of the target object.
The boundary range of the target object refers to an image area formed by surrounding each boundary of the target object on the image to be processed. For example, in the scene shown in fig. 4, the target object on the image to be processed is a human body, the range of the image to be processed occupied by the human body can be represented by a rectangular frame, the left side edge of the rectangular frame is the left boundary of the target object, the right side edge of the rectangular frame is the right boundary of the target object, the upper side edge is the upper boundary of the target object, and the lower side edge is the lower boundary of the target object, so that a rectangular frame area surrounded by the upper, lower, left and right boundaries is the boundary range of the target object.
In some embodiments of the present disclosure, the boundary range of the target object on the image to be processed may be obtained by performing boundary search on the image to be processed. For a specific procedure of boundary search, the following embodiments of the present disclosure describe this.
In some embodiments of the present disclosure, before performing boundary search on an image to be processed, processing such as foreground-background segmentation and binarization may be performed on the image to be processed, so as to improve accuracy and effect of boundary search on a target object. It will of course be appreciated that these processes are optional and not required, and are described below in this disclosure and not described in detail herein.
S330, cutting the images to be processed according to the boundary range and the preset image scale to obtain target images corresponding to each image to be processed.
In the embodiment of the disclosure, after each image to be processed is processed to obtain the boundary range of the target object on each image to be processed, the image to be processed can be cut according to the boundary range of the target object.
The preset image scale is the scale of the target image obtained after the image to be processed is cut. Taking the MVS network for three-dimensional stereo matching reconstruction as an example, the preset image scale may be set according to the input image scale of the MVS network, for example, in one example, the input image scale of the MVS network is required to be 2560 pixels by 2560 pixels, and then the preset image scale may be set to be 2560 pixels by 2560 pixels.
Taking an image to be processed as an example, after determining the boundary range of the target object on the image to be processed, the image to be processed needs to be cut into an image with a preset image scale according to the boundary range, so that the target image after the cutting of the image to be processed is obtained.
In some embodiments, the boundary range of the target object on the image to be processed can be determined, the center point coordinate of the target object is determined, then the expansion is performed based on the center point coordinate, the image range after the expansion is the preset image scale size, and then the cutting processing is performed on the image to be processed based on the image range after the expansion, so that the target image corresponding to the image to be processed can be obtained. This process is described in the embodiments of the present disclosure below.
And cutting each image to be processed through the process, so that a target image corresponding to each image to be processed can be obtained, and the size of the target image is the preset image size.
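As an illustration of this center-based cropping step, the following is a minimal Python/NumPy sketch; the square output size of 2560 pixels and the clamping of the crop window to the image extent are assumptions made for the example, not requirements stated in the disclosure.

```python
import numpy as np

def crop_around_object(image: np.ndarray, bounds: tuple, out_size: int = 2560) -> np.ndarray:
    """Crop a square target image of side `out_size` centered on the bounding
    box (left, right, top, bottom) of the target object. Shifting the crop
    window back inside the image is an assumed detail, not specified above."""
    left, right, top, bottom = bounds
    cx = (left + right) // 2              # center point of the target object
    cy = (top + bottom) // 2
    half = out_size // 2
    h, w = image.shape[:2]
    # clamp so the expanded window stays within the captured image
    x0 = min(max(cx - half, 0), max(w - out_size, 0))
    y0 = min(max(cy - half, 0), max(h - out_size, 0))
    return image[y0:y0 + out_size, x0:x0 + out_size]
```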
S340, performing viewpoint fusion processing on each target image according to the target viewpoint information to obtain a target light field image corresponding to the target viewpoint information.
As can be understood from the video communication scene shown in fig. 1, the images to be processed are the scene images of user A acquired by the acquisition device 100, while the final target light field image is a three-dimensional image to be rendered and displayed on the display device 200; the target light field image must therefore be generated based on the viewpoint information of user B. That is, the target light field image finally presented at the display device 200 end needs to be fused so as to follow the viewpoint information of user B, and the display device 200 therefore needs to acquire the viewpoint information of user B.
In the embodiment of the present disclosure, the target viewpoint information, that is, the position information indicating the eyes of the observer at the side of the display apparatus 200, may reflect the viewpoint position currently viewed by the observer at the side of the display apparatus 200. The target viewpoint information is acquired by an image acquisition device C5 at the end of the display device 200 and obtained by using an eyeball tracking algorithm. It can be appreciated that, for the process of determining the target viewpoint information on the display device 200 side, those skilled in the art can refer to the related art and combine with the conventional eye tracking algorithm, which will not be described in detail in this disclosure.
After the target viewpoint information and each cut target image are determined, viewpoint fusion processing can be carried out on each target image by utilizing the target viewpoint information, and finally, a target light field image corresponding to the target viewpoint information can be obtained.
In some embodiments, the viewpoint fusion processing of the target images may be implemented using an MVS network model based on Deep Neural Networks (DNN): the MVS network is trained in advance to obtain an MVS network model with good convergence, and then the target viewpoint information and each target image are taken as input of the MVS network model, which performs viewpoint fusion processing on these inputs and predicts and outputs the target light field image. The structure and principles of the MVS network are described in the following embodiments and are not detailed here.
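For orientation only, the following sketch shows how the three-stage flow described above (depth network, re-projection to the target viewpoint, viewpoint fusion model) might be wired together; `depth_model`, `warp_depth` and `fusion_model` are hypothetical callables standing in for the pre-trained networks and the warping step, not interfaces defined by this disclosure.

```python
def synthesize_light_field(captured_views, target_images, target_viewpoint,
                           depth_model, warp_depth, fusion_model):
    """Hypothetical pipeline sketch: the three callables are placeholders for
    the pre-trained depth network, a depth re-projection step, and the
    pre-trained viewpoint fusion model mentioned in the disclosure."""
    depth_map = depth_model(captured_views[:2])              # depth map of the target object from at least two views
    target_depth = warp_depth(depth_map, target_viewpoint)   # depth map under the target viewpoint
    return fusion_model(target_images, target_depth, target_viewpoint)
```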
After the target light field image is obtained, the display device 200 can render and display the target light field image, and it can be understood that, because the target light field image is an image generated based on the current target viewpoint information of the observer at the display device 200 end, the naked eye 3D effect which follows according to the viewpoint change of the observer can be realized, so that the observer at the display device 200 end has an immersive video experience.
It should be noted that, in the whole method process, only the process of capturing the image to be processed must be performed by the capturing device 100, the process of rendering and displaying the target light field image must be performed by the display device 200, and the other method processes may be performed by the capturing device 100, the display device 200, or a third party device (such as a server) that establishes a communication connection with the capturing device 100 and the display device 200 together, which is not limited in this disclosure.
For example, in one exemplary scenario, the foregoing method processes S310-S340 are each performed by the acquisition device 100, after obtaining the target light field image, the acquisition device 100 sends the target light field image to the display device 200, and the display device 200 renders the target light field image.
For example, in another exemplary scenario, only S310 of the foregoing method, i.e. capturing the images to be processed, is performed by the acquisition device 100; the acquisition device 100 then transmits the images to be processed to the display device 200, and the remaining steps are performed by the display device 200, which finally renders and displays the target light field image.
For example, in still another exemplary scenario, only S310 is performed by the acquisition device 100; the acquisition device 100 then transmits the images to be processed to a third-party server, which performs the remaining steps and sends the resulting target light field image to the display device 200, and the display device 200 renders and displays the target light field image.
Of course, those skilled in the art will appreciate that the implementation of the execution body is not limited to the above examples, and this disclosure will not be repeated.
It can be appreciated that in the embodiment of the disclosure, the clipping processing of the image to be processed is not performed based on a fixed position, but is performed according to the boundary range of the target object, so that the target object on the clipped target image is always located at the center position, and the risk of placing the target object at the image edge and even clipping off is reduced.
Furthermore, since the target object occupies a larger proportion of the target image, the effect and precision of the subsequent MVS network feature extraction are improved, which benefits the light field image produced by network training and prediction. In addition, since the target object is located at the center of the target image, as much of the scene information surrounding the target object as possible can be included, and combining more contextual feature information further improves the accuracy and effect of the MVS network.
As can be seen from the foregoing, in the embodiment of the present disclosure, the image to be processed is cut based on the boundary range of the target object, so that the risk of placing the target object at the edge of the image and even cutting off the target object can be reduced, the quality of the cut image is improved, a better data base is provided for the subsequent viewpoint fusion processing, and the quality and efficiency of the light field video communication are further improved.
In the following embodiments of the present disclosure, a light field image processing method according to the embodiments of the present disclosure will be further described by taking a video communication system shown in fig. 1 and 2 as an example.
As shown in fig. 5, in some embodiments, the light field image processing method illustrated in the disclosure, a process of determining a boundary range of a target object on each image to be processed includes:
S510, carrying out binarization processing on each image to be processed to obtain a binary image of the target object.
It can be understood that, in the embodiment of the present disclosure, the boundary range of the target object refers to an Image area surrounded by each boundary of the target object on the Image to be processed, in other words, in the video communication scene of the example of the present disclosure, the boundary of the target object is mainly focused, so that the original Image to be processed can be converted into a Binary Image (Binary Image).
A binary image is defined as an image in which each pixel has only two possible values, generally denoted by the pixel values 0 and 1, where 0 represents a black pixel and 1 represents a white pixel; that is, the image contains only black and white pixels. In the embodiments of the present disclosure, a binary image can be used to separate the pixels of the target object from those of non-target objects; for example, in one example, a pixel belonging to the target object is a black pixel (pixel value 0), and the other pixels are white pixels (pixel value 1).
In addition, it can be understood that, for a real video communication scene, the image to be processed acquired by the acquisition device 100 may include not only a foreground object but also a background area, and in some embodiments of the present disclosure, before binarizing the image to be processed, the image to be processed may be further processed by matting, where the background area is removed and only the foreground object is retained, which will be described below with reference to fig. 6.
As shown in fig. 6, in some embodiments, a light field image processing method of an example of the present disclosure, a process of performing binarization processing on an image to be processed, includes:
s511, carrying out matting processing on each image to be processed to obtain a foreground image which corresponds to each image to be processed and comprises the target object.
S512, binarizing each foreground image to obtain a binary image of the target object.
In the embodiment of the present disclosure, after a plurality of images to be processed are obtained by the acquisition device 100, image segmentation may be performed on each image to be processed, so as to implement matting of a foreground target object and a background.
It will be appreciated that the purpose of the matting process is to separate the target object from the background. For example, for the video communication scene shown in fig. 1, matting the image to be processed means separating the foreground person (user A) from the background, so that the obtained foreground image contains only the person, and the background area can be filled with a single pixel value.
In the related art, there are many matting algorithms, and any matting algorithm can be adopted by a person skilled in the art. For example, in one example, a DNN-based image segmentation model may be used, and each image to be processed is input into a pre-trained image segmentation model, so as to obtain a foreground image output by the model.
In other embodiments, as well known in connection with video communication scenarios, for example, video conferencing, large screen electronic devices used to implement video conferencing tend to be relatively stationary, so that the background portion of the scene image acquired by the acquisition device 100 will hardly change during video communication, and typically only foreground persons or objects will move.
Accordingly, in the embodiment of the present disclosure, the capturing apparatus 100 may capture and save a background image excluding the target object in advance, for example, the capturing apparatus 100 may capture and save a background image at the time of power-on. Then, when the image to be processed is segmented, based on the difference between the image to be processed and the pre-stored background image, quick image matting of the image to be processed is achieved.
For example, in one example, after a certain image to be processed is matted, the obtained foreground image may be as shown in fig. 7 (a). It should be noted that, since color is not reproduced in the drawings of the specification, fig. 7 (a) is shown as a grayscale image, although the actual foreground image may itself be a color (RGB) image; this is not repeated in the disclosure.
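A minimal sketch of such background-difference matting is given below, assuming a pre-stored background frame of the same size as the captured frame; the difference threshold and the choice of filling the background with black are illustrative assumptions, not values given in the disclosure.

```python
import numpy as np

def extract_foreground(frame: np.ndarray, background: np.ndarray,
                       diff_threshold: int = 30) -> np.ndarray:
    """Rough foreground extraction by differencing against a pre-stored
    background image; `diff_threshold` is an assumed tuning value."""
    diff = np.abs(frame.astype(np.int32) - background.astype(np.int32)).sum(axis=2)
    mask = diff > diff_threshold          # True where the scene differs from the stored background
    foreground = np.zeros_like(frame)     # background area filled with a single pixel value (black here)
    foreground[mask] = frame[mask]
    return foreground
```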
In the embodiment of the disclosure, after the foreground image corresponding to each image to be processed is obtained, binarization processing can be performed on each foreground image to obtain a binary image corresponding to each image to be processed.
Taking the foreground image shown in fig. 7 (a) as an example, in some embodiments, the foreground image may be first converted into a gray image, where the purpose of the conversion into the gray image is to remove the color of the foreground image, so as to reduce the calculation amount. The process of converting a gray image is expressed as:
Gray = 0.302*R + 0.566*G + 0.132*B    (1)
In formula (1), Gray represents the gray value of each pixel on the gray image, and R, G and B represent the RGB values of each pixel on the foreground image. Each pixel on the foreground image can be converted into a gray value by formula (1), resulting in a gray image as shown, for example, in fig. 7 (b).
After the gray level image corresponding to the foreground image is obtained, the gray level image can be converted into a binary image. For example, in one example, a gray threshold may be preset, and then the gray value of each pixel on the gray image is compared with the gray threshold, and if the gray value of a certain pixel is greater than or equal to the gray threshold, the pixel value of the certain pixel may be set to 1, that is, white; if the gray value of a certain pixel is smaller than the gray threshold, the pixel value of the pixel may be set to 0, i.e. black. The specific value of the gray threshold may be selected according to practical situations, which is not limited in this disclosure.
Thus, after traversing all pixels of the entire gray image, a binary image containing only 0 and 1 pixels can be obtained. For example, after converting the gray image shown in fig. 7 (b) into a binary image, the binary image shown in fig. 7 (c) is obtained; it can be seen that the binary image contains only black and white pixels, where the black pixels represent the target object and the white pixels represent non-target objects.
The above description is only directed to the binarization processing procedure of one of the images to be processed, and the binarization map of each image to be processed can be obtained by sequentially passing through the above procedures for a plurality of images to be processed acquired by the acquisition device 100.
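As a sketch of the conversion just described (formula (1) followed by thresholding), assuming the foreground image is an RGB array and using an illustrative gray threshold:

```python
import numpy as np

def binarize_foreground(foreground: np.ndarray, gray_threshold: float = 10.0) -> np.ndarray:
    """Gray conversion per formula (1), then thresholding into a 0/1 image.
    `gray_threshold` is an assumed value; the disclosure leaves it open.
    Which of the 0/1 regions ends up covering the target object depends on
    the fill value used for the background during matting."""
    r = foreground[..., 0].astype(np.float64)
    g = foreground[..., 1].astype(np.float64)
    b = foreground[..., 2].astype(np.float64)
    gray = 0.302 * r + 0.566 * g + 0.132 * b          # formula (1)
    return np.where(gray >= gray_threshold, 1, 0).astype(np.uint8)
```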
In some embodiments, after obtaining the binary image of each image to be processed, noise may appear on the image, for example, black noise may exist in a white region of the binary image, white noise may also exist in a black region, and these noise may interfere with subsequent boundary searches. Thus, in some embodiments, after the binary image is obtained, the binary image may be further denoised.
With continued reference to fig. 6, in some embodiments, the process of denoising the binary image according to the light field image processing method illustrated in the present disclosure includes:
s513, searching on the binary image according to a preset step length by using a sliding window with a preset scale based on the pixel value on the binary image.
S514, in each sliding window, denoising the pixels included in the sliding window based on the sum of the pixel values of the pixels included in the sliding window.
In the embodiment of the disclosure, a sliding window traversing binary image mode is adopted to denoise black and white noise points on the binary image. For example, a sliding window with a preset scale m×n may traverse the entire binary image according to a preset step d, and specific values of the preset scale m×n and the preset step d may be selected according to a denoising precision requirement.
In one example scenario, the sliding window has a scale of 2×2 and the preset step size is d=1. That is, as shown in fig. 8, the sliding window has a preset scale of 2×2 and slides over the binary image by 1 pixel at a time.
Taking one sliding as an example, the sliding window may frame 4 pixels on the binary image, that is, pixel a, pixel B, pixel C, and pixel D shown in fig. 8, and the pixel value of each pixel is 0 or 1.
Therefore, the sum S of the pixel values of pixels A to D in the sliding window can be calculated, and denoising is performed according to the sum of the pixel values following the relation of formula (2):

set all pixels in the window to 0, if S = 1 or S = 2; set the black pixel to 1, if S = 3; leave the window unchanged, if S = 0 or S = 4    (2)
referring to fig. 8 and formula (2), if the sum of the pixel values of the pixels a to D is equal to 0, it is explained that all the pixels a to D are black pixels, and the sliding window is located inside the target object, and at this time, none of the pixels a to D is a noise point. If the sum of the pixel values of pixels a-D is equal to 1 or 2, it is indicated that there are 1 or 2 white pixels in pixels a-D, and at this time, it is considered that there are white noise points, and all the pixel values in the sliding window are set to 0, that is, the white noise points are converted to black pixels. If the sum of the pixel values of the pixels a to D is equal to 3, it is indicated that there are 1 black pixel and 3 white pixels in the pixels a to D, and it is considered that there is a black noise, so that the pixel value of the black pixel is set to 1 and converted to white. If the sum of the pixel values of the pixels A-D is equal to 4, the pixels A-D are all white pixels, and the sliding window is positioned in the background area, and at the moment, the pixels A-D are not noise points.
By traversing the entire binary image with the sliding window and applying the denoising rule of formula (2), the denoising processing of the binary image can be completed, and a denoised binary image is obtained.
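As an illustrative, non-limiting sketch (written in Python with NumPy, and using the hypothetical function name denoise_binary), the sliding-window denoising rule of formula (2) may be realized as follows, assuming a 0/1 binary image in which 0 represents the target object and 1 represents the background:

```python
import numpy as np

def denoise_binary(binary: np.ndarray, step: int = 1) -> np.ndarray:
    """Sliding-window denoising of a 0/1 binary image (0 = target, 1 = background)."""
    out = binary.copy()
    h, w = out.shape
    for y in range(0, h - 1, step):
        for x in range(0, w - 1, step):
            window = out[y:y + 2, x:x + 2]   # 2x2 window, a view into `out`
            s = int(window.sum())            # sum of the 4 pixel values, formula (2)
            if s in (1, 2):                  # isolated white noise inside the target
                window[:] = 0
            elif s == 3:                     # isolated black noise inside the background
                window[:] = 1
            # s == 0 or s == 4: window lies fully inside target/background, keep as-is
    return out
```

The window scale and step are fixed at 2×2 and 1 here only to match the example above; larger values could be used if a coarser denoising precision is acceptable.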
The above description only takes the denoising process of one binary image as an example; for the binary images corresponding to the plurality of images to be processed acquired by the acquisition device 100, the denoising of each binary image can be completed by applying the above process in turn.
S520, carrying out boundary search on the target object row by row and column by column based on the pixel values on the binary image to obtain a horizontal boundary and a vertical boundary of the target object on the binary image.
In the disclosed embodiment, the boundary range of the target object includes a horizontal boundary and a vertical boundary, that is, the boundary range of the target object is a rectangular frame area, and the whole boundary range is formed by enclosing two horizontal boundaries and two vertical boundaries of the rectangular frame.
For example, in one example, a binary image of an image to be processed may be shown in fig. 7 (c), and a process of performing a boundary search on the binary image may be shown in fig. 9, and a boundary search process of an example of the present disclosure will be described below with reference to fig. 9.
As shown in fig. 9, first, according to the pixel values on the binary image, the pixel values are detected column by column from left to right, and for each column the number of black pixels in the column is counted, expressed as count(0) = Σ(1 − value_i) for i = 1 to n, where i denotes the i-th pixel in the column, n denotes the image height, and value_i represents the pixel value of the i-th pixel.
If count(0) = 0, it indicates that the number of pixels with a pixel value of 0 in the column, that is, the number of black pixels, is 0. In this case, detection of the next column is continued until the number of black pixels in a certain column is not 0.
For example, when it is detected that the number of black pixels in the k-th column of pixels is not 0, that is, the column includes black pixels, detection of the next column (the k+1-th column) is continued, and the count is set to p=1. If the k+1-th column of pixels does not include black pixels, the black pixels detected in the k-th column are regarded as noise points, and the count p is reset to zero. Conversely, if the k+1-th column of pixels includes black pixels, the count p is incremented by 1, that is, p=2, and detection of the k+2-th column is continued.
In this way, when the count p reaches a preset value, the k-th column of pixels and the subsequent consecutive preset number of columns of pixels all include black pixels, and at this time the coordinate information corresponding to the k-th column of pixels can be determined as the left boundary of the target object on the binary image.
Similarly, the right boundary of the target object on the binary image can be obtained by detecting column by column from right to left, which can be understood and fully implemented by those skilled in the art with reference to the foregoing and is not repeated in this disclosure. After the left and right boundaries are determined, the horizontal boundaries of the boundary range of the target object are determined.
For the vertical boundary, as shown in fig. 9, the pixel values are detected row by row from top to bottom according to the pixel values on the binary image, and for each row the number of black pixels in the row is counted, expressed as count(0) = Σ(1 − value_i) for i = 1 to m, where i denotes the i-th pixel in the row, m denotes the image width, and value_i represents the pixel value of the i-th pixel.
If count(0) = 0, it indicates that the number of pixels with a pixel value of 0 in the row, that is, the number of black pixels, is 0. In this case, detection of the next row is continued until the number of black pixels in a certain row is not 0.
For example, when it is detected that the number of black pixels in the j-th row of pixels is not 0, that is, the row includes black pixels, detection of the next row (the j+1-th row) is continued, and the count is set to p=1. If the j+1-th row of pixels does not include black pixels, the black pixels detected in the j-th row are regarded as noise points, and the count p is reset to zero. Conversely, if the j+1-th row of pixels includes black pixels, the count p is incremented by 1, that is, p=2, and detection of the j+2-th row is continued.
In this way, when the count p reaches a preset value, the j-th row of pixels and the subsequent consecutive preset number of rows of pixels all include black pixels, and at this time the coordinate information corresponding to the j-th row of pixels can be determined as the upper boundary of the target object on the binary image.
Similarly, the lower boundary of the target object on the binary image can be obtained by detecting row by row from bottom to top, which can be understood and fully implemented by those skilled in the art with reference to the foregoing and is not repeated in this disclosure. After the upper and lower boundaries are determined, the vertical boundaries of the boundary range of the target object are determined.
It should be noted that, referring to fig. 9, in the video communication scenario of the example of the present disclosure, considering that often only the upper body of the user is captured, when performing the boundary search on the human body, the lower boundary of the binary image may be taken as the lower boundary of the target object by default; that is, the boundary search for the lower boundary is not required, and only the left boundary, the upper boundary, and the right boundary need to be determined.
In addition, as shown in fig. 9, after the left boundary is determined, in the process of searching for the upper boundary, it is not necessary to detect all pixels of the entire row; only the pixels from the left boundary to the rightmost end need to be detected. Similarly, after the upper boundary is determined, in the process of searching for the right boundary, it is not necessary to detect all pixels of the entire column; only the pixels from the upper boundary to the lowermost end need to be detected. In this way, redundant detection of a large number of pixels can be reduced, and the operation efficiency is improved.
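As an illustrative, non-limiting sketch of the column-by-column left-boundary search described above (the function name find_left_boundary and the default preset value are assumptions for illustration; the right, upper, and lower boundaries can be obtained analogously by changing the scanning direction, and the range-restriction optimization described above is omitted for brevity):

```python
import numpy as np

def find_left_boundary(binary: np.ndarray, preset: int = 3) -> int:
    """Scan columns left to right; return the index of the first column that starts a run
    of `preset` consecutive columns each containing at least one black (value 0) pixel."""
    h, w = binary.shape
    p, start = 0, -1
    for col in range(w):
        if np.count_nonzero(binary[:, col] == 0) > 0:   # count(0) > 0 for this column
            if p == 0:
                start = col                              # candidate left boundary
            p += 1
            if p >= preset:
                return start
        else:
            p, start = 0, -1                             # the detected run was noise, reset
    return -1                                            # no boundary found
```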
S530, determining a boundary range based on the horizontal boundary and the vertical boundary.
After the horizontal boundary and the vertical boundary of the target object on the binary image are determined, the boundary range of the target object on the image to be processed can be determined according to the coordinate information of the horizontal boundary and the vertical boundary. For example, in the example of fig. 9, the rectangular range framed by the left boundary, the upper boundary, the right boundary, and the lower boundary represents the boundary range of the target object.
The above description only takes the boundary search process of one image to be processed as an example; for the plurality of images to be processed acquired by the acquisition device 100, the boundary range of the target object on each image to be processed can be obtained by applying the above process in turn.
As can be seen from the foregoing, in the embodiment of the present disclosure, the boundary range of the target object can be quickly determined based on the binary image boundary searching method, and the boundary of the target object is detected with high accuracy, providing an accurate data basis for subsequent image cutting.
As shown in fig. 10, in some embodiments, the light field image processing method of the present disclosure includes a process of cutting the image to be processed based on the boundary range, including:
S531, determining the coordinates of the central point of the target object according to the boundary range.
S532, determining the center point coordinates of the target object as the center point coordinates of the target image, and cutting the image to be processed according to the preset image scale to obtain the target image.
Still taking fig. 9 as an example, after obtaining the boundary range of the target object, the coordinates (left, top) of the upper left corner of the boundary range can be obtained based on the coordinate information left of the left boundary and the coordinate information top of the upper boundary.
Meanwhile, the size m×n of the binary image is known; in one example, the size of the binary image is 4096×3000 pixels. The center point coordinates (Cx, Cy) of the target object are then expressed as:
Cx=left+(right-left)/2
Cy=top+(3000-top)/2
After the center point coordinates O (Cx, Cy) of the boundary range of the target object are determined, the coordinates O (Cx, Cy) can be used as the center point coordinates of the target image. The image range of the target image is obtained by expanding outward from the center point coordinates according to the preset image scale, and the image to be processed is then cut based on this image range to obtain the target image.
For example, in the example of fig. 9, assuming that the preset image scale of the target image is 2560×2560 pixels, the image range obtained by expanding outward from the center point coordinates O (Cx, Cy) to the preset image scale is shown by the dashed-line box in the figure, and the scale of the dashed-line box is 2560×2560 pixels. The image to be processed is then cut based on the image range of the dashed-line box to obtain the target image.
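As an illustrative, non-limiting sketch of this center-based cutting step (the function name crop_to_target is assumed, and the clamping of the crop window at the image border is an assumption not specified above):

```python
import numpy as np

def crop_to_target(image: np.ndarray, cx: int, cy: int, size: int = 2560) -> np.ndarray:
    """Cut a size x size window centred on the target's centre point (cx, cy),
    clamping the window so the crop never runs outside the image."""
    h, w = image.shape[:2]
    half = size // 2
    x0 = min(max(cx - half, 0), w - size)   # clamp left edge of the crop window
    y0 = min(max(cy - half, 0), h - size)   # clamp top edge of the crop window
    return image[y0:y0 + size, x0:x0 + size]
```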
Referring to fig. 11, the first row of 4 images in fig. 11 are the 4 images to be processed acquired by the cameras C1 to C4 of the acquisition device 100, respectively. The second row of 4 images in fig. 11 shows the effect of cutting the images to be processed at a fixed position according to the conventional scheme. The third row of 4 images in fig. 11 shows the effect of cutting the images to be processed based on the boundary search of the target object by using the method of the embodiment of the present disclosure.
As can be seen by comparing the rows of fig. 11, after the images to be processed are cut by the conventional scheme, the person is not located at the center of the image and deviates to a large degree, and part of the body of the person may even be cut away. In contrast, after the images to be processed are cropped by adopting the embodiment of the disclosure, the person is always kept at the center of the image, providing an accurate data basis for subsequent viewpoint fusion.
As can be seen from the foregoing, in the embodiment of the present disclosure, the image to be processed is cut based on the boundary range of the target object, so that the risk of placing the target object at the edge of the image and even cutting off the target object can be reduced, the quality of the cut image is improved, a better data base is provided for the subsequent viewpoint fusion processing, and the quality and efficiency of the light field video communication are further improved.
In the embodiment of the disclosure, after the target viewpoint information and each cut target image are determined, the viewpoint fusion processing can be performed on each target image by using the target viewpoint information, and finally, the target light field image corresponding to the target viewpoint information can be obtained.
In some embodiments, the process of performing viewpoint fusion processing on the target images may be implemented by using a deep neural network (DNN) based multi-view stereo (MVS) network model, as described below in connection with fig. 12.
As shown in fig. 12, in some embodiments, a light field image processing method of an example of the present disclosure, a process of performing a view fusion process on each target image according to target view information, includes:
S1210, inputting at least two images to be processed in the plurality of images to be processed into a pre-trained depth network model to obtain a depth map of the target object output by the depth network model.
S1220, performing viewpoint fusion processing on the depth map based on the target viewpoint information to obtain a target viewpoint depth map under the viewpoint corresponding to the target viewpoint information.
S1230, inputting the target image, the target viewpoint depth map and the target viewpoint information into a pre-trained viewpoint fusion model to obtain a target light field image output by the viewpoint fusion model.
As shown in fig. 13, in an embodiment of the present disclosure, the MVS network model includes a depth network model and a viewpoint fusion model. The depth network model is used for predicting the depth characteristics of the target object to obtain a depth map of the target object; for example, the depth network model may be a depth prediction model based on a DepthNet architecture. The viewpoint fusion model is used for predicting a viewpoint image of the target object under a new viewpoint; for example, the viewpoint fusion model may be a neural network model based on an MVSNet architecture.
In the embodiment of the present disclosure, in combination with the scenes shown in fig. 1 and 2, the cameras C1 to C4 on the acquisition device 100 need to be calibrated first to obtain an internal reference matrix and an external reference matrix of each camera.
Specifically, the optical axes of the 4 cameras can be adjusted to the midpoint position of a clear plane, where the clear plane refers to a plane in which the imaging of the cameras is clear and is related to the focal length of each camera. Then, the coordinate system of the camera C1 can be used as the world coordinate system, and the other cameras are calibrated by using Zhang's calibration method to obtain the internal reference matrix and the external reference matrix of each camera. The internal reference matrices are denoted as K1, K2, K3 and K4, respectively, and the external reference matrices are denoted as (R1, t1), (R2, t2), (R3, t3) and (R4, t4), respectively, where R represents the relative rotation matrix and t represents the relative translation matrix.
It will be appreciated that camera internal and external parameters are necessary parameters for the camera to perform a coordinate system transformation, and that camera internal and external parameters are in the form of matrices, i.e. camera internal and external matrices. The camera internal reference matrix is used for converting a camera coordinate system into a pixel coordinate system, and the camera external reference matrix is used for converting a world coordinate system into a camera coordinate system.
For the specific algorithm and process of camera calibration, those skilled in the art can understand and fully implement Zhang's calibration method in the related art, which is not repeated in this disclosure.
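As an illustrative, non-limiting sketch, Zhang's calibration for a single camera may be carried out with OpenCV as follows (the chessboard pattern size and the variable chessboard_images, a list of captured calibration frames, are assumptions for illustration; the relative external references between cameras can further be obtained by jointly calibrating each camera against the camera C1):

```python
import cv2
import numpy as np

pattern = (9, 6)                                   # inner chessboard corners (assumed)
objp = np.zeros((pattern[0] * pattern[1], 3), np.float32)
objp[:, :2] = np.mgrid[0:pattern[0], 0:pattern[1]].T.reshape(-1, 2)

objpoints, imgpoints = [], []
for img in chessboard_images:                      # calibration frames for one camera (assumed)
    gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
    found, corners = cv2.findChessboardCorners(gray, pattern)
    if found:
        objpoints.append(objp)
        imgpoints.append(corners)

# K is the internal reference (intrinsic) matrix; rvecs/tvecs give each view's pose
ret, K, dist, rvecs, tvecs = cv2.calibrateCamera(
    objpoints, imgpoints, gray.shape[::-1], None, None)
```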
With continued reference to fig. 13, in an actual video communication scenario, the synchronization box may control the cameras C1 to C4 of the capturing device 100 to capture images simultaneously, so as to obtain the images I1 to I4 to be processed respectively. The synchronization box is a synchronization hardware device used for controlling the cameras C1 to C4 to acquire images synchronously, so that the acquired images I1 to I4 to be processed are images of the target object at different viewing angles acquired at the same moment.
In the embodiment of the disclosure, at least two of the images to be processed I1 to I4 need to be input to the depth network model; for example, in the example of fig. 13, the images to be processed I1 and I2 may be input to the depth network model. The depth network model is used for predicting the depth information of each point on the target object and outputting a depth map of the target object, where each pixel on the depth map can represent the depth value of a point on the target object.
The depth network model can be obtained by training based on sample labeling data in advance, and the conventional supervised training process is adopted for the network training process of the depth network model, which is not described in detail in the present disclosure.
After the depth map of the target object is obtained, the depth map needs to be subjected to viewpoint fusion processing in combination with the target viewpoint information, so as to obtain the target viewpoint depth map under the new viewpoint. As can be seen from the foregoing, the target viewpoint information refers to the current viewpoint information of the user B at the display device 200; that is, the depth map needs to be converted to the viewpoint corresponding to the target viewpoint information.
It should be noted that, in the process of performing viewpoint fusion processing on the depth map in combination with the target viewpoint information, coordinate system transformation needs to be implemented according to the internal reference matrix and the external reference matrix of the camera. Since the size of the cropped target image is changed relative to the size of the image to be processed, the internal reference matrix of the camera is also changed, and therefore the internal reference matrix of each camera needs to be updated.
For example, taking the camera C1 as an example, the internal reference matrix K1 can be expressed in the standard form as:

K1 = [ fx  0  u0 ]
     [ 0  fy  v0 ]
     [ 0   0   1 ]

where fx and fy denote the focal lengths in pixels and (u0, v0) denotes the principal point.
The image scale of the image to be processed is 4096×3000 pixels, and the target image scale after cutting is 2560×2560 pixels. Then the horizontal scaling factor scale_x=4096/2560, the vertical scaling factor scale_y=3000/2560, the horizontal offset trans_x=4096/2-Cx, and the vertical offset trans_y=3000/2-Cy in the cutting process. Thus, the updated internal reference matrix K1' of the camera C1 can be obtained from K1 by applying these scaling factors and offsets.
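As a hedged illustration only (the disclosure's exact combined update using scale_x, scale_y, trans_x and trans_y is not reproduced here), the sketch below shows one common convention for updating an intrinsic matrix when a crop window centred on (Cx, Cy) is cut out with no rescaling: the principal point shifts by the crop offset while the focal lengths stay unchanged. The function name update_intrinsics_for_crop is hypothetical; if the cropped image were additionally resized, the scaling factors above would further scale the focal lengths and the principal point.

```python
import numpy as np

def update_intrinsics_for_crop(K: np.ndarray, cx: int, cy: int, size: int = 2560) -> np.ndarray:
    """Shift the principal point for a size x size crop centred on (cx, cy).
    Assumes a pure crop with no rescaling; focal lengths are unchanged."""
    x0 = cx - size // 2          # top-left corner of the crop window
    y0 = cy - size // 2
    K_new = K.copy()
    K_new[0, 2] -= x0            # u0' = u0 - x0
    K_new[1, 2] -= y0            # v0' = v0 - y0
    return K_new
```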
The above description only takes the internal reference matrix updating process of the camera C1 as an example; the updating process for the cameras C2 to C4 is the same and is not repeated in the disclosure.
Viewpoint fusion processing is then performed on the depth map according to the target viewpoint information based on the updated internal reference matrix and the external reference matrix, so that the depth map is converted to the viewpoint corresponding to the target viewpoint information to obtain the target viewpoint depth map. For the specific procedure and principle of the viewpoint fusion processing, those skilled in the art can understand and fully implement the relevant technology, which is not repeated in this disclosure.
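The disclosure does not limit the specific warping algorithm; as an illustrative, non-limiting sketch, one common way to convert a depth map to the target viewpoint is forward projection with a simple z-buffer, shown below under the assumed extrinsic convention X_cam = R·X_world + t (the function name warp_depth_to_viewpoint is hypothetical, and hole filling in the warped map is omitted):

```python
import numpy as np

def warp_depth_to_viewpoint(depth, K_src, R_src, t_src, K_tgt, R_tgt, t_tgt, out_shape):
    """Forward-warp a depth map from a source camera to a target viewpoint."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).reshape(-1, 3).T           # 3 x N homogeneous pixels
    X_src = (np.linalg.inv(K_src) @ pix) * depth.reshape(1, -1)                 # back-project to source camera coords
    X_world = R_src.T @ (X_src - t_src.reshape(3, 1))                           # source camera coords -> world coords
    X_tgt = R_tgt @ X_world + t_tgt.reshape(3, 1)                               # world coords -> target camera coords
    z = X_tgt[2]
    uv = (K_tgt @ X_tgt)[:2] / np.clip(z, 1e-6, None)                           # project to target pixel coords
    out = np.full(out_shape, np.inf)
    ui, vi = np.round(uv[0]).astype(int), np.round(uv[1]).astype(int)
    valid = (z > 0) & (ui >= 0) & (ui < out_shape[1]) & (vi >= 0) & (vi < out_shape[0])
    # z-buffer: keep the nearest depth when several source pixels land on the same target pixel
    for x, y, d in zip(ui[valid], vi[valid], z[valid]):
        if d < out[y, x]:
            out[y, x] = d
    return np.where(np.isinf(out), 0.0, out)                                    # unfilled pixels set to 0
```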
With continued reference to fig. 13, the cropping module performs cropping on the images to be processed I1 to I4 acquired by the cameras C1 to C4 based on the foregoing method of the present disclosure, so as to obtain a target image corresponding to each image to be processed, namely the target images I1' to I4'. Those skilled in the art can refer to the foregoing for details, which are not repeated here.
The input of the viewpoint fusion model includes the target viewpoint information, the target viewpoint depth map, and the target images I1' to I4', and the output of the viewpoint fusion model is the target light field image corresponding to the target viewpoint information. The viewpoint fusion model can be obtained by training in advance based on sample labeling data; the network training process of the viewpoint fusion model adopts a conventional supervised training process, which is not described in detail in the present disclosure.
Through the above method, the target light field image can be obtained. In some embodiments, after the target light field image is obtained, the acquisition device 100 may send the target light field image to the display device 200, and the display device 200 may render and display the target light field image on the display screen, so that the user B at the display device 200 can see light field video that changes along with the viewpoint of the user B, presenting an immersive naked-eye 3D effect.
As can be seen from the foregoing, in the embodiment of the present disclosure, the image to be processed is cut based on the boundary range of the target object, so that the risk of placing the target object at the edge of the image and even cutting off the target object can be reduced, the quality of the cut image is improved, a better data base is provided for the subsequent viewpoint fusion processing, and the quality and efficiency of the light field video communication are further improved.
In some embodiments, the present disclosure provides a light field image processing apparatus, referring to fig. 14, the light field image processing apparatus of an example of the present disclosure includes:
an image acquisition module 10 configured to acquire a plurality of images to be processed acquired by a plurality of cameras provided on an acquisition device, respectively; the plurality of images to be processed are acquired images with different visual angles of the target object;
A boundary search module 20 configured to determine, for each image to be processed, a boundary range of the target object on the image to be processed;
the cropping module 30 is configured to crop the image to be processed according to the boundary range and the preset image scale to obtain a target image corresponding to each image to be processed;
the viewpoint fusion module 40 is configured to perform viewpoint fusion processing on each target image according to the target viewpoint information, so as to obtain a target light field image corresponding to the target viewpoint information; the target viewpoint information indicates position information of eyes of an observer at the display device side.
In some implementations, the boundary search module 20 is configured to:
performing binarization processing on each image to be processed to obtain a binary image of the target object;
carrying out boundary search on the target object row by row and column by column in sequence based on the pixel value on the binary image to obtain a horizontal boundary and a vertical boundary of the target object on the binary image;
based on the horizontal boundary and the vertical boundary, a boundary range is determined.
In some implementations, the boundary search module 20 is configured to:
detecting the number of black pixels of the column of pixels from left to right in sequence on the basis of pixel values on the binary image, and determining coordinate information corresponding to the first column of pixels as a left boundary of a horizontal boundary in response to the number of black pixels of the first column of pixels and the continuous preset number of columns of pixels after the first column of pixels are all larger than a first preset threshold;
Detecting the number of black pixels of the column of pixels from right to left sequentially column by column, and determining coordinate information corresponding to the second column of pixels as a right boundary of the horizontal boundary in response to the number of black pixels of the second column of pixels and the subsequent continuous preset number of columns of pixels being larger than a second preset threshold;
detecting the number of black pixels of the row of pixels line by line in sequence from top to bottom, and determining coordinate information corresponding to the first row of pixels as an upper boundary of a vertical boundary in response to the number of black pixels of the first row of pixels and the continuous preset number of rows of pixels behind the first row of pixels being larger than a third preset threshold;
detecting the number of black pixels of the row line by line in sequence from bottom to top, and determining coordinate information corresponding to the pixels of the second row as a lower boundary of the vertical boundary in response to the number of black pixels of the second row and the continuous preset number of the pixels of the row after the pixels of the second row are all larger than a fourth preset threshold.
In some implementations, the boundary search module 20 is configured to:
carrying out matting processing on each image to be processed to obtain a foreground image which corresponds to each image to be processed and comprises a target object;
and carrying out binarization processing on each foreground image to obtain a binary image of the target object.
In some implementations, the boundary search module 20 is configured to:
searching on the binary image according to a preset step length by using a sliding window with a preset scale based on the pixel value on the binary image;
in each sliding window, denoising processing is performed on pixels included in the sliding window based on the sum of pixel values of the pixels included in the sliding window.
In some implementations, the cropping module 30 is configured to:
determining the coordinates of the central point of the target object according to the boundary range;
and determining the center point coordinate of the target object as the center point coordinate of the target image, and cutting the image to be processed according to the preset image scale to obtain the target image.
In some implementations, the view fusion module 40 is configured to:
inputting at least two images to be processed in a plurality of images to be processed into a pre-trained depth network model to obtain a depth map of a target object output by the depth network model;
performing viewpoint fusion processing on the depth map based on the target viewpoint information to obtain a target viewpoint depth map under the viewpoint corresponding to the target viewpoint information;
and inputting the target image, the target viewpoint depth map and the target viewpoint information into a pre-trained viewpoint fusion model to obtain a target light field image output by the viewpoint fusion model.
In some embodiments, the apparatus of the present disclosure is applied to the acquisition device, and further comprises a transmission module configured to:
the target light field image is sent to the display device to cause the display device to render display the target light field image.
As can be seen from the foregoing, in the embodiment of the present disclosure, the image to be processed is cut based on the boundary range of the target object, so that the risk of placing the target object at the edge of the image and even cutting off the target object can be reduced, the quality of the cut image is improved, a better data base is provided for the subsequent viewpoint fusion processing, and the quality and efficiency of the light field video communication are further improved.
In some embodiments, the present disclosure provides a video communication system, which may be as shown in fig. 1, comprising:
a display device 200 including an image capture device and a first controller;
the acquisition device 100 comprises a plurality of cameras and a second controller, at least one of the first controller and the second controller being for performing the method according to any of the embodiments described above.
In some embodiments, the present disclosure provides a storage medium storing computer instructions for causing a computer to perform the method of any of the above embodiments.
In some embodiments, the present disclosure provides an electronic device comprising:
a processor; and
and a memory storing computer instructions for causing the processor to perform the method of any of the embodiments described above.
In the embodiment of the present disclosure, the electronic device may be the acquisition device 100 or the display device 200, which is not limited in this disclosure. Specifically, fig. 15 shows a schematic structural diagram of an electronic device 600 suitable for implementing the method of the present disclosure, and by using the electronic device shown in fig. 15, the corresponding functions of the processor, the controller, and the storage medium described above may be implemented.
As shown in fig. 15, the electronic device 600 includes a processor 601 that can perform various appropriate actions and processes according to a program stored in a memory 602 or a program loaded into the memory 602 from a storage portion 608. In the memory 602, various programs and data required for the operation of the electronic device 600 are also stored. The processor 601 and the memory 602 are connected to each other via a bus 604. An input/output (I/O) interface 605 is also connected to bus 604.
The following components are connected to the I/O interface 605: an input portion 606 including a keyboard, a mouse, and the like; an output portion 607 including a cathode ray tube (CRT), a liquid crystal display (LCD), a speaker, and the like; a storage portion 608 including a hard disk and the like; and a communication portion 609 including a network interface card such as a LAN card or a modem. The communication portion 609 performs communication processing via a network such as the Internet. A drive 610 is also connected to the I/O interface 605 as needed. A removable medium 611, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 610 as needed, so that a computer program read therefrom is installed into the storage portion 608 as needed.
In particular, according to embodiments of the present disclosure, the above method processes may be implemented as a computer software program. For example, embodiments of the present disclosure include a computer program product comprising a computer program tangibly embodied on a machine-readable medium, the computer program comprising program code for performing the method described above. In such an embodiment, the computer program can be downloaded and installed from a network through the communication portion 609, and/or installed from the removable medium 611.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
It should be apparent that the above embodiments are merely examples given for clarity of illustration and are not limiting of the embodiments. Other variations or modifications based on the above description will be apparent to those of ordinary skill in the art. It is neither necessary nor possible to exhaustively list all embodiments here. Obvious variations or modifications derived therefrom by those skilled in the art remain within the scope of the present disclosure.

Claims (12)

1. A light field image processing method, comprising:
acquiring a plurality of images to be processed respectively acquired by a plurality of cameras arranged on acquisition equipment; the plurality of images to be processed are acquired images with different visual angles of the target object;
for each image to be processed, determining the boundary range of the target object on the image to be processed;
cutting the images to be processed according to the boundary range and the preset image scale to obtain target images corresponding to each image to be processed;
performing viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information; the target viewpoint information represents position information of eyes of an observer at the display device side.
2. The method of claim 1, wherein for each image to be processed, determining a boundary range of the target object on the image to be processed comprises:
for each image to be processed, carrying out binarization processing on the image to be processed to obtain a binary image of the target object;
based on the pixel values on the binary image, sequentially carrying out boundary search on the target object row by row and column by column to obtain a horizontal boundary and a vertical boundary of the target object on the binary image;
the boundary range is determined based on the horizontal boundary and the vertical boundary.
3. The method according to claim 2, wherein the performing boundary search on the target object sequentially row by row and column by column based on the pixel values on the binary image to obtain a horizontal boundary and a vertical boundary of the target object on the binary image includes at least one of:
detecting the number of black pixels of the column of pixels from left to right in sequence on the basis of the pixel values on the binary image, and determining coordinate information corresponding to the first column of pixels as a left boundary of the horizontal boundary in response to the number of black pixels of the first column of pixels and the following continuous preset number of columns of pixels being larger than a first preset threshold;
Detecting the number of black pixels of the column of pixels from right to left sequentially column by column, and determining coordinate information corresponding to the second column of pixels as a right boundary of the horizontal boundary in response to the number of black pixels of the second column of pixels and the following continuous preset number of columns of pixels being larger than a second preset threshold;
detecting the number of black pixels of the row of pixels line by line in sequence from top to bottom, and determining coordinate information corresponding to the first row of pixels as an upper boundary of the vertical boundary in response to the number of black pixels of the first row of pixels and the continuous preset number of rows of pixels behind the first row of pixels being larger than a third preset threshold;
detecting the number of black pixels of the row line by line in sequence from bottom to top, and determining coordinate information corresponding to the pixels of the second row as the lower boundary of the vertical boundary in response to the number of the black pixels of the second row and the continuous preset number of the pixels of the row after the second row being larger than a fourth preset threshold.
4. The method according to claim 2, wherein for each image to be processed, performing binarization processing on the image to be processed to obtain a binary image of the target object, including:
carrying out matting processing on each image to be processed to obtain a foreground image which corresponds to each image to be processed and comprises the target object;
And carrying out binarization processing on each foreground image to obtain a binary image of the target object.
5. The method according to claim 2, wherein before the boundary searching is sequentially performed on the target object row by row and column by column based on the pixel values on the binary image, the method further comprises:
searching on the binary image according to a preset step length by using a sliding window with a preset scale based on the pixel value on the binary image;
in each sliding window, denoising the pixels included in the sliding window based on the sum of pixel values of the pixels included in the sliding window.
6. The method according to claim 1, wherein the performing a cropping process on the to-be-processed image according to the boundary range and the preset image scale to obtain a target image corresponding to each to-be-processed image includes:
determining the center point coordinates of the target object according to the boundary range;
and determining the center point coordinate of the target object as the center point coordinate of the target image, and cutting the image to be processed according to the preset image scale to obtain the target image.
7. The method of claim 1, wherein performing viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information, comprises:
inputting at least two images to be processed in the plurality of images to be processed into a pre-trained depth network model to obtain a depth map of the target object output by the depth network model;
performing viewpoint fusion processing on the depth map based on the target viewpoint information to obtain a target viewpoint depth map under a viewpoint corresponding to the target viewpoint information;
and inputting the target image, the target viewpoint depth map and the target viewpoint information into a pre-trained viewpoint fusion model to obtain the target light field image output by the viewpoint fusion model.
8. The method according to claim 1, characterized by being applied to the acquisition device; after performing viewpoint fusion processing on each target image according to the target viewpoint information to obtain a target light field image corresponding to the target viewpoint information, the method further comprises:
and sending the target light field image to the display device so that the display device renders and displays the target light field image.
9. A light field image processing apparatus, comprising:
an image acquisition module configured to acquire a plurality of images to be processed respectively acquired by a plurality of cameras provided on an acquisition device; the plurality of images to be processed are acquired images with different visual angles of the target object;
the boundary searching module is configured to determine the boundary range of the target object on each image to be processed;
the cutting processing module is configured to cut the image to be processed according to the boundary range and a preset image scale to obtain a target image corresponding to each image to be processed;
the viewpoint fusion module is configured to perform viewpoint fusion processing on each target image according to target viewpoint information to obtain a target light field image corresponding to the target viewpoint information; the target viewpoint information represents position information of eyes of an observer at the display device side.
10. An electronic device, comprising:
a processor; and
memory storing computer instructions for causing the processor to perform the method according to any one of claims 1 to 8.
11. A video communication system, comprising:
the display device comprises an image acquisition device and a first controller;
acquisition device comprising a plurality of cameras and a second controller, at least one of the first controller and the second controller being for performing the method according to any one of claims 1 to 8.
12. A storage medium having stored thereon computer instructions for causing a computer to perform the method according to any one of claims 1 to 8.
CN202310796486.3A 2023-06-30 2023-06-30 Light field image processing method and device Pending CN116823691A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310796486.3A CN116823691A (en) 2023-06-30 2023-06-30 Light field image processing method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310796486.3A CN116823691A (en) 2023-06-30 2023-06-30 Light field image processing method and device

Publications (1)

Publication Number Publication Date
CN116823691A true CN116823691A (en) 2023-09-29

Family

ID=88128968

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310796486.3A Pending CN116823691A (en) 2023-06-30 2023-06-30 Light field image processing method and device

Country Status (1)

Country Link
CN (1) CN116823691A (en)


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination