CN111862106A - Image processing method based on light field semantics, computer device and storage medium - Google Patents

Image processing method based on light field semantics, computer device and storage medium

Info

Publication number
CN111862106A
CN111862106A (application CN201910360375.1A; granted publication CN111862106B)
Authority
CN
China
Prior art keywords
semantic
instance
focus
target
image
Prior art date
Legal status
Granted
Application number
CN201910360375.1A
Other languages
Chinese (zh)
Other versions
CN111862106B (en)
Inventor
刘睿洋 (Liu Ruiyang)
Current Assignee
Yaoke Intelligent Technology Shanghai Co ltd
Original Assignee
Yaoke Intelligent Technology Shanghai Co ltd
Priority date
Filing date
Publication date
Application filed by Yaoke Intelligent Technology Shanghai Co ltd filed Critical Yaoke Intelligent Technology Shanghai Co ltd
Priority to CN201910360375.1A
Publication of CN111862106A
Application granted
Publication of CN111862106B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06T: IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T 7/00: Image analysis
    • G06T 7/10: Segmentation; Edge detection
    • G06T 7/11: Region-based segmentation
    • G06T 2207/00: Indexing scheme for image analysis or image enhancement
    • G06T 2207/10: Image acquisition modality
    • G06T 2207/10052: Images from lightfield camera

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Image Analysis (AREA)

Abstract

According to the image processing method, computer device, and storage medium based on light field semantics, a focal image stack at a preset view angle in an original light field is established according to the original light field data, the stack consisting of a set of focal pictures at different depths; the semantic information of each instance contained in each focal picture in the stack is analyzed to form a semantic focal image stack; according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, the target focal picture of highest quality is selected to represent that instance, forming a target focal picture set; and the semantic information of the instances in the target focal picture set is propagated back to the original light field. The scheme of the application achieves an efficient representation of instance semantics and improves business application value.

Description

Image processing method based on light field semantics, computer device and storage medium
Technical Field
The present application relates to the field of image processing technologies, and in particular, to an image processing method, a computer device, and a storage medium based on light field semantics.
Background
Image semantic segmentation is one of the fundamental tasks of computer vision. The task takes a two-dimensional image as input, partitions the object regions in the image through a visual algorithm, and identifies the content of each region; that is, it determines the semantic category of every pixel while preserving the continuity of image regions. Traditional segmentation methods mainly build classifiers on statistical models such as conditional random fields and random forests; with the advent of deep learning, convolutional neural networks have enabled both efficient image classification and great progress on the segmentation problem. Meanwhile, with the development of multi-view geometry, more and more researchers are fusing stereoscopic vision information into the traditional monocular vision pipeline.
However, instance segmentation (Instance Segmentation), which extends semantic segmentation by further distinguishing different target objects of the same class, is more computationally complex than general semantic segmentation, so improving its computational efficiency is a technical problem the industry urgently needs to solve.
Disclosure of Invention
In view of the above-mentioned drawbacks of the prior art, it is an object of the present application to provide an image processing method, a computer device, and a storage medium based on light field semantics, which solve the problems of the prior art through algorithm optimization.
In order to achieve the above and other related objects, the present application provides an image processing method based on light field semantics, comprising: establishing a focal image stack at a preset view angle in an original light field according to the original light field data, wherein the focal image stack consists of a set of focal pictures at different depths; analyzing the semantic information of each instance contained in each focal picture in the focal image stack to form a semantic focal image stack; according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, selecting the target focal picture of highest quality to represent that instance, so as to form a target focal picture set; and propagating the semantic information of the instances in the target focal picture set to the original light field.
In an embodiment of the present application, the raw light field data is obtained from one or more image groups captured by the camera array corresponding to a scene at one or more time points simultaneously.
In an embodiment of the present application, each of the focal point images is obtained by performing depth-wise average sampling in a light field model of a scene corresponding to the original light field input data.
In an embodiment of the present application, the analyzing semantic information of each instance included in each focal image in the focal image stack includes: and carrying out example segmentation corresponding to each focus picture to obtain an example semantic image area and corresponding semantic confidence of each example.
In an embodiment of the present application, the instance segmentation method includes one of: Mask R-CNN, SDS, HyperColumns, CFM, DeepMask & SharpMask, MNC, ISFCN, FCIS, SIS, and PAN.
In an embodiment of the present application, the method for judging that instances in different focal pictures belong to the same object includes: performing cluster analysis on the categories of the objects to which the instances belong, according to the similarity between the bounding boxes of the same instance in different focal pictures and the depth difference between the different focal pictures in which that instance appears.
In one embodiment of the present application, the depth difference is measured by gaussian distance.
In an embodiment of the application, the selecting, according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, of the target focal picture of highest quality to represent that instance includes: acquiring a quality score for each instance in each focal picture, each quality score being obtained by jointly evaluating the semantic confidence and the sharpness of the instance's semantic image region in that focal picture.
In an embodiment of the present application, the sharpness includes an evaluation of the differences in pixel values of the pixels of the extracted semantic image region across different view angles.
In an embodiment of the present application, the different view angles include other view angles distinct from the preset view angle.
In an embodiment of the present application, the method further includes: associating and storing the depth information and the semantic information of the instances in each target focal picture set into a semantic representation set, the semantic information being associated with a semantic mask; and the propagating of the semantic information of the instances in the target focal picture set to the original light field includes: re-projecting, according to the depth information, the semantic mask corresponding to each instance semantic image region in each target focal picture set onto the corresponding instance at a target view angle.
In an embodiment of the present application, the re-projecting includes: for each current pixel at the target view angle, finding its corresponding pixels on the target focal pictures of different depths in the semantic representation set under the preset view angle, and taking, as the semantic mask value of the current pixel, the semantic mask value of the corresponding pixel on the target focal picture with the smallest depth whose semantics do not belong to the background class.
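A minimal sketch of the per-pixel re-projection rule just described (take, for each target-view pixel, the smallest-depth non-background semantic mask value); the `(depth, value)` pair representation and the `BACKGROUND` label are illustrative assumptions, not from the patent:

```python
# Illustrative constant: label value treated as background (assumption).
BACKGROUND = 0

def reproject_pixel(mask_values_by_depth):
    """For one target-view pixel, scan its corresponding pixels on the
    depth-sorted target focal pictures and return the first (smallest-depth)
    semantic mask value that is not background.

    mask_values_by_depth: iterable of (depth, mask_value) pairs.
    """
    for depth, value in sorted(mask_values_by_depth):
        if value != BACKGROUND:
            return value
    return BACKGROUND
```

With correspondences at depths 2.0 (background), 3.0 (instance 1), and 5.0 (instance 2), the rule picks instance 1, the nearest non-background hit.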
To achieve the above and other related objects, the present application provides a computer apparatus comprising: one or more memories for storing computer program instructions; and one or more processors configured to execute the computer program instructions to perform operations comprising: establishing a focal image stack at a preset view angle in an original light field according to the original light field data, wherein the focal image stack consists of a set of focal pictures at different depths; analyzing the semantic information of each instance contained in each focal picture in the focal image stack to form a semantic focal image stack; according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, selecting the target focal picture of highest quality to represent that instance, so as to form a target focal picture set; and propagating the semantic information of the instances in the target focal picture set to the original light field.
In an embodiment of the present application, the raw light field data is obtained from one or more image groups captured by the camera array corresponding to a scene at one or more time points simultaneously.
In an embodiment of the present application, each of the focal point images is obtained by performing depth-wise average sampling in a light field model of a scene corresponding to the original light field input data.
In an embodiment of the present application, the analyzing semantic information of each instance included in each focal image in the focal image stack includes: and carrying out example segmentation corresponding to each focus picture to obtain an example semantic image area and corresponding semantic confidence of each example.
In an embodiment of the present application, the instance segmentation method includes one of: Mask R-CNN, SDS, HyperColumns, CFM, DeepMask & SharpMask, MNC, ISFCN, FCIS, SIS, and PAN.
In an embodiment of the present application, the apparatus further performs a method comprising judging that instances in different focal pictures belong to the same object by: performing cluster analysis on the categories of the objects to which the instances belong, according to the similarity between the bounding boxes of the same instance in different focal pictures and the depth difference between the different focal pictures in which that instance appears.
In one embodiment of the present application, the depth difference is measured by gaussian distance.
In an embodiment of the application, the selecting, according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, of the target focal picture of highest quality to represent that instance includes: acquiring a quality score for each instance in each focal picture, each quality score being obtained by jointly evaluating the semantic confidence and the sharpness of the instance's semantic image region in that focal picture.
In an embodiment of the present application, the sharpness includes an evaluation of the differences in pixel values of the pixels of the extracted semantic image region across different view angles.
In an embodiment of the present application, the different view angles include other view angles distinct from the preset view angle.
In an embodiment of the present application, the method further includes: associating and storing the depth information and the semantic information of the instances in each target focal picture set into a semantic representation set, the semantic information being associated with a semantic mask; and the propagating of the semantic information of the instances in the target focal picture set to the original light field includes: re-projecting, according to the depth information, the semantic mask corresponding to each instance semantic image region in each target focal picture set onto the corresponding instance at a target view angle.
In an embodiment of the present application, the re-projecting includes: for each current pixel at the target view angle, finding its corresponding pixels on the target focal pictures of different depths in the semantic representation set under the preset view angle, and taking, as the semantic mask value of the current pixel, the semantic mask value of the corresponding pixel on the target focal picture with the smallest depth whose semantics do not belong to the background class.
To achieve the above and other related objects, the present application provides a non-transitory computer storage medium storing computer program instructions which, when executed by one or more processors, perform operations comprising: establishing a focal image stack at a preset view angle in an original light field according to the original light field data, wherein the focal image stack consists of a set of focal pictures at different depths; analyzing the semantic information of each instance contained in each focal picture in the focal image stack to form a semantic focal image stack; and according to the quality of the different focal pictures containing the same instance in the semantic focal image stack, selecting the target focal picture of highest quality to represent that instance, so as to form a target focal picture set.
As described above, according to the image processing method, computer apparatus, and storage medium based on light field semantics, a focal image stack at a preset view angle in an original light field is established according to the original light field data, the stack consisting of a set of focal pictures at different depths; the semantic information of each instance contained in each focal picture is analyzed to form a semantic focal image stack; according to the quality of the different focal pictures containing the same instance, the target focal picture of highest quality is selected to represent that instance, forming a target focal picture set; and the semantic information of the instances in the target focal picture set is propagated to the original light field. The scheme of the application achieves an efficient representation of instance semantics and improves business application value.
Drawings
Fig. 1 shows a schematic diagram of a light field 4D model.
Fig. 2 is a schematic structural diagram of a light field camera implemented by a camera array according to an embodiment of the present application.
Fig. 3 is a block diagram of an image processing system according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
Fig. 5 is a flowchart illustrating an image processing method based on light field semantics according to an embodiment of the present application.
Detailed Description
The following description of the embodiments of the present application is provided by way of specific examples, and other advantages and effects of the present application will be readily apparent to those skilled in the art from the disclosure herein. The present application is capable of other and different embodiments and of being practiced or of being carried out in various ways, and its several details are capable of modification in various respects, all without departing from the spirit and scope of the present application. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.
Embodiments of the present application will be described in detail below with reference to the accompanying drawings so that those skilled in the art to which the present application pertains can easily carry out the present application. The present application may be embodied in many different forms and is not limited to the embodiments described herein.
In order to clearly explain the present application, components that are not related to the description are omitted, and the same reference numerals are given to the same or similar components throughout the specification.
Throughout the specification, when a component is referred to as being "connected" to another component, this includes not only the case of being "directly connected" but also the case of being "indirectly connected" with another element interposed therebetween. In addition, when a component is referred to as "including" a certain constituent element, unless otherwise stated, it means that the component may include other constituent elements, without excluding other constituent elements.
When an element is referred to as being "on" another element, it can be directly on the other element, or intervening elements may also be present. When a component is referred to as being "directly on" another component, there are no intervening components present.
Although the terms first, second, etc. may be used herein to describe various elements in some instances, these elements should not be limited by these terms. These terms are only used to distinguish one element from another; for example, a first interface and a second interface. Also, as used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context indicates otherwise. It will be further understood that the terms "comprises", "comprising", "includes" and/or "including", when used in this specification, specify the presence of stated features, steps, operations, elements, components, items, species, and/or groups, but do not preclude the presence or addition of one or more other features, steps, operations, elements, components, species, and/or groups thereof. The terms "or" and "and/or" as used herein are to be construed as inclusive, meaning any one or any combination. Thus, "A, B or C" or "A, B and/or C" means any of the following: A; B; C; A and B; A and C; B and C; A, B and C. An exception to this definition occurs only when a combination of elements, functions, steps or operations is inherently mutually exclusive in some way.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used herein, the singular forms "a", "an" and "the" include plural forms as long as the words do not expressly indicate a contrary meaning. The term "comprises/comprising" when used in this specification is taken to specify the presence of stated features, regions, integers, steps, operations, elements, and/or components, but does not exclude the presence or addition of other features, regions, integers, steps, operations, elements, and/or components.
Terms indicating "lower", "upper", and the like relative to space may be used to more easily describe a relationship of one component with respect to another component illustrated in the drawings. Such terms are intended to include not only the meanings indicated in the drawings, but also other meanings or operations of the device in use. For example, if the device in the figures is turned over, elements described as "below" other elements would then be oriented "above" the other elements. Thus, the exemplary terms "under" and "beneath" all include above and below. The device may be rotated 90 or other angles and the terminology representing relative space is also to be interpreted accordingly.
Unless defined otherwise, all terms used herein, including technical and scientific terms, have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. Terms defined in commonly used dictionaries are additionally to be interpreted as having meanings consistent with related technical documents and the content of the present disclosure, and must not be interpreted in an idealized or overly formal sense unless expressly so defined.
Light field, as the name implies, relates to the distribution of a certain physical quantity of light in space.
As shown in FIG. 1, the light field model can be represented in a simplified manner by rays that intersect Two Parallel Planes (TPP), i.e., u-v and s-t.
The coordinates of the intersection points of a ray with the two planes are (s, t) and (u, v), respectively; that is, each four-dimensional tuple (u, v, s, t) uniquely represents one ray, and the light field is formed by all the rays in the space, which can be represented as LF (u, v, s, t).
Thus, any ray of light that passes in sequence through two parallel optical planes can be characterized using this two-plane method.
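The two-plane parameterization LF (u, v, s, t) can be sketched as a discretized 4-D array; the array layout, toy sizes, and function name below are illustrative assumptions, not part of the patent:

```python
import numpy as np

# Hypothetical discretized light field LF(u, v, s, t): (s, t) indexes the
# camera position on one plane, (u, v) the pixel on the sensor plane; each
# entry stores the radiance of one unique ray.
S, T, U, V = 3, 3, 64, 64                    # toy 3x3 camera array, 64x64 sensor
LF = np.zeros((S, T, U, V), dtype=np.float32)

def ray(lf: np.ndarray, u: int, v: int, s: int, t: int) -> float:
    """Radiance of the ray through (s, t) on one plane and (u, v) on the other."""
    return float(lf[s, t, u, v])
```

Each (u, v, s, t) tuple addresses exactly one ray, matching the uniqueness property stated above.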
In order to reconstruct the light field, light field data must first be acquired. An apparatus for acquiring light field data is, for example, a light field camera (LF Camera), of which there are several structural implementations: a microlens array disposed between the camera's main lens and the imaging sensor, or a camera array. The corresponding st and uv planes are then, for example, the main lens and the microlens array, the main lens and the sensor image plane, or the camera lens array and the imaging sensor array.
Fig. 2 shows a schematic structural diagram of a camera array in an embodiment.
A camera array is, as the name implies, a plurality of cameras arranged in an array of, for example, M × N (M, N > 0); the multiple cameras can simultaneously capture different images of a scene, and from the collection of these images the light field data can be analyzed.
For example, the camera array 200 shown in the figure is 3 × 3, although this is not a limitation. The distance between adjacent camera lenses 201 in each row is called the baseline; the wider the baseline, the more of an object behind an obstruction can be captured, i.e., a wide-baseline camera array enhances the ability to handle occlusions (de-occlusion).
Of course, it should be noted that the baseline width and the camera array structure can be set according to the actual size requirement of the scene (scene) to be image-captured.
In conjunction with the above, each camera lens in the camera array may be located in the (s, t) plane, while the imaging sensor of each camera is located in the (u, v) plane.
Fig. 3 is a schematic structural diagram of an image processing system according to an embodiment of the present application.
In this application scenario, an image pickup device 301 and a computer device 302 are provided.
The camera device 301 may be implemented with the light field camera structure of the foregoing embodiments, for example comprising one or more camera arrays, each including a plurality of cameras; a camera array in the camera device 301 collects light field data (e.g., one or more groups of pictures captured by the multiple cameras); the computer device 302 is communicatively connected to the camera device 301, so that the camera device 301 can transmit light field data to the computer device 302 and/or the computer device 302 can send control instructions to the camera device 301.
In some examples, the communication connection may be a wired connection over an electrical line, such as a connection to the counterpart device's USB or HDMI interface via a standard cable.
In some examples, the communication connection may also be a wireless connection, for example, a connection through a wireless communicator of the opposite end, such as WiFi, bluetooth, mobile communication module (2G/3G/4G/5G), and the like.
In some examples, the communication connection may also be a network connection, i.e., a long-range communication connection over a local area network and/or the internet.
The computer device 302 is configured to perform image processing using the light field data as an input and output a desired image result.
The computer device 302 may vary in its specific implementation according to the application scenarios of the above various embodiments, for example, in some examples, the computer device 302 may be integrated as a component in the same device as the camera device 301; alternatively, in some examples, the computer device 302 may be located in a different device from the camera device 301, for example, the computer device 302 may be implemented in an electronic device such as a desktop computer, a server/server group, a notebook computer, a tablet computer, or a smart phone, and is communicatively connected to the camera device 301 through a wired or wireless communication manner.
Fig. 4 is a schematic structural diagram of a computer device according to an embodiment of the present application.
The computer apparatus 400 comprises: one or more communicators 401, one or more memories 402, and one or more processors 403.
In the present embodiment, the number of components shown in the drawings is only an example, and is not limited thereto.
The communicator 401 is a communicator for communicating with an external device (for example, the camera device in the foregoing embodiments or other devices capable of providing light field data), and may be implemented by an interface circuit for communicating in a wired connection (for example, USB, HDMI) manner or a wireless connection (for example, WiFi, 2G/3G/4G/5G) manner.
The memory 402 is used for storing computer program instructions.
The processor 403, coupled to the communicator 401 and the memory 402, is configured to execute the computer program instructions to implement the desired image processing functions.
The memory 402 may include, but is not limited to, high speed random access memory, non-volatile memory. Such as one or more magnetic disk storage devices, flash memory devices, or other non-volatile solid-state storage devices.
The processor 403 may be a general-purpose processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the integrated circuit may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic device, or discrete hardware components.
It should be noted that, in some embodiments, if the computer device 400 has stored light field data in advance, it may also work off-line without including a communicator.
The computer device is capable of implementing a light field semantics based image processing method by executing computer program instructions.
As shown in fig. 5, a schematic flow chart of the image processing method based on light field semantics in the embodiment of the present application is shown.
The method may be implemented by software, for example, by the computer program instructions stored in the memory of the embodiment of fig. 4, when executed by the processor.
The method comprises the following steps:
step S501: and establishing a focal image stack under a preset visual angle in the original light field according to the original light field data.
In one or more embodiments, the predetermined viewing angle is, for example, a viewing angle of any camera in the camera array, such as a central viewing angle; the focal image stack is composed of a group of focal images with different depths formed by refocusing the original light field data under the preset visual angle.
Optionally, the depth intervals between the focal pictures in the focal image stack may be the same; that is, each focal picture is obtained by equidistant sampling along depth in the light field model of the scene corresponding to the original light field input data.
Assume the focal image stack contains N focal pictures, i.e., the stack size is N. Within a predefined coarse scene depth range [d_smin, d_smax], average sampling is used to generate N equidistant focal pictures S_i ∈ {S_0, S_1, …, S_{N−1}}. The depth of each focal picture can be represented by formula (1):

d_i = d_smin + i · (d_smax − d_smin) / (N − 1),  i = 0, 1, …, N − 1      (1)
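The equidistant depth sampling described above can be sketched as follows; the function and variable names (`focal_depths`, `d_smin`, `d_smax`, `n_slices`) are illustrative, not from the patent:

```python
import numpy as np

def focal_depths(d_smin: float, d_smax: float, n_slices: int) -> np.ndarray:
    """N equidistant focal-slice depths spanning the coarse scene depth range
    [d_smin, d_smax], i.e. average sampling along depth."""
    return np.linspace(d_smin, d_smax, n_slices)

depths = focal_depths(2.0, 10.0, 5)   # -> [2., 4., 6., 8., 10.]
```

Each depth then defines one refocused focal picture in the stack.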
step S502: semantic information of each instance contained in each focus picture in the focus image stack is analyzed to form a semantic focus image stack.
In one or more embodiments, since each focal picture corresponds to only one depth, it can be regarded as occlusion-free at that depth: if the refocused pixels come from the same object at the correct depth, a semantically meaningful part of the detected object is obtained after mixing, while objects at incorrect depths appear removed, blurred, or irregularly mixed; otherwise, if the pixels come from different objects in the scene, the refocused pixel is an irregular mixture of different objects whose semantic structure is destroyed. An instance is used to represent the rendering of an object in the image.
In each focal point picture, a part corresponding to an example needs to be segmented, and the segmented image part corresponding to the example can be an example semantic image area.
For example, although both the pedestrian a and the pedestrian B in the image are pedestrians, they belong to different individuals, i.e., instances, and we should segment them from the image in a differentiated manner.
Specifically, the instance splitting process may be performed using an instance splitting method.
In one or more embodiments, the instance segmentation method includes one of: Mask R-CNN, SDS, HyperColumns, CFM, DeepMask & SharpMask, MNC, ISFCN, FCIS, SIS, and PAN.
Optionally, instance segmentation is performed on each focal point image to obtain an instance semantic image region and a corresponding semantic confidence of each instance.
For example, suppose that K instances are detected in each focus picture S_i. The obtained instance semantic image regions are denoted P_i = {P_i^1, …, P_i^K}, the corresponding masks are denoted M_i = {M_i^1, …, M_i^K}, and the semantic confidence generated for each instance k ∈ K is denoted c_i^k, collected in the set C_i. Specifically, this can be expressed by the following formula (2):

P_i = {P_i^k}, M_i = {M_i^k}, C_i = {c_i^k}, k = 1, …, K    (2)
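A minimal sketch of collecting the per-picture segmentation outputs (P_i, M_i, C_i) into a semantic focal image stack; the `detector` callable is a stand-in for whichever segmentation network is used (e.g. Mask R-CNN), and the data layout is my assumption, not the patent's:

```python
import numpy as np
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class InstanceSemantics:
    region: np.ndarray    # P_i^k: the instance's image region (masked picture)
    mask: np.ndarray      # M_i^k: binary 0/1 matrix
    confidence: float     # c_i^k: semantic confidence

def build_semantic_stack(
    pictures: List[np.ndarray],
    detector: Callable[[np.ndarray], List[Tuple[np.ndarray, float]]],
) -> List[List[InstanceSemantics]]:
    """Return one list of InstanceSemantics per focus picture S_i."""
    stack = []
    for s_i in pictures:
        detections = detector(s_i)  # [(mask, confidence), ...]
        stack.append([
            InstanceSemantics(region=s_i * mask, mask=mask, confidence=conf)
            for mask, conf in detections
        ])
    return stack

# Toy detector: one "instance" covering every pixel brighter than 0.5
toy_detector = lambda img: [((img > 0.5).astype(np.uint8), 0.9)]
stack = build_semantic_stack([np.eye(4)], toy_detector)
```

In practice `detector` would wrap a trained network; the point is only the shape of the semantic focal stack that later steps consume.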
Step S503: according to the quality of the different focus pictures belonging to the same instance in the semantic focal image stack, the target focus picture with the highest quality is selected to represent the instance, so as to form a target focus picture set.
Since semantics alone cannot identify the correlation between instances in different focus pictures, the correlation between focus pictures needs to be determined; that is, it needs to be determined whether instances in different focus pictures belong to the same object.
In one or more embodiments, the categories of the objects to which the instances belong may be subjected to cluster analysis according to the similarity between the bounding boxes of each same instance in different focal point pictures and the depth difference between different focal point pictures in which each same instance is located, and each category obtained by clustering corresponds to one object.
For example, clustering here refers to clustering the detected objects across different focus pictures, using an unsupervised clustering method such as spectral clustering, and the correlation between instances can be judged by the following formula (3):

A(k_a, k_b) = IoU(b_a, b_b) + exp( −(d_a − d_b)² / (2σ²) )    (3)

where d denotes the depth of the focus picture in which the object is located. The first term evaluates the similarity between the bounding boxes corresponding to instances in different focus pictures; if they are similar, the detections may belong to the same instance. Also, since the same object may appear in different focus pictures with some degree of defocus, the second term models the depth difference between detections by a Gaussian distance. Each resulting cluster is then considered to represent one and the same instance.
Alternatively, after the above processing, if two or more instances are found to lie at very close depths, i.e., the depth difference between them is less than a depth threshold, their depths may be set to the average of their depths.
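A sketch of the pairwise affinity behind formula (3) — a bounding-box IoU term plus a Gaussian term on the depth gap; the equal weighting of the two terms and the value of σ are my assumptions:

```python
import numpy as np

def iou(b1, b2):
    """Intersection-over-union of two boxes given as (x0, y0, x1, y1)."""
    ix0, iy0 = max(b1[0], b2[0]), max(b1[1], b2[1])
    ix1, iy1 = min(b1[2], b2[2]), min(b1[3], b2[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area1 = (b1[2] - b1[0]) * (b1[3] - b1[1])
    area2 = (b2[2] - b2[0]) * (b2[3] - b2[1])
    return inter / (area1 + area2 - inter)

def affinity(box_a, depth_a, box_b, depth_b, sigma=1.0):
    """Formula (3): bounding-box similarity plus a Gaussian distance on depth."""
    depth_term = np.exp(-((depth_a - depth_b) ** 2) / (2.0 * sigma ** 2))
    return iou(box_a, box_b) + depth_term
```

The full affinity matrix over all detections can then be handed to an unsupervised method such as spectral clustering; each resulting cluster is treated as one instance.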
In addition, in order to express the semantics of an instance with the minimum amount of data, the quality score of the instance in each focus picture can be calculated, and the focus picture with the highest quality score is then selected to describe the semantics of the instance.
For instance k in focus picture S_i, the bounding box of instance k produced by object detection is b_i^k, the corresponding semantic mask of instance k in S_i is M_i^k, and the corresponding semantic confidence is c_i^k. The quality of the instance on the focal stack is measured by Q(k, S_i), which combines its semantic confidence and degree of focus:

Q(k, S_i) = W_s · c_i^k + W_f · F(P_i^k)    (4)

where W_s and W_f are adjustable weights, and F(P_i^k) indicates the degree of sharpness of the instance semantic image region.
In one embodiment, F(P_i^k) is calculated as follows:

F(P_i^k) = (1 / |M_i^k|) · Σ_{p ∈ P_i^k} M_i^k(p) · exp( −defocus(p) )    (5)

where P_i^k is the image region of instance k in focus picture S_i; M_i^k, the mask corresponding to k, is represented by a binary (0/1) matrix; and |·| denotes the pixel-wise sum over the mask. For phenomena such as ghosting in light-field refocusing, the method introduces defocus(p), which evaluates the differences between the values of a pixel as sampled under different viewing angles, to distinguish whether ghosting or similar phenomena occur.
In one or more embodiments, the difference may preferably be evaluated between the pixel value at the preset viewing angle and the corresponding pixels at the sub-viewing angles closest to the preset viewing angle.
For example, non-uniform samples with large pixel differences can be excluded by an unsupervised clustering algorithm such as mean shift clustering.
In one embodiment, defocus(p) is calculated as follows:

defocus(p) = (1 / |Ω(p)|) · Σ_{p_i ∈ Ω(p)} | I_(s_0,t_0)(p) − I_(s_i,t_i)(p_i) |    (6)

where (s_0, t_0) is the viewpoint of the preset viewing angle on the s-t plane, (s_i, t_i) are the viewpoints of the other viewing angles, and Ω(p) is the set of pixels under the other viewing angles corresponding to pixel p under the (s_0, t_0) viewing angle.
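A per-pixel sketch of defocus(p) as reconstructed above — the mean absolute difference between the pixel value at the preset viewpoint and its correspondences at the other viewpoints (the flat-array representation of the correspondence set Ω(p) is my assumption):

```python
import numpy as np

def defocus(center_value: float, other_view_values: np.ndarray) -> float:
    """Mean |I_(s0,t0)(p) - I_(si,ti)(p_i)| over the corresponding pixels p_i."""
    return float(np.mean(np.abs(np.asarray(other_view_values) - center_value)))

in_focus = defocus(0.5, np.array([0.5, 0.5, 0.5]))  # views agree
ghosted = defocus(0.5, np.array([0.1, 0.9, 0.5]))   # views disagree
```

Outlier correspondences with large pixel differences could first be discarded with an unsupervised step such as mean-shift clustering, as the text suggests.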
From the clustering result representing each object, the target focus picture containing the highest quality Q, i.e., the peak, is selected to represent the corresponding instance k, and the focus depth d_i and semantic information of that target focus picture are saved; this depth may be used as the depth of the corresponding instance.
Saving the depth d_i of each instance yields the instance depth set D_opt = [d_1, …, d_k] for the k instances of the focal image stack. According to D_opt, a target focus picture set S_opt can be generated; if the algorithm is based on Mask R-CNN, S_opt includes, at each depth d_k, the focus picture S_k and the semantic mask M_k of the corresponding instance, denoted S_opt = {(S_k, M_k)}.
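A sketch of picking, for one instance, the peak-quality focus picture with Q = W_s·c + W_f·F (the 0.5/0.5 weights are illustrative, not from the patent):

```python
import numpy as np

def quality(confidence: float, sharpness: float,
            w_s: float = 0.5, w_f: float = 0.5) -> float:
    """Q(k, S_i) = W_s * c_i^k + W_f * F(P_i^k)."""
    return w_s * confidence + w_f * sharpness

def select_target_picture(candidates):
    """candidates: [(depth_i, confidence, sharpness), ...] for one instance.
    Returns (depth of the peak-quality picture, its quality score)."""
    scores = [quality(c, f) for _, c, f in candidates]
    best = int(np.argmax(scores))
    return candidates[best][0], scores[best]

# Instance k seen in three focus pictures; the middle one is sharpest
depth_k, q_k = select_target_picture([(2.0, 0.9, 0.1),
                                      (3.0, 0.9, 0.8),
                                      (4.0, 0.7, 0.4)])
```

Collecting depth_k over all clusters gives D_opt, and pairing each depth with its picture and mask gives S_opt.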
Step S504: propagating the semantic information of the instances in the target focus picture set to the original light field.
Through the above process, the approximate depth d_i and rough positional relationship of each instance carrying semantic information under the preset viewing angle can be obtained, so the semantic correspondence of an instance under different viewing angles can be obtained by reprojection according to the depth d_i.
When the parameters of the virtual cameras at the preset viewing angle and the target viewing angle (intrinsic and extrinsic matrices, or extrinsics only, etc.) and the depth information are known, the reprojection transformation matrix from the preset viewing angle (for example, the central viewing angle) to the target viewing angle can be obtained as H; a pixel of the preset viewing angle is then converted to the corresponding pixel at the target viewing angle as follows:

p_(u,v)^(s_i,t_i) = H · p_(u,v)^(s_0,t_0)    (7)

where the preset viewing angle is (s_0, t_0), the target viewing angle is (s_i, t_i), and p_(u,v)^(s_0,t_0) and p_(u,v)^(s_i,t_i) are the associated pixels (in homogeneous coordinates) at the preset viewing angle and the target viewing angle, respectively.
The reprojection process operates on the mask M_opt corresponding to the preset viewing angle (defined to include the union of the semantic masks of the respective instances at the preset viewing angle). Each instance in the instance mask is projected to the target viewing angle to obtain the semantic mask of that instance at the target viewing angle, and all semantic masks are merged at the target viewing angle into M_(s,t), expressed by the following formula:

M_(s,t) = ∪_{i ∈ S_opt} M_(s,t)^i    (8)

where i indexes the instances in S_opt, M_(s,t)^i denotes the semantic mask of instance i under the target viewing angle (s, t), and M_(s,t) denotes the union of the semantic masks of all instances under the target viewing angle (s, t).
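A sketch of reprojecting binary instance masks to a target view and merging them — a plain forward warp with a given 3×3 matrix H in homogeneous pixel coordinates, ignoring the resampling holes a real implementation would fill:

```python
import numpy as np

def reproject_mask(mask: np.ndarray, H: np.ndarray) -> np.ndarray:
    """Forward-warp a binary mask with a 3x3 transform in homogeneous coords."""
    out = np.zeros_like(mask)
    h, w = mask.shape
    for v in range(h):
        for u in range(w):
            if mask[v, u]:
                x = H @ np.array([u, v, 1.0])
                u2, v2 = int(round(x[0] / x[2])), int(round(x[1] / x[2]))
                if 0 <= u2 < w and 0 <= v2 < h:
                    out[v2, u2] = 1
    return out

def merge_masks(instance_masks, homographies):
    """M_(s,t): pixel-wise union of every instance's reprojected mask."""
    merged = np.zeros_like(instance_masks[0])
    for mask, H in zip(instance_masks, homographies):
        merged |= reproject_mask(mask, H)
    return merged
```

Library routines such as OpenCV's `warpPerspective` would normally replace the per-pixel loop; the loop is kept here only to mirror formula (7) directly.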
After clustering, each instance class has its own depth value d_k. If this depth value is taken as the independent variable and the quality score Q of the focus picture as the dependent variable, the result should approximate a unimodal function. Therefore, given a quality score threshold, a depth range can be delimited around the depth corresponding to the quality score peak via the quality score interval from the threshold up to the peak, and during the reprojection operation from the preset viewing angle to the target viewing angle, this depth range can limit the search space for instance matching between the two viewing angles.
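A sketch of delimiting that search range: assuming Q(d_k) is roughly unimodal, walk outward from the peak while the score stays above the threshold (the threshold value is illustrative):

```python
import numpy as np

def depth_search_range(depths, scores, threshold):
    """Depth interval around the quality-score peak where Q >= threshold."""
    scores = np.asarray(scores)
    peak = int(np.argmax(scores))
    lo = hi = peak
    while lo > 0 and scores[lo - 1] >= threshold:
        lo -= 1
    while hi < len(scores) - 1 and scores[hi + 1] >= threshold:
        hi += 1
    return depths[lo], depths[hi]

# Quality peaks at depth 4; the range where Q >= 0.5 is [3, 5]
rng = depth_search_range([2, 3, 4, 5, 6], [0.1, 0.6, 0.9, 0.7, 0.2], 0.5)
```

Instance matching at the target view then only needs to consider candidates whose depths fall inside this interval.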
Further optionally, a focal image stack may also be obtained at the target viewing angle by the above method, and a semantic analysis similar to that performed on the focal image stack at the preset viewing angle can then be carried out to obtain the depth, semantic information, and so on of each instance; the degree of similarity is then compared with the semantic masks of similar depth reprojected from the preset viewing angle, thereby determining an accurate mapping relationship for each instance under different viewing angles.
Because the target focus picture selection process is performed according to quality scores, an efficient representation of light-field instance semantics can be achieved, which improves the value of commercial applications.
In summary, according to the image processing method, computer device, and storage medium based on light field semantics of the present application, a focal image stack at a preset viewing angle in an original light field is established according to original light field data, wherein the focal image stack is composed of a set of focus pictures of different depths; the semantic information of each instance contained in each focus picture in the focal image stack is analyzed to form a semantic focal image stack; according to the quality of the different focus pictures belonging to the same instance in the semantic focal image stack, the target focus picture with the highest quality is selected to represent the instance, so as to form a target focus picture set; and the semantic information of the instances in the target focus picture set is propagated to the original light field. The scheme of the present application can realize an efficient representation of instance semantics and improve commercial application value.
The above embodiments are merely illustrative of the principles and utilities of the present application and are not intended to limit the application. Any person skilled in the art can modify or change the above-described embodiments without departing from the spirit and scope of the present application. Accordingly, it is intended that all equivalent modifications or changes which can be made by those skilled in the art without departing from the spirit and technical concepts disclosed in the present application shall be covered by the claims of the present application.

Claims (21)

1. An image processing method based on light field semantics is characterized by comprising the following steps:
establishing a focal image stack under a preset visual angle in an original light field according to the original light field data; wherein the focal image stack is composed of a set of focal images of different depths;
analyzing semantic information of each instance contained in each focus picture in the focus image stack to form a semantic focus image stack;
according to the quality of different focus pictures belonging to the same instance in the semantic focus image stack, selecting a target focus picture with the highest quality to represent the instance so as to form a target focus picture set;
propagating semantic information of instances in the target focus picture set to the original light field.
2. The method of claim 1, wherein each of the focal images is sampled equally along a depth in a light field model of a scene to which the raw light field input data corresponds.
3. The method of claim 1, wherein analyzing semantic information for each instance contained in each focal spot image in the focal image stack comprises:
and performing instance segmentation on each focus picture to obtain an instance semantic image region and a corresponding semantic confidence of each instance.
4. The method of claim 3, wherein the instance segmentation method comprises: Mask R-CNN, SDS, HyperColumns, CFM, DeepMask & SharpMask, MNC, ISFCN, FCIS, SIS, and PAN.
5. The method according to claim 1, wherein the method of judging that the instances in different focal point images belong to the same object comprises:
and performing cluster analysis on the categories of the objects to which the instances belong, according to the similarity between the bounding boxes of each same instance in different focus pictures and the depth difference between the different focus pictures in which each same instance is located.
6. The method of claim 5, wherein the depth difference is measured by a Gaussian distance.
7. The method according to claim 3, wherein selecting a target focal point image with the highest quality to represent the instance according to the quality of different focal point images belonging to the same instance in the semantic focal point image stack comprises:
acquiring the quality score of each instance in each focus picture; wherein each quality score is obtained by jointly evaluating the semantic confidence and the sharpness of the instance semantic image region of each instance in each focus picture.
8. The method of claim 7, wherein the sharpness comprises an evaluation of differences in pixel values of pixels in the extracted example semantic image region at different viewing angles.
9. The method of claim 8, wherein the different views comprise other views to which the preset view is closest.
10. The method of claim 1, wherein said propagating semantic information of instances in said target focal spot picture set to said original lightfield comprises:
associating and storing depth information and semantic information of an instance of each target focus picture set to a semantic representation set, wherein the semantic information is associated with a semantic mask;
and re-projecting the semantic mask corresponding to each instance semantic image area in each target focus image set to the corresponding instance under the target view angle according to the depth information.
11. The method of claim 10, wherein the re-projecting comprises: and for each current pixel in the target view angle, finding the corresponding pixel on the target focus picture set with different depths in the semantic representation set under the preset view angle, and selecting the semantic mask value of the pixel on the target focus picture with the minimum depth information and the semantic not belonging to the background classification as the semantic mask value of the current pixel.
12. A computer device, comprising: one or more memories for storing computer program instructions; one or more processors configured to execute the computer program instructions to perform operations comprising:
establishing a focal image stack under a preset visual angle in an original light field according to the original light field data; wherein the focal image stack is composed of a set of focal images of different depths;
analyzing semantic information of each instance contained in each focus picture in the focus image stack to form a semantic focus image stack;
according to the quality of different focus pictures belonging to the same instance in the semantic focus image stack, selecting a target focus picture with the highest quality to represent the instance so as to form a target focus picture set;
propagating semantic information of instances in the target focus picture set to the original light field.
13. The apparatus of claim 12, wherein the raw light field data is derived from one or more image sets captured by the camera array corresponding to a scene at one or more time points simultaneously.
14. The apparatus of claim 12, wherein the analyzing semantic information for each instance contained in each focal spot image in the focal image stack comprises:
and performing instance segmentation on each focus picture to obtain an instance semantic image region and a corresponding semantic confidence of each instance.
15. The apparatus of claim 14, wherein the instance segmentation method comprises: Mask R-CNN, SDS, HyperColumns, CFM, DeepMask & SharpMask, MNC, ISFCN, FCIS, SIS, and PAN.
16. The apparatus of claim 12, wherein judging that the instances in different focus pictures belong to the same object comprises:
and performing cluster analysis on the categories of the objects to which the instances belong, according to the similarity between the bounding boxes of each same instance in different focus pictures and the depth difference between the different focus pictures in which each same instance is located.
17. The apparatus according to claim 16, wherein said selecting a target focal point image with the highest quality to represent said instance according to the quality of different focal point images belonging to the same instance in said semantic focal point image stack comprises:
acquiring the quality score of each instance in each focus picture; wherein each quality score is obtained by jointly evaluating the semantic confidence and the sharpness of the instance semantic image region of each instance in each focus picture.
18. The apparatus of claim 17, wherein the sharpness comprises an evaluation of differences in pixel values of pixels in the extracted example semantic image region at different viewing angles.
19. The apparatus of claim 12, wherein said propagating semantic information for instances in the target focal spot picture set to the original lightfield comprises:
associating and storing depth information and semantic information of an instance of each target focus picture set to a semantic representation set, wherein the semantic information is associated with a semantic mask;
and re-projecting the semantic mask corresponding to each instance semantic image area in each target focus image set to the corresponding instance under the target view angle according to the depth information.
20. The apparatus of claim 19, wherein the re-projecting comprises: and for each current pixel in the target view angle, finding the corresponding pixel on the target focus picture set with different depths in the semantic representation set under the preset view angle, and selecting the semantic mask value of the pixel on the target focus picture with the minimum depth information and the semantic not belonging to the background classification as the semantic mask value of the current pixel.
21. A non-transitory computer storage medium storing computer program instructions that, when executed by one or more processors, perform operations comprising:
establishing a focal image stack under a preset visual angle in an original light field according to the original light field data; wherein the focal image stack is composed of a set of focal images of different depths;
analyzing semantic information of each instance contained in each focus picture in the focus image stack to form a semantic focus image stack;
and according to the quality of different focus pictures belonging to the same instance in the semantic focus image stack, selecting a target focus picture with the highest quality to represent the instance so as to form a target focus picture set.
CN201910360375.1A 2019-04-30 2019-04-30 Image processing method, computer device and storage medium based on light field semantics Active CN111862106B (en)

Publications (2)

Publication Number Publication Date
CN111862106A true CN111862106A (en) 2020-10-30
CN111862106B CN111862106B (en) 2023-09-29



Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105913070A (en) * 2016-04-29 2016-08-31 合肥工业大学 Multi-thread significance method based on light field camera
EP3099054A1 (en) * 2015-05-29 2016-11-30 Thomson Licensing Method and apparatus for determining a focal stack of images from light field data associated with a scene, and corresponding computer program product
CN107637061A (en) * 2015-04-27 2018-01-26 汤姆逊许可公司 The method and apparatus for handling light field content
CN107958460A (en) * 2016-10-18 2018-04-24 奥多比公司 Instance-level semantic segmentation system

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
CANER HAZIRBAS et al.: "Deep Depth From Focus", arXiv, pp. 1-23
JUN ZHANG et al.: "Saliency Detection on Light Field: A Multi-Cue Approach", ACM Transactions on Multimedia Computing, Communications, and Applications, vol. 13, no. 3, pp. 1-22, XP055734578, DOI: 10.1145/3107956
胡良梅; 姬长动; 张旭东; 张骏; 王丽娟: "Focusness detection and color-information-guided depth extraction for light field images", Journal of Image and Graphics, vol. 21, no. 02, pp. 155-164
陈怡良; 柯天星; 姜鑫; 彭晶: "Research on all-in-focus image generation based on super-resolution focal-stack light field imaging", Computer Knowledge and Technology, vol. 14, no. 22, pp. 139-142

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113793422A (en) * 2021-08-13 2021-12-14 深圳安泰创新科技股份有限公司 Display control method of three-dimensional model, electronic device and readable storage medium
CN113793422B (en) * 2021-08-13 2024-02-23 深圳安泰创新科技股份有限公司 Display control method for three-dimensional model, electronic device and readable storage medium



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant