WO2020125495A1

WO2020125495A1 - Panoramic segmentation method, apparatus and device

Info

Publication number: WO2020125495A1
Application number: PCT/CN2019/124334
Authority: WO
Inventors: 张维桐; 张锲石; 程俊
Original assignee: 中国科学院深圳先进技术研究院
Priority date: 2018-12-17
Filing date: 2019-12-10
Publication date: 2020-06-25
Also published as: CN109801307A

Abstract

Disclosed is a panoramic segmentation method, comprising: acquiring an original image to be segmented; carrying out semantic segmentation on the original image, and by means of a measured distance learning method for an embedded space, carrying out example segmentation on the original image; taking a target and a background obtained by means of example segmentation as examples, and carrying out guidance by means of a semantic segmentation output image, such that the centers of examples of the embedded space repel each other, and pixels within an example range are attracted to the centers of the examples to segment the image; and using a clustering loss function to further distinguish between different examples in order to obtain a panoramic segmentation result. An embedded space clustering operation is performed based on semantic segmentation of the image to process all the pixels of the image, and the different examples are further distinguished by means of the loss function, thereby realizing an end-to-end panoramic segmentation network framework.

Description

Panorama segmentation method, device and equipment

Technical field

The present application belongs to the field of image processing, and particularly relates to a panoramic segmentation method, device and equipment.

Background technique

Panoramic segmentation, as an emerging field, has important application value and development prospects in many fields, such as security control, industrial robot applications, and automobile assisted driving. However, individual instances with different shapes, complex and diverse background environments, dynamically changing scenes between pedestrians and perspectives, strict requirements for real-time and stability of the system, etc. Semantic segmentation and instance segmentation can no longer meet the current needs, panoramic segmentation should be applied It is born to supplement and solve the current problems, and poses great challenges to the panoramic segmentation problem.

The current deep-learning-based panoramic segmentation methods mostly rely on the selection of candidate frame regions for measurement. There is no way to identify and segment all pixels or shared pixels, and the current panoramic segmentation method is usually a combination of multiple sub-networks. Frame.

technical problem

In view of this, embodiments of the present application provide a panoramic segmentation method, device, and equipment to solve the panoramic segmentation method in the prior art, which cannot identify and segment all pixels or shared pixels, and cannot achieve an end-to-end frame problem.

Technical solution

A first aspect of an embodiment of the present application provides a panoramic segmentation method. The panoramic segmentation method includes:

Obtain the original image to be segmented;

Performing semantic segmentation on the original image, and performing instance segmentation on the original image through a distance learning method of embedded space;

Take the target and background obtained by instance segmentation as examples, and guide through the output graph of semantic segmentation, so that the centers between the instances embedded in the space are mutually exclusive, and the pixels within the instance range are attracted to the center of the instance to segment the image;

The clustering loss function is used to further distinguish different instances, and the panoramic segmentation result is obtained.

With reference to the first aspect, in a first possible implementation manner of the first aspect, the step of semantically segmenting the original image includes:

The fully convolutional structure of the fully connected layer based on the VGG model is used as the skeleton framework, and the conditional random recursive neural network containing closely connected pairs is randomly selected as the final layer of the model to semantically segment the original image.

With reference to the first aspect or the first possible implementation manner of the first aspect, in a second possible implementation manner of the first aspect, the target and background obtained by segmenting an instance are used as examples, and guided by a semantic segmentation output map, To make the centers of the instances embedded in the space mutually exclusive, the pixels within the range of the instance are attracted to the center of the instance, and the steps of segmenting the image include:

Use the target and background obtained by segmenting the instance as an example to determine the center point of the instance;

According to the preset radius of repulsive force of the instance, the centers of the instances in the embedded space are mutually repelled, and the pixels of the instance are clustered according to the preset radius of attractive force.

With reference to the first aspect, in a third possible implementation manner of the first aspect, the clustering loss function is:

L=α·L _pull +β·L _push +γ·L _nor +θ·L _seg

among them:

S represents the number of calibrated clusters in the standard data, E _s represents all the elements contained in the cluster S, x _i represents the embedding space, μ represents all the clustering centers of S, |||| represents the depth space Distance, η _pull and η _push respectively represent the edge threshold of gravity and repulsion in the embedded space, N _s represents the number of pixels included in the S clustering instance, and α, β, γ, θ are adjustment parameters.

With reference to the first aspect, in a fourth possible implementation manner of the first aspect, the target and background obtained from the instance segmentation are taken as examples, and the output graph of the semantic segmentation is used to guide, so that the centers between the instances embedded in the space are mutually exclusive , The pixels within the scope of the instance are attracted to the center of the instance, and the steps of segmenting the image include:

The semantic calibration and mask in the original picture are generated according to the semantic segmentation, the instance of the multi-dimensional pixel embedding is generated by the instance segmentation of the embedding space, and the cluster fusion is performed by the depth metric space, and the aggregated segmented image is output.

A second aspect of an embodiment of the present application provides a panoramic segmentation device, the panoramic segmentation device includes:

The original image acquisition unit is used to acquire the original image to be segmented;

A segmentation unit, used for semantic segmentation of the original image, and for instance segmentation of the original image through a distance learning method of embedded space;

The fusion unit is used to take the target and background obtained by instance segmentation as an instance, and guide through the semantic segmentation output map, so that the centers between the instances embedded in the space are mutually exclusive, and the pixels within the instance range are attracted to the center of the instance to segment the image ;

The loss training unit is used to further distinguish different instances by using a clustering loss function to obtain a panoramic segmentation result.

With reference to the second aspect, in a first possible implementation manner of the second aspect, the fusion unit includes:

The instance determination subunit is used to determine the center point of the instance by using the target and background obtained from the instance segmentation as the instance;

The clustering unit is used to repel each instance center of the embedded space according to a preset radius of repulsive force of the instance, and to cluster pixels of the instance according to a preset radius of attractive force.

With reference to the second aspect, in a second possible implementation manner of the second aspect, the clustering loss function is:

L=α·L _pull +β·L _push +γ·L _nor +θ·L _seg

among them:

A third aspect of the embodiments of the present application provides a panoramic segmentation device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor executes the computer program To realize the steps of the panoramic segmentation method according to any one of the first aspect.

A fourth aspect of the embodiments of the present application provides a computer-readable storage medium that stores a computer program, and when the computer program is executed by a processor, the panoramic view described in any one of the first aspect is implemented The steps of the segmentation method.

Beneficial effect

Compared with the prior art, the embodiments of the present application have the following beneficial effects: after obtaining the original image to be segmented, the original image is semantically segmented based on the semantic segmented image, and the original image is measured by the metric distance learning method of the embedded space Semantic segmentation, embedding spatial clustering operations based on semantic segmentation images, can process all pixels of the image, and further distinguish different instances through the loss function, thus achieving an end-to-end panoramic segmentation network framework.

BRIEF DESCRIPTION

In order to more clearly explain the technical solutions in the embodiments of the present application, the following will briefly introduce the drawings used in the embodiments or the description of the prior art. Obviously, the drawings in the following description are only for the application In some embodiments, for those of ordinary skill in the art, without paying creative labor, other drawings may be obtained based on these drawings.

1 is a schematic diagram of an implementation process of a panoramic segmentation method provided by an embodiment of the present application;

2 is a schematic structural diagram of a panoramic segmentation system provided by an embodiment of the present application;

3 is a schematic structural diagram of an example of an embedded space provided by an embodiment of the present application;

4 is a schematic diagram of a panoramic segmentation device provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of a panoramic segmentation device provided by an embodiment of the present application.

Embodiments of the invention

In the following description, for the purpose of illustration rather than limitation, specific details such as specific system structures and technologies are proposed to thoroughly understand the embodiments of the present application. However, those skilled in the art should understand that the present application can also be implemented in other embodiments without these specific details. In other cases, detailed descriptions of well-known systems, devices, circuits, and methods are omitted to avoid unnecessary details hindering the description of the present application.

In order to explain the technical solutions described in the present application, the following will be described with specific embodiments.

FIG. 1 is a schematic diagram of an implementation process of a panoramic segmentation method provided by an embodiment of the present application, and details are as follows:

In step S101, obtain the original image to be divided;

The original image to be divided may be a single picture or an image sequence in a video.

In step S102, perform semantic segmentation on the original image, and perform instance segmentation on the original image by using the metric distance learning method of the embedded space;

Semantic segmentation is used to group or categorize all pixels in the picture according to the semantic meaning expressed.

In this application, FCN based on VGG model (full name in English: fully connected layer, full name in Chinese is fully connected layer) can be used as a skeleton framework to include conditional random fields of recurrent neural networks with closely connected pairs As the final layer of the model.

Based on this, the fully convolutional structure and design of the fully connected layer based on the VGG model can further improve the segmentation quality of the semantic segmentation pixel level, and calibrate the pixel standard between detection and output. It is worth noting that the whole process is a field process that can be deduced differentially. This application adopts the output of this process as an identifier for instance detection in the next stage.

In order to further analyze the semantics of the target and background obtained by semantic segmentation, we designed a second branch for instance segmentation that introduces embedded space. At present, mainstream methods are based on segmentation and recognition after candidate frame detection. Then this method of segmentation and recognition based on candidate frame detection is not suitable for the panoramic segmentation task of this application, which is the original intention of this application for improvement and the defects concerned. Therefore, we adopt the metric distance learning method based on embedding space, so that it is not only easy to embed standard feed-forward network, but also can achieve end-to-end frame application.

In step S103, the target and background obtained by instance segmentation are used as an instance, and the output of the semantic segmentation is used to guide the center between the instances in the embedded space. The pixels within the instance range are attracted to the center of the instance to segment the image. ;

The semantic calibration and mask in the original image are generated through semantic segmentation, and the N-dimensional pixel embedding is generated through the instance segmentation processing in the embedding space, so that the instance segmentation can match the output of the semantic segmentation well. Highly clustered segmented images.

In this application, we can regard the target and background obtained by semantic segmentation as an instance. Under the guidance of the output graph of semantic segmentation, we need to achieve two goals between and within each instance of the embedded space:

(1) Inter-instance repulsive force: The centers between the instances embedded in the space are mutually repulsive.

(2) Attraction within an instance: attract the relevant pixel embedding point to the center of the instance within the scope of an instance

Correspondingly, we set the threshold of the action distance for the repulsive force and attractive force, that is, the repulsive force radius and attractive force radius of the instance. As shown in FIG. 2, after determining the center point of the instance, by setting the radius of the repulsive force of the instance, when the distance between the center points of the instances embedded in the space is less than twice the repulsive radius, the two The instance centers are mutually exclusive, making the instance segmentation more accurate; when the distance between the pixels embedded in the space and the center point of the instance is less than the attractive radius, the pixels will be clustered by the attractive force.

In this way, the central point of the instance in the embedded space does not cause too much attraction to pixels in other central domains in the range of action, and sufficient repulsion is ensured between multiple central points, so that there will not be too much or too little. Negative effect. In addition, this can ensure that the embedded pixels are as close as possible to the center without independent points. Appropriate restraint and relaxation achieve the clustering effect in the embedded space as shown in Figure 2. Through multiple iterations, a clustering algorithm is used to obtain pixel-level segmentation, and a panoramic segmentation method based on semantic overlay examples is achieved.

In step S104, a clustering loss function is used to further distinguish different instances to obtain a panoramic segmentation result.

Integrating multi-tasks and modules into an end-to-end framework, when implementing the task of panoramic segmentation, the joint loss needs to be calculated. By using the clustering loss function, the branch of the instance can be better trained. Therefore, our loss function can focus on the instance segmentation and embedding space. The loss function may be:

L=α·L _pull +β·L _push +γ·L _nor +θ·L _seg

among them:

In addition, we can set up a regularization process to ensure that the iterative calculation will not exceed the space too much, and use the common cross entropy calculation loss in the divided part, which can achieve good results. If the repulsion radius threshold is greater than or equal to 5 times the gravitational radius threshold, the iterative process can be ended, and the iterative efficiency of the system can be improved. As a preferred embodiment of the present application, through random gradient descent, α and β are 1, γ is 0.001, and θ is 0.7.

It should be understood that the size of the sequence numbers of the steps in the above embodiments does not mean the order of execution, and the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.

FIG. 3 is a schematic diagram of a panoramic segmentation framework provided by an embodiment of the present application. As shown in FIG. 3, after inputting the original image to be segmented into an end-to-end network framework, the end-to-end network framework The original image is shared and decoded. The decoded data is divided into two branches for processing. The lower semantic segmentation branch is used to train the semantic calibration and mask in the generated picture. The upper branch is divided by the instance of the embedded space to generate N( N is a natural number)-dimensional pixel embedding, so that the instance segmentation branch can match the output of semantic segmentation well. Then the two branches are merged, and under the guidance of the semantic segmentation output graph, each instance of the embedded space is subjected to repulsive iteration, and the iterative calculation of attraction within the instance is obtained to obtain pixel-level segmentation and achieve a panoramic view based on semantic overlay instances segmentation.

Then, through the set loss function, further distinguish different examples, realize the end-to-end panoramic classification network framework, and output panoramic segmented images.

FIG. 4 is a schematic structural diagram of a panoramic segmentation device according to an embodiment of the present application. Details are as follows:

The panoramic segmentation device includes:

The original image obtaining unit 401 is used to obtain an original image to be divided;

A segmentation unit 402 is used to perform semantic segmentation on the original image, and perform instance segmentation on the original image by using the metric distance learning method of the embedded space;

The fusion unit 403 is used to take the target and background obtained by instance segmentation as examples, and guide through the semantic segmentation output map, so that the centers between the instances embedded in the space are mutually exclusive, and the pixels within the range of the instance are attracted to the center of the instance to perform the image segmentation;

The loss training unit 404 is used to further distinguish different instances by using a clustering loss function to obtain a panoramic segmentation result.

Preferably, the fusion unit includes:

Preferably, the clustering loss function is:

L=α·L _pull +β·L _push +γ·L _nor +θ·L _seg

among them:

The panoramic segmentation device shown in FIG. 4 corresponds to the panoramic segmentation method shown in FIG. 1.

FIG. 5 is a schematic diagram of a panoramic segmentation device provided by an embodiment of the present application. As shown in FIG. 5, the panoramic segmentation device 5 of this embodiment includes: a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50, for example, a panoramic segmentation program. When the processor 50 executes the computer program 52, the steps in the foregoing embodiments of the panoramic segmentation method are implemented. Alternatively, when the processor 50 executes the computer program 52, the functions of the modules/units in the foregoing device embodiments are realized.

Exemplarily, the computer program 52 may be divided into one or more modules/units, and the one or more modules/units are stored in the memory 51 and executed by the processor 50 to complete This application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments are used to describe the execution process of the computer program 52 in the panoramic segmentation device 5. For example, the computer program 52 may be divided into:

The panoramic segmentation device 5 may be a computing device such as a desktop computer, a notebook, a palmtop computer and a cloud server. The panoramic segmentation device may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art may understand that FIG. 5 is only an example of the panoramic segmentation device 5 and does not constitute a limitation on the panoramic segmentation device 5, and may include more or fewer components than the illustration, or a combination of certain components, or different Components, for example, the panoramic segmentation device may also include an input and output device, a network access device, a bus, and the like.

The so-called processor 50 may be a central processing unit (Central Processing Unit, CPU), or other general-purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), Ready-made programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc. The general-purpose processor may be a microprocessor or the processor may be any conventional processor or the like.

The memory 51 may be an internal storage unit of the panoramic segmentation device 5, such as a hard disk or a memory of the panoramic segmentation device 5. The memory 51 may also be an external storage device of the panoramic segmentation device 5, for example, a plug-in hard disk equipped on the panoramic segmentation device 5, a smart memory card (Smart, Media, Card, SMC), and a secure digital (Secure Digital, SD) card, flash memory card (Flash Card), etc. Further, the memory 51 may also include both an internal storage unit of the panoramic segmentation device 5 and an external storage device. The memory 51 is used to store the computer program and other programs and data required by the panoramic segmentation device. The memory 51 can also be used to temporarily store data that has been or will be output.

Those skilled in the art can clearly understand that, for convenience and conciseness of description, only the above-mentioned division of each functional unit and module is used as an example for illustration. In practical applications, the above-mentioned functions may be allocated by different functional units, Module completion means that the internal structure of the device is divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiment may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may use hardware It can also be implemented in the form of software functional units. In addition, the specific names of each functional unit and module are only for the purpose of distinguishing each other, and are not used to limit the protection scope of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which will not be repeated here.

In the above embodiments, the description of each embodiment has its own emphasis. For a part that is not detailed or recorded in an embodiment, you can refer to the related descriptions of other embodiments.

Those of ordinary skill in the art may realize that the units and algorithm steps of the examples described in conjunction with the embodiments disclosed herein can be implemented by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed in hardware or software depends on the specific application of the technical solution and design constraints. Professional technicians can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

In the embodiments provided in this application, it should be understood that the disclosed device/terminal device and method may be implemented in other ways. For example, the device/terminal device embodiments described above are only schematic. For example, the division of the module or unit is only a logical function division, and in actual implementation, there may be another division manner, such as multiple units Or components can be combined or integrated into another system, or some features can be ignored or not implemented. In addition, the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, they may be located in one place, or may be distributed on multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit can be implemented in the form of hardware or software function unit.

If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the present application can implement all or part of the processes in the methods of the above embodiments, or it can be completed by a computer program instructing relevant hardware. The computer program can be stored in a computer-readable storage medium. When the program is executed by the processor, the steps of the foregoing method embodiments may be implemented. . Wherein, the computer program includes computer program code, and the computer program code may be in a source code form, an object code form, an executable file, or some intermediate form, etc. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a mobile hard disk, a magnetic disk, an optical disc, a computer memory, and a read-only memory (ROM, Read-Only Memory) , Random Access Memory (RAM, Random Access Memory), electrical carrier signals, telecommunications signals and software distribution media, etc. It should be noted that the content contained in the computer-readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in jurisdictions. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media Excluded are electrical carrier signals and telecommunications signals. The above-mentioned embodiments are only used to illustrate the technical solutions of the present application, not to limit them; although the present application has been described in detail with reference to the foregoing embodiments, persons of ordinary skill in the art should understand that they can still implement the foregoing The technical solutions described in the examples are modified, or some of the technical features are equivalently replaced; and these modifications or replacements do not deviate from the spirit and scope of the technical solutions of the embodiments of the present application. Within the scope of protection of this application.

Claims

A panoramic segmentation method, characterized in that the panoramic segmentation method includes:

Obtain the original image to be segmented;

Performing semantic segmentation on the original image, and performing instance segmentation on the original image through a distance learning method of embedded space;

Take the target and background obtained by instance segmentation as examples, and guide through the output graph of semantic segmentation, so that the centers between the instances embedded in the space are mutually exclusive, and the pixels within the instance range are attracted to the center of the instance to segment the image;

The clustering loss function is used to further distinguish different instances, and the panoramic segmentation result is obtained.
The panoramic segmentation method according to claim 1, wherein the step of performing semantic segmentation on the original image includes:

The fully convolutional structure of the fully connected layer based on the VGG model is used as the skeleton framework, and the conditional random recursive neural network containing closely connected pairs is randomly selected as the final layer of the model to semantically segment the original image.
The panoramic segmentation method according to claim 1 or 2, wherein the target and background obtained by segmenting an instance are used as an instance, guided by a semantic segmentation output image, so that the centers between the instances embedded in the space are mutually exclusive, The pixels within the scope of the instance are attracted to the center of the instance, and the steps of segmenting the image include:

Use the target and background obtained by segmenting the instance as an example to determine the center point of the instance;

According to the preset radius of repulsive force of the instance, the centers of the instances in the embedded space are mutually repelled, and the pixels of the instance are clustered according to the preset radius of attractive force.
The panoramic segmentation method according to claim 1, wherein the clustering loss function is:

L=α·L pull +β·L push +γ·L nor +θ·L seg

among them:

S represents the number of calibrated clusters in the standard data, E s represents all the elements contained in the cluster S, x i represents the embedding space, μ represents all the clustering centers of S, |||| represents the depth space Distance, η pull and η push respectively represent the edge threshold of gravity and repulsion in the embedded space, N s represents the number of pixels included in the S clustering instance, and α, β, γ, θ are adjustment parameters.
The panoramic segmentation method according to claim 1, characterized in that the target and background obtained by segmenting an instance are used as an instance, and guided by a semantic segmentation output image, so that the centers between the instances embedded in the space are mutually exclusive, and the instance range The pixels inside are attracted to the center of the instance, and the steps to segment the image include:

The semantic calibration and mask in the original picture are generated according to the semantic segmentation, the instance of the multi-dimensional pixel embedding is generated by the instance segmentation of the embedding space, and the cluster fusion is performed by the depth metric space, and the aggregated segmented image is output.
A panoramic segmentation device, characterized in that the panoramic segmentation device includes:

The original image acquisition unit is used to acquire the original image to be segmented;

A segmentation unit, used for semantic segmentation of the original image, and for instance segmentation of the original image through a distance learning method of embedded space;

The fusion unit is used to take the target and background obtained by instance segmentation as an instance, and guide through the semantic segmentation output map, so that the centers between the instances embedded in the space are mutually exclusive, and the pixels within the instance range are attracted to the center of the instance to segment the image ;

The loss training unit is used to further distinguish different instances by using a clustering loss function to obtain a panoramic segmentation result.
The panoramic segmentation device according to claim 6, wherein the fusion unit comprises:

The instance determination subunit is used to determine the center point of the instance by using the target and background obtained by the instance segmentation as the instance;

The clustering unit is used to repel each instance center of the embedded space according to a preset radius of repulsive force of the instance, and to cluster pixels of the instance according to a preset radius of attractive force.
The panoramic segmentation device according to claim 6, wherein the clustering loss function is:

L=α·L pull +β·L push +γ·L nor +θ·L seg

among them:

S represents the number of calibrated clusters in the standard data, E s represents all the elements contained in the cluster S, x i represents the embedding space, μ represents all the clustering centers of S, |||| represents the depth space Distance, η pull and η push respectively represent the edge threshold of gravity and repulsion in the embedded space, N s represents the number of pixels included in the S clustering instance, and α, β, γ, θ are adjustment parameters.
A panoramic segmentation device, including a memory, a processor, and a computer program stored in the memory and runable on the processor, characterized in that, when the processor executes the computer program, it is implemented as claimed in claim 1. To any one of the steps of the panoramic segmentation method described in any one of 5.
A computer-readable storage medium storing a computer program, characterized in that, when the computer program is executed by a processor, the steps of the panoramic segmentation method according to any one of claims 1 to 5 are implemented .