CN112348828A - Instance segmentation method and device based on a neural network, and storage medium


Info

Publication number: CN112348828A
Application number: CN202011166214.8A
Authority: CN (China)
Prior art keywords: target, preset, picture, network, neural network
Legal status: Pending (the status listed is an assumption, not a legal conclusion)
Other languages: Chinese (zh)
Inventors: Su Hao (苏浩), Pan Wu (潘武), Zhang Xiaofeng (张小锋), Huang Peng (黄鹏), Hu Bin (胡彬), Lin Fengxiao (林封笑)
Current assignee: Zhejiang Dahua Technology Co Ltd
Original assignee: Zhejiang Dahua Technology Co Ltd
Application filed by Zhejiang Dahua Technology Co Ltd
Priority to CN202011166214.8A


Classifications

    • G06T 7/11 Region-based segmentation (G06T 7/00 Image analysis; G06T 7/10 Segmentation; Edge detection)
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting (G06F 18/00 Pattern recognition; G06F 18/21 Design or setup of recognition systems)
    • G06N 3/045 Combinations of networks (G06N 3/02 Neural networks; G06N 3/04 Architecture, e.g. interconnection topology)
    • G06N 3/08 Learning methods (G06N 3/02 Neural networks)
    • G06T 5/00 Image enhancement or restoration
    • G06T 2207/10016 Video; Image sequence (G06T 2207/10 Image acquisition modality)

Abstract

The invention discloses a neural-network-based instance segmentation method, device, and storage medium. The method comprises: acquiring a target picture from a video stream; inputting the target picture into a target instance segmentation neural network and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used to acquire parameters of instance bounding boxes, the feature map processing layer processes the bounding box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters; determining similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set; and determining the instances among the similar instances that exceed a first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture. This solves the technical problem of slow instance segmentation in the prior art.

Description

Instance segmentation method and device based on a neural network, and storage medium
Technical Field
The invention relates to the technical field of image processing, and in particular to an instance segmentation method and device based on a neural network, and a storage medium.
Background
When processing an image, it is often necessary to locate and distinguish the various instances contained in the picture. For example, target detection frames the different instances with bounding boxes, while semantic segmentation marks, pixel by pixel, the regions occupied by instances of different categories, thereby distinguishing those categories. If instances within the same category must be further distinguished, instance segmentation is applied to the image: instance segmentation distinguishes not only the categories present in an image but also the different instances within the same category.
Existing candidate-region-based instance segmentation architectures perform instance segmentation on a picture through a prediction network with N cascaded stages to obtain the segmentation result directly. The cascade improves the accuracy of instance segmentation but greatly reduces inference speed, so speed and accuracy are not balanced.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiments of the invention provide an instance segmentation method and device based on a neural network, and a storage medium, which at least solve the technical problem of slow instance segmentation in the prior art.
According to an aspect of the embodiments of the present invention, there is provided a neural-network-based instance segmentation method, comprising: acquiring a target picture from a video stream; inputting the target picture into a target instance segmentation neural network and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used to acquire parameters of instance bounding boxes, the feature map processing layer processes the bounding box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters; determining similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set; and determining the instances among the similar instances that exceed a first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture.
Optionally, before the target picture is input into the target instance segmentation neural network and the first instance set is output, the method comprises: acquiring a sample picture set from a video stream; annotating the target object in each picture of the sample picture set to obtain a target data set; inputting the annotated data set into a preset instance segmentation neural network, wherein the preset neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the detection network is used to acquire parameters of instance bounding boxes in the sample pictures, the feature map processing layer processes those parameters to obtain preset target parameters, the mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function; and determining the instance segmentation neural network when the target loss function satisfies a predetermined condition.
Optionally, annotating the target object in each picture of the sample picture set to obtain a target data set comprises: applying the standard instance-segmentation data augmentation techniques to each picture and its annotation result in the sample picture set to obtain the target data set.
Optionally, after annotating the target object in each picture of the sample picture set to obtain a target data set, the method further comprises: dividing the target data set into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate it, and the test set is used to test the preset neural network segmentation model.
Optionally, before the annotated data set is input into the preset instance segmentation neural network, the method further comprises: constructing an initialized detection network, wherein the detection network comprises a feature-extraction backbone network, a feature-enhancement network, and a detection head; the backbone network extracts features from the instances in each picture of the sample picture set to obtain feature maps, the feature-enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input to the detection head to obtain the parameters of the sample instance bounding boxes; and constructing the preset instance segmentation neural network from the initialized detection network, a preset feature map processing layer, and a preset mask processing layer, wherein the preset feature map processing layer processes the parameters of the sample instance bounding boxes to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
According to another aspect of the embodiments of the present invention, there is also provided a neural-network-based instance segmentation apparatus, comprising: a first acquisition unit for acquiring a target picture from a video stream; an output unit for inputting the target picture into a target instance segmentation neural network and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer, the detection network being used to acquire parameters of instance bounding boxes, the feature map processing layer processing the bounding box parameters to obtain target parameters, and the mask processing layer performing instance segmentation on the target picture according to the target parameters; a first determining unit for determining similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set; and a second determining unit for determining the instances among the similar instances that exceed a first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture.
Optionally, the apparatus comprises: a second acquisition unit for acquiring a sample picture set from a video stream before the target picture is input into the target instance segmentation neural network and the first instance set is output; an obtaining unit for annotating the target object in each picture of the sample picture set to obtain a target data set; an input unit for inputting the annotated data set into a preset instance segmentation neural network, wherein the preset neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the detection network is used to acquire parameters of instance bounding boxes in the sample pictures, the feature map processing layer processes those parameters to obtain preset target parameters, the mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function; and a third determining unit for determining the instance segmentation neural network when the target loss function satisfies a predetermined condition.
Optionally, the obtaining unit comprises: an obtaining module for applying the standard instance-segmentation data augmentation techniques to each picture and its annotation result in the sample picture set to obtain the target data set.
Optionally, the apparatus further comprises: a dividing unit for, after the target object in each picture of the sample picture set has been annotated to obtain a target data set, dividing the target data set into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate it, and the test set is used to test the preset neural network segmentation model.
Optionally, the apparatus further comprises: a first construction unit for constructing an initialized detection network before the annotated data set is input into the preset instance segmentation neural network, wherein the detection network comprises a feature-extraction backbone network, a feature-enhancement network, and a detection head; the backbone network extracts features from the instances in each picture of the sample picture set to obtain feature maps, the feature-enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input to the detection head to obtain the parameters of the sample instance bounding boxes; and a second construction unit for constructing the preset instance segmentation neural network from the initialized detection network, a preset feature map processing layer, and a preset mask processing layer, wherein the preset feature map processing layer processes the parameters of the sample instance bounding boxes to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
According to a further aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium storing a computer program, wherein the computer program is configured to execute the above neural-network-based instance segmentation method when run.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device comprising a memory and a processor, wherein the memory stores a computer program and the processor is configured to execute the above neural-network-based instance segmentation method through the computer program.
In the embodiments of the invention, a target picture is acquired from a video stream; the target picture is input into a target instance segmentation neural network and a first instance set is output, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer, the detection network acquiring parameters of instance bounding boxes, the feature map processing layer processing the bounding box parameters to obtain target parameters, and the mask processing layer performing instance segmentation on the target picture according to the target parameters; similar instances of each target instance in the first instance set are determined according to the degree of overlap between target instances; and the instances among the similar instances that exceed a first predetermined threshold are determined, yielding at least one instance picture of the target instance in the target picture. Segmenting the target picture with an instance segmentation neural network that has a detection network, a feature map processing layer, and a mask processing layer, and then determining the target instances by thresholding the segmentation result, achieves fast and accurate segmentation and thereby solves the technical problem of slow instance segmentation in the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic diagram of an application environment of an alternative neural-network-based instance segmentation method according to an embodiment of the present invention;
FIG. 2 is a flowchart of an alternative neural-network-based instance segmentation method according to an embodiment of the present invention;
FIG. 3 is a flowchart of an alternative instance segmentation method according to an embodiment of the present invention;
FIG. 4 is a structural diagram of an alternative instance segmentation network according to an embodiment of the present invention;
FIG. 5 is a structural diagram of an alternative mask processing layer according to an embodiment of the present invention;
FIG. 6 is a structural diagram of an alternative neural-network-based instance segmentation apparatus according to an embodiment of the present invention;
FIG. 7 is a schematic structural diagram of an alternative electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
According to an aspect of the embodiments of the present invention, a neural-network-based instance segmentation method is provided. Optionally, as an optional implementation, the method may be applied, but is not limited, to the environment shown in FIG. 1, which includes a terminal device 102, a network 104, and a server 106.
Optionally, the neural-network-based instance segmentation method may be executed by the terminal device 102, by the server 106, or by the terminal device 102 and the server 106 together.
The following describes the method as executed by the server 106.
The server 106 acquires a target picture from the video stream; inputs the target picture into a target instance segmentation neural network and outputs a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer, the detection network acquiring parameters of instance bounding boxes, the feature map processing layer processing the bounding box parameters to obtain target parameters, and the mask processing layer performing instance segmentation on the target picture according to the target parameters; determines similar instances of each target instance in the first instance set according to the degree of overlap between target instances; and determines the instances among the similar instances that exceed a first predetermined threshold, obtaining at least one instance picture of the target instance in the target picture. Segmenting the target picture with such a network and thresholding the segmentation result to determine the target instances is both fast and accurate and solves the technical problem of slow instance segmentation in the prior art.
Optionally, in this embodiment, the terminal device 102 may be a terminal device configured with a target client and may include, but is not limited to, at least one of: a mobile phone (such as an Android or iOS phone), a notebook computer, a tablet computer, a palmtop computer, a MID (Mobile Internet Device), a PAD, a desktop computer, a smart television, etc. The target client may be a video client, an instant messaging client, a browser client, an educational client, etc. The network 104 may include, but is not limited to, a wired network or a wireless network, where the wired network includes local area networks, metropolitan area networks, and wide area networks, and the wireless network includes Bluetooth, WiFi, and other networks that enable wireless communication. The server may be a single server, a server cluster composed of multiple servers, or a cloud server. The above is merely an example, and this embodiment is not limited thereto.
Optionally, as an optional implementation, as shown in FIG. 2, the neural-network-based instance segmentation method includes:
Step S202, acquiring a target picture from the video stream.
Step S204, inputting the target picture into a target instance segmentation neural network and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used to acquire parameters of instance bounding boxes, the feature map processing layer processes the bounding box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters.
Step S206, determining similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set.
Step S208, determining the instances among the similar instances that exceed a first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture.
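To make the claimed flow concrete, the following is a minimal Python sketch of steps S202 to S208. It is illustrative only: the model interface, the dictionary layout of its outputs, and the use of IoU as the "degree of overlap" are all assumptions, since the patent publishes no code.

```python
import cv2  # assumed available; used only for the decoding example below

def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0

def segment_frame(frame, model, overlap_thr=0.5, score_thr=0.5):
    # S204: detection network -> feature-map processing layer -> mask layer.
    first_set = model(frame)  # assumed: list of {"box", "mask", "score"} dicts
    kept, seen = [], set()
    for target in first_set:
        # S206: similar instances = those whose boxes overlap the target instance.
        similar = [o for o in first_set if o is not target
                   and iou(o["box"], target["box"]) > overlap_thr]
        # S208: of the similar instances, keep those above the first threshold.
        for o in similar:
            if o["score"] > score_thr and id(o) not in seen:
                seen.add(id(o))
                kept.append(o)
    return kept

# S202: take a target picture from the video stream, then segment it, e.g.:
# cap = cv2.VideoCapture("stream.mp4"); ok, frame = cap.read()
# instances = segment_frame(frame, model)
```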
Optionally, in this embodiment, the above scheme may be applied, but is not limited, to scenes such as portrait photographing, video special effects, AR, automatic driving, video target tracking, and unmanned-aerial-vehicle video image processing. Tracking a target object means performing instance segmentation on the pictures of a video stream and then tracking the target across the video; fast and accurate instance segmentation is a good foundation for that next step.
According to the embodiment provided by the present application: a target picture is acquired from the video stream; the target picture is input into a target instance segmentation neural network and a first instance set is output, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer, the detection network acquiring parameters of instance bounding boxes, the feature map processing layer processing the bounding box parameters to obtain target parameters, and the mask processing layer performing instance segmentation on the target picture according to the target parameters; similar instances of each target instance in the first instance set are determined according to the degree of overlap between target instances; and the instances among the similar instances that exceed a first predetermined threshold are determined, yielding at least one instance picture of the target instance in the target picture. This achieves fast and accurate instance segmentation and solves the technical problem of slow instance segmentation in the prior art.
Optionally, before the target picture is input into the target instance segmentation neural network and the first instance set is output, the method may include: acquiring a sample picture set from a video stream; annotating the target object in each picture of the sample picture set to obtain a target data set; inputting the annotated data set into a preset instance segmentation neural network, wherein the preset neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the detection network is used to acquire parameters of instance bounding boxes in the sample pictures, the feature map processing layer processes those parameters to obtain preset target parameters, the mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function; and determining the instance segmentation neural network when the target loss function satisfies a predetermined condition.
Optionally, annotating the target object in each picture of the sample picture set to obtain the target data set includes: applying the standard instance-segmentation data augmentation techniques to each picture and its annotation result in the sample picture set to obtain the target data set.
Optionally, after annotating the target object in each picture of the sample picture set to obtain the target data set, the method further includes: dividing the target data set into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate it, and the test set is used to test the preset neural network segmentation model.
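A minimal sketch of the ratio-based split described above; the 8:1:1 default ratio and the function names are assumptions, as the patent leaves the preset ratio open:

```python
import random

def split_dataset(samples, k1=8, k2=1, k3=1, seed=0):
    """Randomly split `samples` into train/validation/test sets in the
    ratio k1:k2:k3. The 8:1:1 default is an illustrative assumption."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    total = k1 + k2 + k3
    n_train = len(shuffled) * k1 // total
    n_valid = len(shuffled) * k2 // total
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test
```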
Optionally, before the annotated data set is input into the preset instance segmentation neural network, the method further includes: constructing an initialized detection network, wherein the detection network comprises a feature-extraction backbone network, a feature-enhancement network, and a detection head; the backbone network extracts features from the instances in each picture of the sample picture set to obtain feature maps, the feature-enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input to the detection head to obtain the parameters of the sample instance bounding boxes; and constructing a preset instance segmentation neural network from the initialized detection network, a preset feature map processing layer, and a preset mask processing layer, wherein the preset feature map processing layer processes the parameters of the sample instance bounding boxes to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
As an alternative embodiment, the present application further provides an instance segmentation method, whose flowchart is shown in FIG. 3. The details are as follows.
Step 31: initialize and preprocess the video images to be detected.
Video image preprocessing comprises the following. Initialize the video images to be detected, denoted X_V (their dimensions are specified by a formula that survives only as an image in the original), and denote the number of images in X_V by K_V. Manually annotate the targets of interest in the video images X_V according to the standard instance segmentation annotation method, and denote the annotation result G_V. Apply the standard instance-segmentation data augmentation techniques to the video images X_V and the annotation result G_V to obtain the final data set, denoted Ω. Initialize the ratio of the numbers of images in the training, validation, and test subsets of the data set Ω, denoted K_1 : K_2 : K_3. Randomly divide the data set Ω into a training set, a validation set, and a test set in the ratio K_1 : K_2 : K_3, and denote them Ω_train, Ω_valid, and Ω_test respectively.
The standard instance segmentation annotation method refers to annotation for instance segmentation, which aims to predict the position and semantic mask of each instance in an image; the open-source tool labelme is used for the annotation.
The standard instance-segmentation data augmentation techniques expand the data set by flipping, rotating, scaling, and translating its images, adding Gaussian noise, and applying contrast and color transforms. Data augmentation mainly reduces the network's overfitting; transforming the training pictures yields a network with stronger generalization ability, which better suits the application scenario.
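The following Python sketch illustrates the kind of augmentation pipeline described above. The operation set comes from the text; the parameter ranges and the use of OpenCV are assumptions:

```python
import numpy as np
import cv2  # assumed; the patent names the operations, not a library

def augment(image, mask):
    """One random draw of the augmentations listed above: flip, rotation,
    scaling, translation (via an affine warp), Gaussian noise, contrast and
    color transforms. All parameter ranges are illustrative assumptions."""
    h, w = image.shape[:2]
    if np.random.rand() < 0.5:                            # random horizontal flip
        image, mask = cv2.flip(image, 1), cv2.flip(mask, 1)
    angle = np.random.uniform(-15, 15)                    # rotation (degrees)
    scale = np.random.uniform(0.8, 1.2)                   # scaling
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    M[:, 2] += np.random.uniform(-0.1, 0.1, 2) * (w, h)   # translation
    image = cv2.warpAffine(image, M, (w, h))
    mask = cv2.warpAffine(mask, M, (w, h), flags=cv2.INTER_NEAREST)
    image = image.astype(np.float32)
    image += np.random.normal(0.0, 5.0, image.shape)      # Gaussian noise
    image *= np.random.uniform(0.8, 1.2)                  # contrast transform
    image[..., 0] *= np.random.uniform(0.9, 1.1)          # simple color transform
    return np.clip(image, 0, 255).astype(np.uint8), mask
```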
Step 32: construct and initialize the convolutional neural detection network.
Construct and initialize a standard convolutional neural detection network model, denoted W_D, according to the standard YOLOv4 network construction method, where W_D consists of a feature-extraction backbone network, denoted W_B, a feature-enhancement network, denoted W_N, and a detection head, denoted W_P. The result of a convolutional layer followed by a batch normalization layer and a Leaky ReLU (leaky rectified linear unit) is denoted C_BL.
W_B extracts features using the standard CSPDarknet53 network.
In the feature-enhancement network W_N, the output of W_B is processed in parallel by max pooling at 1 × 1, 5 × 5, 9 × 9, and 13 × 13, and the pooled feature maps of different scales are combined by a tensor concatenation operation; the result is denoted S_PP. S_PP is followed by a cross-stage local connection module (CSP module) whose residual components are replaced by C_BL; the result is denoted A_5. A_5 is connected to a C_BL and then upsampled twice in succession by bilinear interpolation; the results are denoted A_4 and A_3 in turn. A_5, A_4, and A_3 are assembled according to the standard feature pyramid and path enhancement structure, with the tensor additions in the path enhancement structure replaced by tensor concatenations. The results of the four tensor concatenation operations in this structure are each followed by a CSP module whose residual components are replaced by C_BL, and the 1 × 1 convolutional layers of the standard feature pyramid and path enhancement structure are replaced by standard CoordConv layers, yielding the final feature-enhancement network, denoted W_N*. The outputs of W_N* are denoted, from the largest feature-map size to the smallest, P_3, P_4, and P_5.
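As an illustration of the C_BL unit and the parallel max-pooling block S_PP described above, here is a PyTorch-style sketch; channel sizes and defaults are assumptions rather than the patent's implementation:

```python
import torch
import torch.nn as nn

class CBL(nn.Module):
    """Conv + BatchNorm + Leaky ReLU, the C_BL unit used throughout W_N."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.block = nn.Sequential(
            nn.Conv2d(c_in, c_out, k, s, k // 2, bias=False),
            nn.BatchNorm2d(c_out),
            nn.LeakyReLU(0.1, inplace=True))

    def forward(self, x):
        return self.block(x)

class SPP(nn.Module):
    """Parallel max pooling at 1/5/9/13 followed by tensor concatenation (S_PP)."""
    def __init__(self):
        super().__init__()
        self.pools = nn.ModuleList(
            nn.MaxPool2d(k, stride=1, padding=k // 2) for k in (5, 9, 13))

    def forward(self, x):
        # the 1 x 1 pooling branch is the identity; concat keeps the spatial size
        return torch.cat([x] + [p(x) for p in self.pools], dim=1)
```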
In W_P, the maps P_3, P_4, and P_5 are each connected by convolution to a standard YOLOv3 detection head in turn; the first convolutional layer of each YOLOv3 detection head is replaced with a standard CoordConv layer. The center-point adjustment formula of the YOLOv3 detection head is widened by a coefficient α, i.e. x = s·(g_x + α·σ(p_x) − (α − 1)/2) and y = s·(g_y + α·σ(p_y) − (α − 1)/2), where x and y are the bounding-box center coordinates, σ is the Sigmoid function, s is the scale factor, σ(p_x) and σ(p_y) are the center-coordinate offsets, and g_x and g_y represent the coordinates of the center of the real bounding box. The coefficient α is initialized to 1.05, giving the final detection head, denoted W_P*. The feature-extraction backbone network W_B, the feature-enhancement network W_N*, and the detection head W_P* are connected by convolution and initialized to obtain the final detection model, denoted W_D*.
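The widened center-point decoding can be written directly from the formula above; this small sketch assumes only the symbols defined there:

```python
import math

def decode_center(gx, gy, px, py, s, alpha=1.05):
    """Decode a box center with x = s*(g_x + a*sigmoid(p_x) - (a - 1)/2).
    With alpha > 1 the sigmoid output can reach the borders of the grid cell."""
    sig = lambda t: 1.0 / (1.0 + math.exp(-t))
    x = s * (gx + alpha * sig(px) - (alpha - 1) / 2)
    y = s * (gy + alpha * sig(py) - (alpha - 1) / 2)
    return x, y
```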
The standard CSPDarknet53 is a backbone structure generated from the YOLOv3 backbone Darknet53 by drawing on the 2019 CSPNet design; it contains five CSP modules (cross-stage local connection modules). YOLOv4 improves accuracy over YOLOv3 by nearly 10 points while losing almost no speed; it is thus a detection model that is both fast and accurate, and it can be trained with only a 1080Ti or 2080Ti.
The standard feature pyramid and path enhancement structure is based on the feature pyramid framework and enhances information propagation: it adds a bottom-up enhancement path, thereby improving the propagation of low-level features. Each stage of the newly added third path takes the feature maps of the previous stage as input and processes them with a 3 × 3 convolutional layer; the convolution output is added through lateral connections to the same-stage feature maps of the top-down path, and the resulting maps are sent to the next stage.
In the standard CoordConv layer: the convolution operation in deep learning is translation-equivariant, which allows uniform convolution kernel parameters to be shared at different positions of an image, but during convolutional learning the coordinates of the current feature within the image cannot be perceived. CoordConv adds channels to the convolution's input feature map that encode the coordinates of its pixels, so that the convolution can perceive coordinates to some degree during learning and improve detection accuracy; feature extraction is thereby optimized with almost no increase in computation.
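A minimal CoordConv sketch, assuming coordinates normalized to [-1, 1] (a common choice; the patent does not specify the encoding):

```python
import torch
import torch.nn as nn

class CoordConv(nn.Module):
    """Convolution whose input is augmented with two channels holding the
    normalized x/y coordinate of every pixel (the CoordConv idea)."""
    def __init__(self, c_in, c_out, k=1, s=1):
        super().__init__()
        self.conv = nn.Conv2d(c_in + 2, c_out, k, s, k // 2)

    def forward(self, x):
        b, _, h, w = x.shape
        ys = torch.linspace(-1, 1, h, device=x.device).view(1, 1, h, 1).expand(b, 1, h, w)
        xs = torch.linspace(-1, 1, w, device=x.device).view(1, 1, 1, w).expand(b, 1, h, w)
        return self.conv(torch.cat([x, xs, ys], dim=1))
```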
The standard YOLOv3 detection head: the YOLOv3 network consists of the feature-extraction network Darknet53 and the YOLOv3 detection head. The detection head performs target detection on three feature maps of different scales, which allows finer-grained features to be detected and benefits the detection of small targets.
Step 33: construct and initialize the instance segmentation network.
To the convolutional neural detection network model W_D* obtained in step 32, add a feature-map preprocessing layer, denoted W_Pre, and a mask branch, denoted W_M, to obtain the final instance segmentation model, denoted W_IS; the structure of the instance segmentation network is shown in FIG. 4.
In the feature-map preprocessing layer W_Pre, for each target output by W_D*, compute the rectangular box area using the formula s_area = w·h, where w and h are the width and height of the detected target respectively. Each detection result is then distributed, as a mask proposal, to the corresponding feature map of the feature-enhancement network W_N* according to area-based rules: detections whose area falls in the first range are allocated to P_3 for processing, those in the second range to P_4, and those in the third range to P_5 (the exact range boundaries are given by formulas that survive only as images in the original).
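A sketch of the area-based routing rule, with stand-in thresholds: the original cut-offs and the exact small-to-P_3 ordering survive only as formula images, so both are assumptions here:

```python
def assign_level(w, h, t_small=32 ** 2, t_large=96 ** 2):
    """Route a detection of width w and height h to P3/P4/P5 by box area.
    The thresholds 32^2 and 96^2 are stand-ins for the patent's formulas."""
    s_area = w * h
    if s_area < t_small:
        return "P3"          # assumed: smallest boxes -> highest-resolution map
    if s_area < t_large:
        return "P4"
    return "P5"
```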
Based on the above allocation, the mask branch W_M takes the corresponding feature map out of the feature-enhancement network and performs instance segmentation on it. A region-of-interest alignment (RoIAlign) operation is first applied to the extracted feature map; the result then passes through n_c convolutional layers with 3 × 3 kernels, n_d deconvolution layers with 2 × 2 kernels, and a convolutional layer with 1 × 1 kernels, giving the segmentation result, denoted R_1, where the channel dimension of each convolutional layer is C_D. A shortcut path is then added: the output of the (n_c − 1)-th convolutional layer with 3 × 3 kernels is processed by a further convolutional layer with 3 × 3 kernels, its channel dimension is reduced to C_D/2 by a convolutional layer with 1 × 1 kernels, and a fully connected layer is introduced that turns it into a vector, whose spatial size is made consistent with R_1 by a matrix dimension change; the final result is denoted R_2. R_1 and R_2 are added to obtain the final mask result, denoted R_mask. The detection model W_D*, the feature-map preprocessing layer W_Pre, and the mask branch W_M are directly connected and initialized according to the structure diagram to obtain the final instance segmentation model W_IS; the structure of the mask processing layer is shown in FIG. 5.
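The mask branch described above might look as follows in PyTorch; n_c, C_D, and the RoIAlign output size are assumed values, and the layer arrangement follows the text's description rather than any published code:

```python
import torch
import torch.nn as nn

class MaskBranch(nn.Module):
    """Sketch of W_M: RoIAligned features -> n_c 3x3 convs, a 2x2 deconv, and a
    1x1 conv (R1); plus a shortcut that turns intermediate features into a
    vector via a fully connected layer and reshapes it to R1's size (R2)."""
    def __init__(self, c_d=256, n_c=4, roi=14):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Sequential(nn.Conv2d(c_d, c_d, 3, padding=1), nn.ReLU(True))
            for _ in range(n_c))
        self.deconv = nn.ConvTranspose2d(c_d, c_d, 2, stride=2)
        self.mask = nn.Conv2d(c_d, 1, 1)
        # shortcut: 3x3 conv, 1x1 conv halving channels, FC to a (2*roi)^2 map
        self.sc_conv3 = nn.Conv2d(c_d, c_d, 3, padding=1)
        self.sc_conv1 = nn.Conv2d(c_d, c_d // 2, 1)
        self.fc = nn.Linear((c_d // 2) * roi * roi, (2 * roi) * (2 * roi))

    def forward(self, x):                    # x: RoIAligned map, (N, c_d, roi, roi)
        h, sc = x, None
        for i, conv in enumerate(self.convs):
            h = conv(h)
            if i == len(self.convs) - 2:     # output of the (n_c - 1)-th conv
                sc = h
        r1 = self.mask(self.deconv(h))       # (N, 1, 2*roi, 2*roi)
        v = self.fc(torch.flatten(self.sc_conv1(self.sc_conv3(sc)), 1))
        r2 = v.view(-1, 1, r1.shape[2], r1.shape[3])
        return r1 + r2                       # R_mask
```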
Step 34: train and tune the instance segmentation network.
Initialize the image-processing batch size and mini-batch size, denoted BS and mini-BS respectively; initialize the learning rate, denoted LR; initialize the weight decay rate and the momentum, denoted WDR and MO respectively; initialize the training period, denoted epoch. Take values sampled from a Gaussian distribution with mean 0 and variance 1 as the initial parameters of the instance segmentation model W_IS, obtaining the initialized network parameters W_old. Using the standard instance-segmentation network training techniques, randomly shuffle the training set Ω_train from step 31 and feed it mini-batch by mini-batch into the instance segmentation model W_IS obtained in step 33. Compute the loss function value of W_IS, denoted Loss_old, where the loss function is Loss = Loss_cls + Loss_conf + Loss_box + Loss_mask, with Loss_cls, Loss_conf, and Loss_mask being binary cross-entropy loss functions and Loss_box the standard CIoU loss function. Update the network parameters W_old with the standard stochastic gradient descent method, introducing the standard exponential moving average technique, i.e. W_EMA = λ·W_EMA + (1 − λ)·W_new with the parameter λ initialized to 0.998, finally obtaining the new network parameters, denoted W_new. The final model and parameters obtained by training W_IS are denoted W_IS*.
Compared with the Dropout algorithm, the DropBlock algorithm drops a contiguous region of features rather than individual feature points, which makes it better suited to instance segmentation tasks for improving the network's generalization ability. The standard instance-segmentation network training techniques use Mosaic data augmentation, which stitches four pictures into one by random scaling, random cropping, and random arrangement, to improve performance on small and medium targets. In addition, if the loss of small objects falls below a certain threshold in one iteration, the next iteration uses a stitched picture; otherwise normal images are used for training. The images are also adaptively scaled during training, and techniques such as CmBN and SAT self-adversarial training are likewise used to train the network.
Regarding the standard CIoU loss function: DIoU conforms better to the target-box regression mechanism than GIoU, as it takes the distance, overlap rate, and scale between the target and the anchor into account, making box regression more stable; on top of DIoU, CIoU additionally considers the aspect ratio among the three box-regression factors, making the result more accurate.
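For reference, the CIoU loss mentioned above is commonly written as follows (standard literature form, not quoted from the patent):

```latex
\mathcal{L}_{\mathrm{CIoU}} = 1 - \mathrm{IoU}
  + \frac{\rho^{2}\!\left(b,\, b^{gt}\right)}{c^{2}} + \alpha v,
\qquad
v = \frac{4}{\pi^{2}}\left(\arctan\frac{w^{gt}}{h^{gt}} - \arctan\frac{w}{h}\right)^{2},
\qquad
\alpha = \frac{v}{(1 - \mathrm{IoU}) + v}
```

where b and b^gt are the centers of the predicted and ground-truth boxes, ρ is their Euclidean distance, and c is the diagonal length of the smallest box enclosing both.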
The standard exponential moving average technique takes the average of a parameter's past values as its new value. Compared with updating the parameters directly, the exponential moving average makes parameter learning smoother, effectively prevents abnormal values from disturbing the parameter updates, and improves the convergence of model training.
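A one-function sketch of the update with λ = 0.998, assuming parameters are stored as a name-to-tensor mapping:

```python
def ema_update(ema_params, new_params, lam=0.998):
    """Exponential moving average: W_EMA <- lam * W_EMA + (1 - lam) * W_new.
    `ema_params` and `new_params` are dicts of name -> tensor-like values."""
    for name, w in new_params.items():
        ema_params[name] = lam * ema_params[name] + (1.0 - lam) * w
    return ema_params
```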
Step 35: perform real-time instance segmentation on the video stream to be detected.
Initialize the video stream captured by the camera in real time as the video stream to be detected, denoted V. Decode the video stream V according to the FFmpeg standard using multithreading, and feed the decoded frames into the video-image instance segmentation model W_IS* obtained in step 34 to obtain the output result R_result. Suppress the overlaps in R_result with the standard Matrix NMS; the result is the final video-stream instance segmentation result.
In the Matrix NMS, for example, when computing the suppression coefficient of a prediction box B, the IoU between B and every prediction box scoring higher than B is computed in parallel in matrix form, and the suppression coefficient of B is then approximately estimated from those IoUs and the suppression probabilities of the higher-scoring prediction boxes. This parallelizes the Soft-NMS computation and improves detection accuracy while avoiding a reduction in inference speed.
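The decay computation can be sketched as follows, following the published Matrix NMS formulation (linear and Gaussian kernels); treating boxes or masks identically through a precomputed IoU matrix is an assumption of this sketch:

```python
import numpy as np

def matrix_nms(scores, ious, kernel="linear", sigma=2.0):
    """Parallel Soft-NMS decay in matrix form (after the published Matrix NMS).
    `scores` must be sorted in descending order, and `ious[i, j]` must hold
    the IoU between predictions i and j. Returns the decayed scores."""
    ious = np.triu(ious, k=1)              # keep IoU against higher scorers only
    compensate = ious.max(axis=0)          # how suppressed each suppressor was
    if kernel == "gaussian":
        decay = np.exp(-sigma * (ious ** 2 - compensate[:, None] ** 2))
    else:                                  # linear kernel
        decay = (1.0 - ious) / (1.0 - compensate[:, None])
    return scores * decay.min(axis=0)      # one decay coefficient per prediction
```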
According to this embodiment, the real-time camera stream is decoded according to the FFmpeg standard with multithreading, fed into the instance segmentation model W_IS* obtained in step 34, and the overlaps in the output R_result are suppressed by the standard Matrix NMS to give the final video-stream instance segmentation result. Instance segmentation can thus be achieved quickly and accurately.
It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.
According to another aspect of the embodiments of the present invention, there is also provided a neural-network-based instance segmentation apparatus for implementing the above neural-network-based instance segmentation method. As shown in FIG. 6, the instance segmentation apparatus comprises: a first acquisition unit 61, an output unit 63, a first determining unit 65, and a second determining unit 67.
A first acquisition unit 61, configured to acquire a target picture from a video stream.
An output unit 63, configured to input the target picture into a target instance segmentation neural network and output the first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used to acquire parameters of instance bounding boxes, the feature map processing layer processes the bounding box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters.
A first determining unit 65, configured to determine similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set.
A second determining unit 67, configured to determine the instances among the similar instances that exceed the first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture.
With the embodiment provided by the present application, the first acquisition unit 61 acquires a target picture from the video stream; the output unit 63 inputs the target picture into a target instance segmentation neural network and outputs a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer as described above; the first determining unit 65 determines similar instances of each target instance in the first instance set according to the degree of overlap between target instances; and the second determining unit 67 determines the instances among the similar instances that exceed the first predetermined threshold, obtaining at least one instance picture of the target instance in the target picture. Segmenting the target picture with an instance segmentation neural network that has a detection network, a feature map processing layer, and a mask processing layer, and determining the target instances by thresholding the segmentation result, achieves fast and accurate segmentation and solves the technical problem of slow instance segmentation in the prior art.
As an alternative embodiment, the apparatus may include:
a second acquisition unit, configured to acquire a sample picture set from a video stream before the target picture is input into the target instance segmentation neural network and the first instance set is output;
an obtaining unit, configured to annotate the target object in each picture of the sample picture set to obtain a target data set;
an input unit, configured to input the annotated data set into a preset instance segmentation neural network, wherein the preset neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the detection network is used to acquire parameters of instance bounding boxes in the sample pictures, the feature map processing layer processes those parameters to obtain preset target parameters, the mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function;
and a third determining unit, configured to determine the instance segmentation neural network when the target loss function satisfies a predetermined condition.
As an alternative embodiment, the obtaining unit may include:
an obtaining module, configured to apply the standard instance-segmentation data augmentation techniques to each picture and its annotation result in the sample picture set to obtain the target data set.
As an alternative embodiment, the apparatus may include:
a dividing unit, configured to divide the target data set, after the target object in each picture of the sample picture set has been annotated, into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate it, and the test set is used to test the preset neural network segmentation model.
As an alternative embodiment, the apparatus may include:
a first construction unit, configured to construct an initialized detection network before the annotated data set is input into the preset instance segmentation neural network, wherein the detection network comprises a feature-extraction backbone network, a feature-enhancement network, and a detection head; the backbone network extracts features from the instances in each picture of the sample picture set to obtain feature maps, the feature-enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input to the detection head to obtain the parameters of the sample instance bounding boxes;
and a second construction unit, configured to construct the preset instance segmentation neural network from the initialized detection network, the preset feature map processing layer, and the preset mask processing layer, wherein the preset feature map processing layer processes the parameters of the sample instance bounding boxes to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
According to another aspect of the embodiments of the present invention, there is also provided an electronic device for implementing the above neural-network-based instance segmentation method; the electronic device may be the terminal device or the server shown in FIG. 1. This embodiment takes the server as an example. As shown in FIG. 7, the electronic device comprises a memory 702 and a processor 704; the memory 702 stores a computer program, and the processor 704 is arranged to perform the steps of any of the above method embodiments by means of the computer program.
Optionally, in this embodiment, the electronic device may be located in at least one network device of a plurality of network devices of a computer network.
Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:
S1, acquiring a target picture from the video stream;
S2, inputting the target picture into a target instance segmentation neural network and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used to acquire parameters of instance bounding boxes, the feature map processing layer processes the bounding box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters;
S3, determining similar instances of each target instance in the first instance set according to the degree of overlap between target instances in the first instance set;
and S4, determining the instances among the similar instances that exceed the first predetermined threshold, to obtain at least one instance picture of the target instance in the target picture.
Alternatively, those skilled in the art will understand that the structure shown in FIG. 7 is only illustrative; the electronic device may also be a terminal device such as a smartphone (e.g., an Android or iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), a PAD, or the like. FIG. 7 does not limit the structure of the electronic device; for example, the electronic device may include more or fewer components (such as network interfaces) than shown in FIG. 7 or have a configuration different from that shown in FIG. 7.
The memory 702 may be used to store software programs and modules, such as the program instructions/modules corresponding to the neural-network-based instance segmentation method and apparatus in the embodiments of the present invention; the processor 704 executes various functional applications and data processing by running the software programs and modules stored in the memory 702, thereby implementing the above instance segmentation method. The memory 702 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 702 may further include memory located remotely from the processor 704, which may be connected to the terminal over a network; examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof. The memory 702 may be used, in particular but not exclusively, to store information such as the target picture and the result of its instance segmentation. As an example, as shown in FIG. 7, the memory 702 may include, but is not limited to, the first acquisition unit 61, the output unit 63, the first determining unit 65, and the second determining unit 67 of the above instance segmentation apparatus; it may also include other module units of the apparatus, which are not described again in this example.
Optionally, the transmission device 706 is used to receive or send data via a network. Examples of the network include wired and wireless networks. In one example, the transmission device 706 includes a Network Interface Controller (NIC), which can be connected to a router and other network devices via a network cable so as to communicate with the internet or a local area network. In another example, the transmission device 706 is a Radio Frequency (RF) module, which communicates with the internet wirelessly.
In addition, the electronic device further includes: a display 708, configured to display the picture to be instance-segmented and the instance segmentation result; and a connection bus 710 for connecting the module components of the electronic device.
In other embodiments, the terminal device or the server may be a node in a distributed system, where the distributed system may be a blockchain system formed by a plurality of nodes connected through network communication. The nodes may form a peer-to-peer (P2P) network, and any form of computing device, such as a server, a terminal, or another electronic device, may become a node of the blockchain system by joining the peer-to-peer network.
According to an aspect of the application, there is provided a computer program product or computer program comprising computer instructions stored in a computer-readable storage medium. A processor of a computer device reads the computer instructions from the computer-readable storage medium and executes them, causing the computer device to perform the above neural network-based instance segmentation method. The computer program is arranged to perform the steps of any of the above method embodiments when executed.
Optionally, in this embodiment, the above computer-readable storage medium may be configured to store a computer program for executing the following steps:
S1, acquiring a target picture from a video stream;
S2, inputting the target picture into a target instance segmentation neural network, and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used for obtaining bounding-box parameters of instances, the feature map processing layer processes the bounding-box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters;
S3, determining similar instances of each target instance in the first instance set according to the degree of overlap between the target instances in the first instance set;
and S4, determining, among the similar instances, those whose degree of overlap exceeds the first preset threshold, to obtain at least one instance picture of the target instance in the target picture.
Optionally, in this embodiment, those skilled in the art will understand that all or part of the steps in the methods of the above embodiments may be implemented by a program instructing the hardware related to a terminal device, and the program may be stored in a computer-readable storage medium, which may include: a flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and the like.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
The integrated unit in the above embodiments, if implemented in the form of a software functional unit and sold or used as a separate product, may be stored in the above computer-readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing one or more computer devices (which may be personal computers, servers, network devices, etc.) to execute all or part of the steps of the method according to the embodiments of the present invention.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the several embodiments provided in the present application, it should be understood that the disclosed client may be implemented in other manners. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units is only one type of division of logical functions, and there may be other divisions when actually implemented, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The foregoing is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make various modifications and improvements without departing from the principle of the present invention, and such modifications and improvements should also fall within the protection scope of the present invention.

Claims (11)

1. An instance segmentation method based on a neural network, characterized by comprising:
acquiring a target picture from a video stream;
inputting the target picture into a target instance segmentation neural network, and outputting a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used for obtaining bounding-box parameters of instances, the feature map processing layer processes the bounding-box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters;
determining similar instances of each target instance in the first instance set according to the degree of overlap between the target instances in the first instance set; and
determining, among the similar instances, the instances whose degree of overlap is greater than a first preset threshold, to obtain at least one instance picture of the target instance in the target picture.
2. The method of claim 1, wherein before inputting the target picture into the target instance segmentation neural network and outputting the first instance set, the method comprises:
acquiring a sample picture set from a video stream;
labeling a target object in each picture in the sample picture set to obtain a target data set;
inputting the labeled data set into a preset instance segmentation neural network, wherein the preset instance segmentation neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the preset detection network is used for obtaining bounding-box parameters of the instances in the sample pictures, the preset feature map processing layer processes those bounding-box parameters to obtain preset target parameters, the preset mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function; and
determining the preset instance segmentation neural network as the target instance segmentation neural network if the target loss function satisfies a predetermined condition.
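As a non-limiting illustration of the target loss function in claim 2, the sketch below sums a binary cross-entropy term and a soft intersection-over-union term over one predicted mask. The equal weighting of the two terms and the helper name `bce_iou_loss` are assumptions for this sketch; the claim does not fix how the two terms are combined.

```python
import numpy as np

def bce_iou_loss(pred: np.ndarray, target: np.ndarray, eps: float = 1e-7) -> float:
    """Target loss = binary cross-entropy + IoU loss over one mask.

    `pred` holds per-pixel foreground probabilities; `target` is the
    binary ground-truth mask of the same shape.
    """
    pred = np.clip(pred, eps, 1.0 - eps)
    bce = -np.mean(target * np.log(pred) + (1.0 - target) * np.log(1.0 - pred))
    inter = (pred * target).sum()
    union = pred.sum() + target.sum() - inter
    iou = 1.0 - inter / (union + eps)       # soft IoU loss in [0, 1]
    return float(bce + iou)
```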
3. The method of claim 2, wherein labeling the target object in each picture in the sample picture set to obtain the target data set comprises:
performing data enhancement on each picture in the sample picture set and its labeling result by using standard instance segmentation data enhancement techniques to obtain the target data set.
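One concrete example of such a standard augmentation is a horizontal flip applied jointly to a picture and its instance masks, so the labeling result stays aligned with the pixels; the function name and the `(N, H, W)` mask layout below are illustrative assumptions, and typical pipelines also add random scaling and cropping.

```python
import numpy as np

def hflip_sample(image: np.ndarray, masks: np.ndarray):
    """Flip a picture (H, W, C) and its instance masks (N, H, W) together."""
    return np.fliplr(image).copy(), np.flip(masks, axis=2).copy()
```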
4. The method of claim 2, wherein after labeling the target object in each picture in the sample picture set to obtain the target data set, the method further comprises:
dividing the target data set into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate the preset instance segmentation neural network, and the test set is used to test the preset instance segmentation neural network.
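A minimal sketch of this division follows; the 8:1:1 ratio and the fixed shuffle seed are example choices, not values fixed by the claim.

```python
import random

def split_dataset(samples, ratios=(0.8, 0.1, 0.1), seed=0):
    """Divide the target data set into training, validation, and test
    subsets according to a preset ratio."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)   # shuffle a copy, leave input intact
    n_train = int(len(shuffled) * ratios[0])
    n_val = int(len(shuffled) * ratios[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])
```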
5. The method of claim 2, wherein before inputting the labeled data set into the preset instance segmentation neural network, the method further comprises:
constructing an initialized detection network, wherein the detection network comprises a feature extraction backbone network, a feature enhancement network, and a detection head; the feature extraction backbone network performs feature extraction on the instances in each picture of the sample picture set to obtain feature maps, the feature enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input into the detection head to obtain bounding-box parameters of the sample instances; and
constructing the preset instance segmentation neural network from the initialized detection network, a preset feature map processing layer, and a preset mask processing layer, wherein the preset feature map processing layer processes the bounding-box parameters of the sample instances to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
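For claim 5, a structural sketch of the initialized detection network is given below. The callables `backbone`, `enhancer`, and `head` are assumptions (e.g., a ResNet-style backbone, an FPN-style enhancement network, and a shared detection head); the claim names the three components but does not specify their internals.

```python
class DetectionNetwork:
    """Backbone -> feature enhancement (with size marking) -> detection head."""

    def __init__(self, backbone, enhancer, head):
        self.backbone = backbone    # per-picture feature extraction
        self.enhancer = enhancer    # enhances feature maps and marks their sizes
        self.head = head            # predicts bounding-box parameters per size

    def __call__(self, picture):
        feature_maps = self.backbone(picture)
        sized_maps = self.enhancer(feature_maps)          # [(size, fmap), ...]
        return [self.head(size, fmap) for size, fmap in sized_maps]
```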
6. An instance segmentation apparatus based on a neural network, comprising:
a first acquiring unit, configured to acquire a target picture from a video stream;
an output unit, configured to input the target picture into a target instance segmentation neural network and output a first instance set, wherein the instance segmentation neural network comprises a detection network, a feature map processing layer, and a mask processing layer; the detection network is used for obtaining bounding-box parameters of instances, the feature map processing layer processes the bounding-box parameters to obtain target parameters, and the mask processing layer performs instance segmentation on the target picture according to the target parameters;
a first determining unit, configured to determine similar instances of each target instance in the first instance set according to the degree of overlap between the target instances in the first instance set; and
a second determining unit, configured to determine, among the similar instances, the instances whose degree of overlap is greater than a first preset threshold, to obtain at least one instance picture of the target instance in the target picture.
7. The apparatus of claim 6, further comprising:
a second acquiring unit, configured to acquire a sample picture set from a video stream before the target picture is input into the target instance segmentation neural network and the first instance set is output;
an obtaining unit, configured to label a target object in each picture in the sample picture set to obtain a target data set;
an input unit, configured to input the labeled data set into a preset instance segmentation neural network, wherein the preset instance segmentation neural network comprises a preset detection network, a preset feature map processing layer, a preset mask processing layer, and a target loss function; the preset detection network is used for obtaining bounding-box parameters of the instances in the sample pictures, the preset feature map processing layer processes those bounding-box parameters to obtain preset target parameters, the preset mask processing layer performs instance segmentation on the sample pictures according to the preset target parameters, and the target loss function comprises a binary cross-entropy loss function and an intersection-over-union (IoU) loss function; and
a third determining unit, configured to determine the preset instance segmentation neural network as the target instance segmentation neural network if the target loss function satisfies a predetermined condition.
8. The apparatus of claim 7, wherein the obtaining unit comprises:
an obtaining module, configured to perform data enhancement on each picture in the sample picture set and its labeling result by using standard instance segmentation data enhancement techniques to obtain the target data set.
9. The apparatus of claim 7, further comprising:
a dividing unit, configured to, after the target object in each picture in the sample picture set is labeled to obtain the target data set, divide the target data set into a training set, a validation set, and a test set according to a preset ratio, wherein the training set is used to train the preset instance segmentation neural network, the validation set is used to validate the preset instance segmentation neural network, and the test set is used to test the preset instance segmentation neural network.
10. The apparatus of claim 7, further comprising:
a first construction unit, configured to construct an initialized detection network before the labeled data set is input into the preset instance segmentation neural network, wherein the detection network comprises a feature extraction backbone network, a feature enhancement network, and a detection head; the feature extraction backbone network performs feature extraction on the instances in each picture of the sample picture set to obtain feature maps, the feature enhancement network enhances the feature maps and marks their sizes, and the feature maps marked with different sizes are input into the detection head to obtain bounding-box parameters of the sample instances; and
a second construction unit, configured to construct the preset instance segmentation neural network from the initialized detection network, a preset feature map processing layer, and a preset mask processing layer, wherein the preset feature map processing layer processes the bounding-box parameters of the sample instances to obtain sample target parameters, and the preset mask processing layer performs instance segmentation on the sample pictures according to the sample target parameters.
11. A computer-readable storage medium comprising a stored program, wherein the program, when executed, performs the method of any one of claims 1 to 5.
CN202011166214.8A 2020-10-27 2020-10-27 Example segmentation method and device based on neural network and storage medium Pending CN112348828A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011166214.8A CN112348828A (en) 2020-10-27 2020-10-27 Example segmentation method and device based on neural network and storage medium

Publications (1)

Publication Number Publication Date
CN112348828A true CN112348828A (en) 2021-02-09

Family

ID=74358675

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011166214.8A Pending CN112348828A (en) 2020-10-27 2020-10-27 Example segmentation method and device based on neural network and storage medium

Country Status (1)

Country Link
CN (1) CN112348828A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110910334A (en) * 2018-09-15 2020-03-24 北京市商汤科技开发有限公司 Instance segmentation method, image processing device and computer readable storage medium
CN109584248A (en) * 2018-11-20 2019-04-05 西安电子科技大学 Infrared surface object instance dividing method based on Fusion Features and dense connection network
CN110200598A (en) * 2019-06-12 2019-09-06 天津大学 A kind of large-scale plant that raises sign exception birds detection system and detection method
CN111598942A (en) * 2020-03-12 2020-08-28 中国电力科学研究院有限公司 Method and system for automatically positioning electric power facility instrument

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112835718A (en) * 2021-02-10 2021-05-25 北京灵汐科技有限公司 Method and device for processing task, many-core system and computer readable medium
WO2022171002A1 (en) * 2021-02-10 2022-08-18 北京灵汐科技有限公司 Task processing method and apparatus, many-core system, and computer-readable medium
CN113312999A (en) * 2021-05-19 2021-08-27 华南农业大学 High-precision detection method and device for diaphorina citri in natural orchard scene
CN113312999B (en) * 2021-05-19 2023-07-07 华南农业大学 High-precision detection method and device for diaphorina citri in natural orchard scene
CN113222874A (en) * 2021-06-01 2021-08-06 平安科技(深圳)有限公司 Data enhancement method, device and equipment applied to target detection and storage medium
CN113222874B (en) * 2021-06-01 2024-02-02 平安科技(深圳)有限公司 Data enhancement method, device, equipment and storage medium applied to target detection
CN113706475A (en) * 2021-08-06 2021-11-26 福建自贸试验区厦门片区Manteia数据科技有限公司 Confidence coefficient analysis method and device based on image segmentation
CN113706475B (en) * 2021-08-06 2023-07-21 福建自贸试验区厦门片区Manteia数据科技有限公司 Confidence analysis method and device based on image segmentation
CN114419322A (en) * 2022-03-30 2022-04-29 飞狐信息技术(天津)有限公司 Image instance segmentation method and device, electronic equipment and storage medium

Similar Documents

Publication Publication Date Title
CN112348828A (en) Example segmentation method and device based on neural network and storage medium
US10943145B2 (en) Image processing methods and apparatus, and electronic devices
US11200424B2 (en) Space-time memory network for locating target object in video content
CN108122234B (en) Convolutional neural network training and video processing method and device and electronic equipment
Fu et al. Uncertainty inspired underwater image enhancement
CN110910391B (en) Video object segmentation method for dual-module neural network structure
CN111402130B (en) Data processing method and data processing device
CN112132847A (en) Model training method, image segmentation method, device, electronic device and medium
CN114511041B (en) Model training method, image processing method, device, equipment and storage medium
Cheng et al. Learning to refine depth for robust stereo estimation
JP2023131117A (en) Joint perception model training, joint perception method, device, and medium
CN114170290A (en) Image processing method and related equipment
CN111476812A (en) Map segmentation method and device, pose estimation method and equipment terminal
CN112418256A (en) Classification, model training and information searching method, system and equipment
CN108764248B (en) Image feature point extraction method and device
CN112270748B (en) Three-dimensional reconstruction method and device based on image
CN111488887B (en) Image processing method and device based on artificial intelligence
CN114170558A (en) Method, system, device, medium and article for video processing
CN111652181B (en) Target tracking method and device and electronic equipment
CN111914809A (en) Target object positioning method, image processing method, device and computer equipment
Wang et al. A multi-scale attentive recurrent network for image dehazing
CN108460768B (en) Video attention object segmentation method and device for hierarchical time domain segmentation
CN112862840B (en) Image segmentation method, device, equipment and medium
CN108701206B (en) System and method for facial alignment
CN110610185B (en) Method, device and equipment for detecting salient object of image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination