CN114494712A - Object extraction method and device - Google Patents

Object extraction method and device

Info

Publication number
CN114494712A
CN114494712A (application CN202011238741.5A)
Authority
CN
China
Prior art keywords
interest point
network
map
feature map
interest
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011238741.5A
Other languages
Chinese (zh)
Inventor
黄伟杰
田文善
康勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Navinfo Co Ltd
Original Assignee
Navinfo Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Navinfo Co Ltd
Priority to CN202011238741.5A
Publication of CN114494712A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G06F 18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/08 Learning methods
    • G06N 3/084 Backpropagation, e.g. using gradient descent

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Image Analysis (AREA)

Abstract

The embodiment of the application provides an object extraction method and device. The method includes the following steps: a first picture including at least one object is input into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map. The first feature map is processed according to a semantic segmentation classification network to obtain a semantic segmentation result map, where the semantic segmentation result map includes a layer of an object mask map and a layer of a background mask map. The first feature map and the layer of the object mask map are processed according to an interest point generation network to obtain at least one interest point. The first feature map and the at least one interest point are processed according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each interest point. The position of each object in the first picture is determined according to the target mask map corresponding to each interest point, so as to extract the at least one object. The embodiment effectively ensures the quality and accuracy of rod-shaped object extraction.

Description

Object extraction method and device
Technical Field
The embodiment of the application relates to computer technologies, and in particular, to an object extraction method and device.
Background
In the process of map making, it is usually necessary to extract rod-shaped objects, such as light poles and telegraph poles, from image information.
At present, when extracting rod-shaped objects from image information, the related art may first detect the bounding box of each single object in the image, then complete semantic segmentation of the foreground and the background by using a segmentation method, and finally combine the bounding box of each single object with the semantic segmentation result to extract each rod-shaped object in the image.
However, because rod-shaped objects are slender, the aspect ratios and sizes of their bounding boxes vary widely, so the accuracy of object extraction by the above bounding box detection method is poor.
Disclosure of Invention
The embodiment of the application provides an object extraction method and device, so as to solve the problem of poor object extraction accuracy.
In a first aspect, an embodiment of the present application provides an object extraction method, including:
inputting a first picture comprising at least one object into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, wherein the first feature map is used for indicating overall features of the first picture;
processing the first feature map according to a semantic segmentation classification network to obtain a semantic segmentation result map, wherein the semantic segmentation result map comprises a layer of an object mask map and a layer of a background mask map;
processing the first feature map and the layer of the object mask map according to an interest point generation network to obtain at least one interest point, wherein one interest point corresponds to one object in the first picture;
processing the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point, wherein the target mask map is used for indicating the position, in the first picture, of the object corresponding to the interest point;
and respectively determining the position of each object in the first picture according to the target mask map corresponding to each interest point, so as to extract the at least one object.
In a possible design, the processing the first feature map and the layer of the object mask map according to the interest point generation network to obtain at least one interest point includes:
inputting the first feature map into an interest point generation network, so that the interest point generation network outputs a second feature map, wherein the second feature map is used for indicating the probability that each pixel point in the first picture is taken as an interest point;
setting, according to the layer of the object mask map and a first preset threshold, the values of non-object pixel points in the second feature map and of pixel points whose probability of being an interest point is smaller than the first preset threshold to 0, to obtain a processed second feature map;
and traversing each pixel point in the processed second feature map, and if the value of the currently traversed pixel point is greater than the values of all other pixel points in a surrounding M × N range, determining the currently traversed pixel point as an interest point, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
In a possible design, the processing the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point includes:
processing the first feature map and the at least one interest point according to the interest point control network to obtain variance features and mean features corresponding to the interest points respectively;
and processing the first feature map, the variance feature and the mean feature according to the example selection network to obtain a target mask map corresponding to each interest point.
In one possible design, the processing the first feature map and the at least one interest point according to the interest point control network to obtain a variance feature and a mean feature corresponding to each interest point respectively includes:
inputting the first feature map into a point of interest control network so that the point of interest control network outputs a third feature map, wherein the third feature map is used for indicating the data distribution feature of the point of interest;
processing, for any first interest point of the N interest points, the third feature map and the first interest point according to a first algorithm to obtain a fourth feature map, wherein the fourth feature map comprises the features at the coordinate of the first interest point in the third feature map;
inputting the fourth feature map into a first fully-connected convolutional layer to obtain a first control value of the first interest point, and inputting the fourth feature map into a second fully-connected convolutional layer to obtain a second control value of the first interest point, wherein the first control value is used for indicating the variance feature of the interest point, and the second control value is used for indicating the mean feature of the interest point.
In a possible design, the processing the first feature map, the variance feature, and the mean feature according to the instance selection network to obtain a target mask map corresponding to each of the at least one interest point includes:
inputting the first feature map into an instance selection network, so that the instance selection network outputs a fifth feature map, wherein the fifth feature map is used for indicating semantic information features of the at least one object in the first picture;
performing a normalization operation on the fifth feature map, multiplying the normalized fifth feature map by the first control value of a first interest point in the N interest points, and adding the second control value of the first interest point, to obtain a sixth feature map (a minimal sketch of this operation follows this design), wherein the data distribution of the sixth feature map is the same as the data distribution of the first interest point;
and inputting the sixth feature map into at least one convolutional layer module to obtain a first target mask map corresponding to the first interest point.
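For illustration only, and not as the claimed implementation: the normalize-multiply-add operation in the above design behaves like a conditionally modulated normalization. The following minimal sketch assumes PyTorch, a (C, H, W) fifth feature map, and per-channel control values; all of these shapes and names are assumptions.

```python
import torch

def modulate_fifth_feature_map(feat5, scale, shift, eps=1e-5):
    # feat5: (C, H, W) fifth feature map
    # scale / shift: (C,) first / second control values of one interest point
    mean = feat5.mean(dim=(1, 2), keepdim=True)
    std = feat5.std(dim=(1, 2), keepdim=True)
    normed = (feat5 - mean) / (std + eps)        # normalization operation
    # multiply by the first control value, add the second control value
    return normed * scale[:, None, None] + shift[:, None, None]
```

The multiplication by the first control value and addition of the second control value is what aligns the data distribution of the sixth feature map with that of the first interest point.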
In a possible design, the determining, according to a target mask map corresponding to each of the at least one interest point, a position of each object in the first picture to extract the at least one object includes:
determining the overlapping area among the target mask maps according to the target mask maps corresponding to the at least one interest point;
deleting a second target mask map to obtain at least one remaining target mask map, wherein the second target mask map is one of a plurality of target mask maps whose overlapping area is larger than a second preset threshold, and one target mask map corresponds to one object (a sketch of this deduplication follows this design);
and respectively determining the position of each object in the first picture according to the remaining at least one target mask map, so as to extract the at least one object.
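A rough sketch of the deduplication in this design follows. The patent does not fix how the overlapping area is measured, so a pixel count is assumed here; a ratio such as IoU would work equally well.

```python
import numpy as np

def deduplicate_masks(masks, second_preset_threshold):
    """Keep one target mask map per object: a mask whose overlap with an
    already-kept mask exceeds the second preset threshold is dropped.
    masks: list of (H, W) boolean arrays."""
    kept = []
    for mask in masks:
        overlaps = (np.logical_and(mask, k).sum() for k in kept)
        if all(o <= second_preset_threshold for o in overlaps):
            kept.append(mask)
    return kept
```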
In one possible design, before the inputting the first picture including the at least one object into the semantic segmentation feature network, the method further includes:
and carrying out scaling processing and normalization processing on the first picture.
In one possible design, the method further includes:
constructing an algorithm network, wherein the algorithm network comprises at least one of the following networks: a semantic segmentation feature network, a semantic segmentation classification network, an interest point generation network, an interest point control network and an instance selection network;
training the algorithm network according to multiple groups of sample data to obtain a trained algorithm network, wherein each group of sample data comprises a sample picture, a semantic tag, an interest point tag and a mask tag, the sample picture comprises at least one sample object, the semantic tag is used for indicating the position of the at least one sample object in the sample picture, the interest point tag is used for indicating the position of a sample interest point corresponding to each sample object in the sample picture, and the mask tag is used for indicating the position of a sample target mask map corresponding to each sample interest point in the sample picture.
In one possible design, the training the algorithm network according to multiple sets of sample data includes:
inputting the sample picture into the algorithm network so that the algorithm network outputs a semantic result graph, an interest point result graph and a mask result graph;
and calculating cross entropy loss functions respectively according to the semantic result map and the semantic label, the interest point result map and the interest point label, and the mask result map and the mask label, and training the algorithm network by using a back propagation algorithm (a schematic training step is sketched after this design).
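A schematic training step matching this design might look as follows, assuming PyTorch and a network that returns the three result maps; the sample keys and the `net` interface are invented placeholders.

```python
import torch.nn.functional as F

def training_step(net, optimizer, sample):
    # The algorithm network outputs a semantic result map, an interest
    # point result map and a mask result map for the sample picture.
    semantic_out, poi_out, mask_out = net(sample["picture"])
    # One cross entropy loss per output/label pair, summed.
    loss = (F.cross_entropy(semantic_out, sample["semantic_label"])
            + F.cross_entropy(poi_out, sample["poi_label"])
            + F.cross_entropy(mask_out, sample["mask_label"]))
    optimizer.zero_grad()
    loss.backward()   # back propagation algorithm
    optimizer.step()
    return loss.item()
```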
In one possible design, the object is a rod-shaped object.
In a second aspect, an embodiment of the present application provides an object extraction apparatus, including:
an input module, configured to input a first picture including at least one object to a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, where the first feature map is used to indicate an overall feature of the first picture;
the processing module is used for processing the first feature map according to a semantic segmentation classification network to obtain a semantic segmentation result map, wherein the semantic segmentation result map comprises a layer of an object mask map and a layer of a background mask map;
the processing module is further configured to process the first feature map and the layer of the object mask map according to an interest point generation network to obtain at least one interest point, wherein one interest point corresponds to one object in the first picture;
the processing module is further configured to process the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point, where the target mask map is used to indicate a position of an object corresponding to the interest point in the first picture;
the processing module is further configured to determine, according to a target mask map corresponding to each of the at least one interest point, a position of each object in the first picture, respectively, so as to extract the at least one object.
In one possible design, the processing module is specifically configured to:
inputting the first feature map into an interest point generation network, so that the interest point generation network outputs a second feature map, wherein the second feature map is used for indicating the probability that each pixel point in the first picture is taken as an interest point;
setting, according to the layer of the object mask map and a first preset threshold, the values of non-object pixel points in the second feature map and of pixel points whose probability of being an interest point is smaller than the first preset threshold to 0, to obtain a processed second feature map;
and traversing each pixel point in the processed second feature map, and if the value of the currently traversed pixel point is greater than the values of all other pixel points in a surrounding M × N range, determining the currently traversed pixel point as an interest point, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
In one possible design, the processing module is specifically configured to:
processing the first feature map and the at least one interest point according to the interest point control network to obtain variance features and mean features corresponding to the interest points respectively;
and processing the first feature map, the variance feature and the mean feature according to the example selection network to obtain a target mask map corresponding to each interest point.
In one possible design, the processing module is specifically configured to:
inputting the first feature map into a point of interest control network so that the point of interest control network outputs a third feature map, wherein the third feature map is used for indicating the data distribution feature of the point of interest;
processing, for any first interest point of the N interest points, the third feature map and the first interest point according to a first algorithm to obtain a fourth feature map, wherein the fourth feature map comprises the features at the coordinate of the first interest point in the third feature map;
inputting the fourth feature map into a first fully-connected convolutional layer to obtain a first control value of the first interest point, and inputting the fourth feature map into a second fully-connected convolutional layer to obtain a second control value of the first interest point, wherein the first control value is used for indicating the variance feature of the interest point, and the second control value is used for indicating the mean feature of the interest point.
In one possible design, the processing module is specifically configured to:
inputting the first feature map into an instance selection network, so that the instance selection network outputs a fifth feature map, wherein the fifth feature map is used for indicating semantic information features of the at least one object in the first picture;
performing a normalization operation on the fifth feature map, multiplying the normalized fifth feature map by the first control value of a first interest point in the N interest points, and adding the second control value of the first interest point, to obtain a sixth feature map, wherein the data distribution of the sixth feature map is the same as the data distribution of the first interest point;
and inputting the sixth feature map into at least one convolutional layer module to obtain a first target mask map corresponding to the first interest point.
In one possible design, the processing module is specifically configured to:
determining the overlapping area among the target mask maps according to the target mask maps corresponding to the at least one interest point;
deleting a second target mask map to obtain at least one remaining target mask map, wherein the second target mask map is one of a plurality of target mask maps whose overlapping area is larger than a second preset threshold, and one target mask map corresponds to one object;
and respectively determining the position of each object in the first picture according to the remaining at least one target mask map, so as to extract the at least one object.
In one possible design, the processing module is further to:
before the first picture comprising at least one object is input into the semantic segmentation feature network, carrying out scaling processing and normalization processing on the first picture.
In one possible design, the apparatus further includes: a training module;
the training module is used for constructing an algorithm network, wherein the algorithm network comprises at least one of the following networks: a semantic segmentation feature network, a semantic segmentation classification network, an interest point generation network, an interest point control network and an instance selection network;
training the algorithm network according to multiple groups of sample data to obtain a trained algorithm network, wherein each group of sample data comprises a sample picture, a semantic tag, an interest point tag and a mask tag, the sample picture comprises at least one sample object, the semantic tag is used for indicating the position of the at least one sample object in the sample picture, the interest point tag is used for indicating the position of a sample interest point corresponding to each sample object in the sample picture, and the mask tag is used for indicating the position of a sample target mask map corresponding to each sample interest point in the sample picture.
In one possible design, the training module is specifically configured to:
inputting the sample picture into the algorithm network so that the algorithm network outputs a semantic result graph, an interest point result graph and a mask result graph;
and calculating cross entropy loss functions respectively according to the semantic result map and the semantic label, the interest point result map and the interest point label, and the mask result map and the mask label, and training the algorithm network by using a back propagation algorithm.
In one possible design, the object is a rod-shaped object.
In a third aspect, an embodiment of the present application provides an object extraction device, including:
a memory for storing a program;
a processor for executing the program stored in the memory, wherein when the program is executed, the processor performs the method according to the first aspect or any possible design of the first aspect.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, comprising instructions which, when executed on a computer, cause the computer to perform the method as described above in the first aspect and any one of the various possible designs of the first aspect.
The embodiment of the application provides an object extraction method and device, wherein the method includes: inputting a first picture including at least one object into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, where the first feature map is used for indicating overall features of the first picture; processing the first feature map according to a semantic segmentation classification network to obtain a semantic segmentation result map, where the semantic segmentation result map includes a layer of an object mask map and a layer of a background mask map; processing the first feature map and the layer of the object mask map according to an interest point generation network to obtain at least one interest point, where one interest point corresponds to one object in the first picture; processing the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each interest point, where the target mask map is used for indicating the position, in the first picture, of the object corresponding to the interest point; and respectively determining the position of each object in the first picture according to the target mask map corresponding to each interest point, so as to extract the at least one object. In this embodiment, at least one interest point is generated, a corresponding mask map is generated for each interest point according to its features, and the position of each object in the first picture can be obtained from each target mask map. Because interest points are not affected by the shape proportion or size of objects and are easily obtained through network training, the poor extraction caused by bounding box detection is effectively avoided, and the quality and accuracy of rod-shaped object extraction are effectively ensured.
Drawings
In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings needed in the description of the embodiments or the prior art are briefly introduced below. Obviously, the drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings from these drawings without inventive effort.
Fig. 1 is a schematic diagram of a network structure of a convolutional neural network provided in an embodiment of the present application;
fig. 2 is a flowchart of an object extraction method provided in an embodiment of the present application;
fig. 3 is a flowchart of an object selection method provided in an embodiment of the present application;
fig. 4 is a schematic view of a first picture provided in the present application;
FIG. 5 is a diagram illustrating a semantic segmentation result graph according to an embodiment of the present disclosure;
FIG. 6 is a schematic diagram illustrating an implementation of determining a point of interest according to an embodiment of the present disclosure;
FIG. 7 is a schematic diagram of a point of interest provided by an embodiment of the present application;
FIG. 8 is a schematic diagram illustrating an implementation of determining a data distribution characteristic of a point of interest according to an embodiment of the present application;
FIG. 9 is a schematic diagram of a point of interest and a corresponding target mask map provided in an embodiment of the present application;
FIG. 10 is a flowchart of training an algorithm network according to an embodiment of the present disclosure;
fig. 11 is a schematic flowchart of an object extraction method according to an embodiment of the present application;
fig. 12 is a schematic structural diagram of an object extraction apparatus according to an embodiment of the present application;
fig. 13 is a schematic hardware structure diagram of an object extraction device according to an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
First, a convolutional neural network referred in the present application is described with reference to fig. 1, where fig. 1 is a schematic network structure diagram of the convolutional neural network provided in the embodiment of the present application:
a convolutional neural network: convolutional Neural Networks (CNN) are a type of feed-forward Neural network that includes convolution calculations and has a deep structure, and are one of the representative algorithms for deep learning. The convolutional neural network has a characterization learning capability, and can perform translation invariant classification on input information according to a hierarchical structure thereof, and is therefore also referred to as a "translation invariant artificial neural network", where a network structure diagram of the CNN is shown in fig. 1, the CNN includes a convolutional layer and a pooling layer, and the convolutional layer and the pooling layer are respectively described below:
convolution (conv) layer: the function of the convolutional layer is to extract features from the input image. The layer contains multiple convolution kernels, and each element of a convolution kernel corresponds to a weight coefficient and a bias, similar to a neuron of a feedforward neural network. Each neuron in the convolutional layer is connected to multiple neurons in a nearby region of the previous layer, and the size of this region depends on the size of the convolution kernel. During operation, the convolution kernel regularly sweeps over the input features, performing element-wise multiplication and summation over the receptive field and superimposing the bias value.
Pooling (pooling) layer: after feature extraction in the convolutional layer, the output feature map is passed to the pooling layer for feature selection and information filtering. The pooling layer contains a preset pooling function, whose purpose is to replace the result at a single point in the feature map with a statistic of its neighboring region. The way the pooling layer selects pooling regions is the same as the way the convolution kernel scans the feature map, controlled by the pooling size, stride, and padding.
Through the processing of the convolutional layer and the pooling layer, an image feature vector of the input image may be obtained, wherein the image feature vector may include at least one channel.
During the processing of a convolutional neural network, convolving a feature map over a certain range can extract a pattern formed by a combination of multiple features into one feature, yielding the next feature map. Continuing to convolve that feature map and combine features produces increasingly complex feature maps. Moreover, because of the pooling layers, the strongest features within a certain range are continuously extracted and the tensor size is reduced, so that feature combinations over a large range can be captured.
In the general sense, channels refer to the color channels of a picture, and a feature map refers to the output of a convolution filter. In practice, channels and feature maps are essentially the same: both represent the distribution of certain features over the preceding input.
For example, when the input picture is a color picture with RGB channels, it includes 3 channels, namely Red, Green, and Blue. Each channel is a detection of a certain feature, and the magnitude of a value in the channel is the response strength of that feature; for example, in the Blue channel with 256 levels, a pixel value of 255 indicates a high degree of blue. The meaning of a value in a feature map is similar: its magnitude is the response strength of the current feature.
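For illustration, a minimal convolution-plus-pooling stack of the kind described above can be written as follows; PyTorch is assumed, and the channel counts and sizes are arbitrary.

```python
import torch
import torch.nn as nn

# Convolution extracts features from the 3-channel (RGB) input; pooling
# replaces each point with a statistic of its neighborhood and shrinks
# the tensor.
cnn = nn.Sequential(
    nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool2d(kernel_size=2, stride=2),   # pooling size 2, stride 2
)

x = torch.randn(1, 3, 224, 224)   # one RGB picture (batch of 1)
features = cnn(x)                 # feature maps of shape (1, 16, 112, 112)
```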
The background art to which the present application relates is explained in detail below:
in the process of map making, it is usually necessary to extract rod-shaped objects, such as light poles and telegraph poles, from image information.
Currently, when extracting rod-shaped objects from image information, the related art has two possible implementations:
in a possible implementation manner, the bounding box of each single target in the image can be detected first, then semantic segmentation of the foreground and the background is completed by a segmentation method, and finally the bounding box of each single target is combined with the foreground-background semantic segmentation result to extract each rod-shaped object in the image.
However, in the above detect-then-segment implementation, the slender shape of rod-shaped objects makes bounding box detection perform poorly. Specifically, bounding box detection currently uses anchor-based detection methods, which require the sizes and aspect ratios of the anchors to be defined in advance; for example, the ratios may be set to 1:1, 1:2, and 1:3 and the sizes to 0.1, 0.2, and 0.3, giving 9 shapes in total. If too few shapes are set, the detection effect is poor; if too many are set, the processing speed is slow. Because of their slender shape, rod-shaped objects have aspect ratios ranging from 1:1 up to 1:10 or even 1:20, and sizes from 0.1 to 1.0, so bounding box detection faces great challenges, which in turn results in low object extraction accuracy.
A detection method that regresses the boundary from the object's center point can also be used for bounding box detection. However, this method relies on the center point lying within the object, and for a rod-shaped object there is a high probability that the center point does not lie on the rod itself. Therefore, the bounding box detection effect is still poor with this method, the recognition of rod-shaped objects is poor, and the accuracy of the extracted rod-shaped objects is low.
In another possible implementation manner, semantic segmentation of the target can be performed on the image to obtain segmentation results of the foreground and the background, and then the target is separated into single instance results by using post-processing manners such as clustering and the like to extract each rod-shaped object in the image.
However, the above segment-then-post-process implementation requires many manual, experience-based parameter adjustments during post-processing, which may result in poor robustness. For example, in one post-processing mode, connected component analysis may be performed empirically on the semantic segmentation result. For disconnected parts, a certain range of vertical or horizontal regions may be connected, based on the experience that rods are generally vertical or horizontal. For joined parts, based on the same experience, the result may be projected in the vertical or horizontal direction to obtain a histogram distribution, and a threshold may be set to separate the joined parts. In practice, however, image data is captured at different shooting angles and under different conditions, and some rods are neither vertical nor horizontal, so parameters that worked well in one scene perform poorly in the next, resulting in poor robustness; the parameters then need to be manually readjusted according to the actual situation.
In view of the problems introduced above, the present application provides an object extraction method that extracts an interest point corresponding to each object through an algorithm network and determines the mask map corresponding to each interest point. The mask map corresponding to each interest point can indicate the position of the corresponding object in the image, so rod-shaped objects are extracted from the image without post-processing operations such as bounding box detection and clustering. By training the algorithm network in advance and obtaining the interest points and their corresponding mask maps from the trained network, the recognition of rod-shaped objects can be effectively ensured.
The object extraction method provided by the present application is described in detail below with reference to specific embodiments. It should be noted that the execution entity of each embodiment in the present application may be any component having a data processing function, such as a server or a processor; this embodiment does not limit the implementation of the execution entity, as long as it can perform data processing.
First, description is made with reference to fig. 2, and fig. 2 is a flowchart of an object extraction method provided in an embodiment of the present application.
As shown in fig. 2, the method includes:
s201, a first picture comprising at least one object is input into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, wherein the first feature map is used for indicating overall features of the first picture.
In this embodiment, the object in the first picture may be extracted, where the object may be, for example, a rod object described above, or in other possible implementations, the object may also be an object with a preset shape, and the preset shape may be, for example, a rectangle, a triangle, and the like.
In one possible implementation, the semantic segmentation feature Network may include, for example, an encoder and a decoder, where the encoder may be, for example, a Residual Network (ResNet) encoder, and the decoder may be, for example, a UNet decoder.
The semantic segmentation feature network processes the first picture, and may output a first feature map corresponding to the first picture, where the first feature map is used to indicate an overall feature of the first picture, and it may be determined based on the above description that a numerical value included in the first feature map may be used to indicate the strength of the feature.
S202, processing the first feature map according to the semantic segmentation classification network to obtain a semantic segmentation result map, wherein the semantic segmentation result map includes a layer of an object mask map and a layer of a background mask map.
The semantic segmentation classification network in this embodiment is configured to perform semantic segmentation processing on the first feature map, so as to implement semantic segmentation of the object and the background, where the semantic segmentation classification network may obtain a semantic segmentation result map after processing the first feature map, and the semantic segmentation result map includes two layers, which are a layer of the object mask map and a layer of the background mask map respectively.
In the layer of the object mask map, the values of the pixel points where an object is located are non-zero, and the values of all other pixel points are 0, so that semantic segmentation of the object and the background can be realized.
Similarly, in the layer of the background mask map, the values of the pixel points where the background is located are non-zero, and the values of all other pixel points are 0.
It can be understood that in image operations involving a mask map, the operation has no effect on pixel points whose value in the mask map is 0.
The object mask map in this embodiment can effectively distinguish the objects from the background in the first picture, but cannot distinguish individual objects from each other. For example, if two rod-shaped objects exist in the first picture, the pixel values at the positions of both objects in the object mask map are non-zero, but the two objects cannot be told apart, so subsequent operations are required to extract each object in the first picture.
S203, processing the first characteristic graph and the object mask graph according to the interest point generating network to obtain at least one interest point, wherein one interest point corresponds to one object in the first picture.
The interest point generation network in this embodiment is used to generate interest points for objects, where an interest point may be any point within an object. In a possible implementation manner, one interest point may be generated for an object, or multiple interest points may be generated for an object, but each interest point corresponds to exactly one object; that is, objects and the interest points generated by the interest point generation network are in a one-to-many relationship in this embodiment.
In a possible implementation manner, the first feature map may be processed according to the interest point generation network to obtain a feature map related to interest points, and then candidate positions may be screened from that feature map according to the object mask map, so as to generate at least one interest point corresponding to each object.
In another possible implementation manner, at least one interest point may also be randomly selected within the range of the object mask map, as long as one interest point corresponds to one object and the interest point is a point within the range of the object mask map.
It should be noted that the generation of interest points is not affected by the shape proportion or size of objects. Therefore, by generating at least one interest point and subsequently extracting objects based on the interest points, this embodiment effectively avoids the poor extraction caused by the unbalanced aspect ratios and sizes of bounding boxes in bounding box detection.
S204, processing the first characteristic graph and the at least one interest point according to the interest point control network and the instance selection network to obtain a target mask graph corresponding to the at least one interest point, wherein the target mask graph is used for indicating the position of an object corresponding to the interest point in the first picture.
In a possible implementation manner, the interest point control network may generate a variance feature and a mean feature of each interest point according to the first feature map.
In a possible implementation manner, taking any one of the at least one interest point as an example, the example selection network may perform processing according to the variance feature and the mean feature of the interest point, so as to obtain a target mask map corresponding to the interest point.
The target mask map is similar to the object mask map described above, with the difference that the target mask map in this embodiment indicates the position, in the first picture, of the single object corresponding to the interest point. That is, compared with the object mask map, the non-zero pixel points in a target mask map are the pixel points of the position of a single object, so the positions of the objects in the first picture can be indicated separately.
S205, respectively determining the position of each object in the first picture according to the target mask image corresponding to each interest point so as to extract at least one object.
Each interest point corresponds to a respective target mask map, and each interest point corresponds to a respective object, so that each target mask map also corresponds to one object.
In a possible implementation manner, an object may correspond to multiple target mask maps; in this case, one of them may be selected as the target mask map of the object. For example, one target mask map may be randomly selected, or, following the generation order of the target mask maps, the one obtained first may be selected as the target mask map of the object.
After the target mask map corresponding to each object is obtained, the position of each object in the first picture may be determined according to the non-zero pixel points in the target mask map, so as to extract the at least one object from the first picture.
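For illustration only, the overall flow of S201 to S205 can be summarized in the following sketch, where every attribute of `nets` is an invented placeholder for the corresponding trained sub-network:

```python
def extract_objects(first_picture, nets):
    # S201: first feature map indicating overall features of the picture
    feat1 = nets.semantic_feature(first_picture)
    # S202: semantic segmentation result map (object layer + background layer)
    object_mask, background_mask = nets.semantic_cls(feat1)
    # S203: one or more interest points, each corresponding to one object
    points = nets.poi_generate(feat1, object_mask)
    # S204: one target mask map per interest point
    masks = []
    for p in points:
        variance_feat, mean_feat = nets.poi_control(feat1, p)
        masks.append(nets.instance_select(feat1, variance_feat, mean_feat))
    # S205: positions of the objects = non-zero pixels of each target mask map
    return [mask.nonzero() for mask in masks]
```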
The object extraction method provided by the embodiment of the application includes the following steps: inputting a first picture including at least one object into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, where the first feature map is used for indicating overall features of the first picture; processing the first feature map according to the semantic segmentation classification network to obtain a semantic segmentation result map, where the semantic segmentation result map includes a layer of an object mask map and a layer of a background mask map; processing the first feature map and the layer of the object mask map according to the interest point generation network to obtain at least one interest point, where one interest point corresponds to one object in the first picture; processing the first feature map and the at least one interest point according to the interest point control network and the instance selection network to obtain a target mask map corresponding to each interest point, where the target mask map is used for indicating the position, in the first picture, of the object corresponding to the interest point; and respectively determining the position of each object in the first picture according to the target mask map corresponding to each interest point, so as to extract the at least one object. In this embodiment, at least one interest point is generated, a corresponding mask map is generated for each interest point according to its features, and the position of each object in the first picture can be obtained from each target mask map. Because interest points are not affected by the shape proportion or size of objects and are easily obtained through network training, the poor extraction caused by bounding box detection is effectively avoided, and the quality and accuracy of rod-shaped object extraction are effectively ensured.
On the basis of the foregoing embodiments, an embodiment of the object selection method provided in the present application is described in further detail below with reference to fig. 3 to fig. 9. Fig. 3 is a flowchart of the object selection method provided in an embodiment of the present application; fig. 4 is a schematic diagram of a first picture provided in an embodiment of the present application; fig. 5 is a schematic diagram of a semantic segmentation result map provided in an embodiment of the present application; fig. 6 is a schematic diagram of an implementation of determining interest points provided in an embodiment of the present application; fig. 7 is a schematic diagram of an interest point provided in an embodiment of the present application; fig. 8 is a schematic diagram of an implementation of determining the data distribution features of interest points provided in an embodiment of the present application; and fig. 9 is a schematic diagram of interest points and corresponding target mask maps provided in an embodiment of the present application.
As shown in fig. 3, the method includes:
s301, scaling and normalizing the first picture.
An implementation of the first picture may be, for example, as shown in fig. 4, where the first picture illustrated in fig. 4 may include two rod objects, respectively the rod objects indicated by 401 and 402.
In this embodiment, before performing the correlation processing on the first picture, the first picture may be preprocessed first.
For the scaling, the first picture may be scaled to a designated size; for the normalization, the values of all pixel points in the first picture may be mapped into the range of 0 to 1 for processing.
If normalization is not performed on the first picture, the scattered distribution of sample features makes neural network learning slow or even difficult. Therefore, performing scaling and normalization on the first picture effectively ensures that the input picture is normalized in subsequent processing, improves processing efficiency, ensures that the classifier treats all features equally, and effectively ensures the accuracy of the output result.
In one possible implementation, the data may be normalized based on the mean and standard deviation of the raw data; for example, the normalization processing may satisfy the following formula one:

x* = (x - μ) / σ (formula one)

where x is the original data before processing, μ is the mean of all data, σ is the standard deviation of all data, and x* is the normalized data.
In another possible implementation manner, normalization processing may be performed according to the maximum and minimum values of the data. This embodiment does not limit the specific implementation of the normalization processing; it may be selected according to actual requirements, as long as the values of the pixel points can be mapped into the range of 0 to 1.
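A minimal preprocessing sketch combining the scaling and both normalization variants discussed above might look as follows; OpenCV and NumPy are assumed, and the designated size of 512 × 512 is an arbitrary choice.

```python
import cv2  # assumed available for the resize step
import numpy as np

def preprocess_first_picture(picture, designated_size=(512, 512)):
    scaled = cv2.resize(picture, designated_size).astype(np.float32)
    # formula one: x* = (x - mu) / sigma  (zero mean, unit deviation)
    standardized = (scaled - scaled.mean()) / (scaled.std() + 1e-8)
    # alternative mentioned above: min-max normalization into [0, 1]
    min_max = (scaled - scaled.min()) / (scaled.max() - scaled.min() + 1e-8)
    return standardized, min_max
```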
S302, a first picture comprising at least one object is input into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, wherein the first feature map is used for indicating overall features of the first picture.
The implementation manner of S302 is similar to that of S201, and is not described here again.
In this embodiment, the first picture input to the semantic segmentation feature network is the first picture after the scaling processing and the normalization processing.
S303, processing the first feature map according to the semantic segmentation classification network to obtain a semantic segmentation result map, wherein the semantic segmentation result map comprises a map layer of an object mask map and a map layer of a background mask map.
The implementation of S303 is similar to that of S202.
In a possible implementation manner, the semantic segmentation classification network in this embodiment may use a 1 × 1 convolution kernel to output the semantic segmentation result map, thereby obtaining the two layers, i.e., the background mask map and the object mask map.
For example, the semantic segmentation result map may be implemented as shown in fig. 5, where the white area in fig. 5 is the layer of the object mask map and the black area is the layer of the background mask map.
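For illustration, such a classification head could be sketched as follows; PyTorch is assumed, and the 64 input channels of the first feature map are an assumption.

```python
import torch.nn as nn

# A 1 x 1 convolution head mapping the first feature map to a two-layer
# semantic segmentation result map: one object mask layer and one
# background mask layer.
semantic_cls = nn.Sequential(
    nn.Conv2d(in_channels=64, out_channels=2, kernel_size=1),
    nn.Softmax(dim=1),  # per-pixel class probabilities over the 2 layers
)
```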
S304, inputting the first feature map into the interest point generation network, so that the interest point generation network outputs a second feature map, wherein the second feature map is used for indicating the probability that each pixel point in the first picture is taken as the interest point.
In this embodiment, the obtained first feature map may be input into the interest point generation network. In a possible implementation manner, the convolutional part of the interest point generation network may be formed by 3 × 3 and 1 × 1 convolution kernels. The interest point generation network processes the first feature map and outputs a second feature map, where the second feature map is used for indicating the probability of each pixel point in the first picture being an interest point.
It can be understood that the values in the second feature map lie in the range of 0 to 1: the larger the value, the higher the probability of the pixel point being an interest point; correspondingly, the smaller the value, the lower the probability.
S305, setting, according to the layer of the object mask map and a first preset threshold, the values of non-object pixel points in the second feature map and of pixel points whose probability of being an interest point is smaller than the first preset threshold to 0, to obtain the processed second feature map.
In this embodiment, at least one interest point needs to be screened from the first picture, where an interest point is a point within an object. The values of non-object pixel points in the second feature map may therefore be set to 0 according to the layer of the object mask map, that is, the values of the pixel points of the background portion are set to 0.
The second feature map in this embodiment indicates the probability of each pixel point being an interest point, so the pixel points may also be filtered according to the first preset threshold: the values of pixel points whose probability of being an interest point is smaller than the first preset threshold are set to 0, thereby filtering out low-probability pixel points and obtaining the processed second feature map.
In the processed second feature map, the values of non-object pixel points and of pixel points whose probability is smaller than the first preset threshold are both 0; the non-zero pixel points are therefore the pixel points that lie on objects and whose probability of being an interest point is not smaller than the first preset threshold. Screening interest points from the processed second feature map thus effectively ensures the validity and accuracy of the obtained interest points.
In a possible implementation manner, the first preset threshold may be set to 0.5, for example. In the actual implementation process, the first preset threshold may be selected according to the requirements on recall or precision. For example, if high recall is required, that is, identifying all rods takes priority and some errors are tolerable, the first preset threshold may be set to 0.4; if high precision is required, that is, the error tolerance is low, the first preset threshold may be set to 0.6. The specific setting of the first preset threshold may be selected according to actual requirements.
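A minimal sketch of the filtering in S305 follows; NumPy is assumed, and the (H, W) array shapes are assumptions.

```python
import numpy as np

def filter_poi_probabilities(prob_map, object_mask, threshold=0.5):
    """Zero out non-object pixels (using the object mask layer) and pixels
    whose interest point probability is below the first preset threshold.
    prob_map and object_mask: (H, W) arrays."""
    processed = prob_map.copy()
    processed[object_mask == 0] = 0          # background pixels -> 0
    processed[processed < threshold] = 0     # low-probability pixels -> 0
    return processed
```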
S306, traversing each pixel point in the processed second feature map, and if the value of the currently traversed pixel point is larger than the values of all other pixel points in the surrounding M × N range, determining the currently traversed pixel point as an interest point, where M is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
Here, one interest point corresponds to one object in the first picture.
In this embodiment, in the processed second feature map, the pixel points corresponding to the positions of objects are non-zero, and the probability of these pixel points being interest points is greater than the preset threshold; at this time, at least one interest point needs to be selected from the non-zero pixel points.
In a possible implementation manner, all pixel points of the processed second feature map can be traversed. For each traversed pixel point, it is judged whether its value is larger than the values of all other pixel points in the surrounding M × N range; if so, the currently traversed pixel point is determined to be an interest point, and if not, it is determined not to be an interest point.
In one possible implementation, the M × N range around the currently traversed pixel may be an M × N range centered on the currently traversed pixel, and values of M and N may be equal and may be odd.
One possible implementation of determining the points of interest is described below with reference to fig. 6, where M and N are both 3.
Referring to fig. 6, which shows a portion of the processed second feature map, assume the currently traversed pixel point is the one shown at 601, with a value of 0.7 and, for example, the coordinate position (5,5). The surrounding 3 × 3 range is the range indicated by 602. The value of pixel point 601 is compared with the value of each other pixel point in range 602, namely the pixel points at (4,4), (4,5), (4,6), (5,4), (5,6), (6,4), (6,5) and (6,6). Since its value is the largest within the surrounding 3 × 3 range, pixel point 601 can be determined to be an interest point. Performing this traversal for every pixel point in fig. 6 yields the plurality of interest points shown in fig. 6, namely the pixel points indicated by the shaded portions.
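A hedged sketch of the M × N local-maximum search in S306 follows, assuming M = N = 3 (as in fig. 6), odd M and N, and a PyTorch tensor produced by the filtering sketch above. Comparing each pixel against a max-pooled copy of the map is one common way to find local maxima; note that, unlike the strict comparison in S306, ties within a window are all kept here.

```python
# Illustrative sketch of S306: regional maximum point selection.
import torch
import torch.nn.functional as F

def select_interest_points(filtered: torch.Tensor, m: int = 3, n: int = 3):
    """Return (y, x) coordinates of non-zero pixels that equal their M x N window maximum."""
    x = filtered.unsqueeze(0).unsqueeze(0)                      # (1, 1, H, W)
    local_max = F.max_pool2d(x, kernel_size=(m, n),
                             stride=1, padding=(m // 2, n // 2))
    keep = (x == local_max) & (x > 0)                           # non-zero window maxima
    ys, xs = keep[0, 0].nonzero(as_tuple=True)
    return list(zip(ys.tolist(), xs.tolist()))
```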
Referring to fig. 7, an obtained interest point is illustrated: the interest point may be the point 701 in fig. 7, which is a point inside the rod-shaped object. In an implementation process, the number and positions of the interest points may be selected according to actual requirements, which is not limited in this embodiment.
S307, inputting the first feature map into the interest point control network, so that the interest point control network outputs a third feature map, wherein the third feature map is used for indicating the data distribution feature of the interest point.
After at least one interest point is obtained, a mask map needs to be generated for each interest point in this embodiment. The data distribution characteristics of each interest point may be learned first, so as to generate the mask map corresponding to each interest point.
The first feature map may be input into the interest point control network, which processes the first feature map and outputs a third feature map indicating the data distribution features of the interest points. The interest point control network may include, for example, 3 × 3 and 1 × 1 convolutional layers, a RoIAlign layer, and fully-connected convolutional layers.
S308, for any first interest point among the N interest points, processing the third feature map and the first interest point according to a first algorithm to obtain a fourth feature map, wherein the fourth feature map includes the features at the coordinates of the first interest point in the third feature map.
In this embodiment, each interest point has its own coordinates, that is, its own (x, y). The following description takes any first interest point of the N interest points as an example; the implementation for the other interest points is similar.
The implementation of obtaining the data distribution features of the interest points may be understood with reference to fig. 8: the first feature map is input into the interest point control network to obtain the third feature map, and then, for the first interest point, the third feature map and the first interest point are processed according to the first algorithm to obtain the fourth feature map.
As can be understood from fig. 8, the fourth feature map is a part of the third feature map and includes the features at the coordinates of the first interest point in the third feature map. For example, if the coordinates of the current first interest point are (x, y), the features along the channel (z) axis at coordinates (x, y) may be extracted according to the coordinates of the first interest point, so as to obtain the fourth feature map. Each interest point corresponds to its own fourth feature map.
In one possible implementation, the first algorithm may be, for example, the RoIAlign algorithm, which performs region feature aggregation.
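A sketch of S308 is given below, assuming the RoIAlign variant from torchvision: the interest point is turned into a tiny box around its (x, y) coordinate so the corresponding features can be pooled from the third feature map. The box half-size `r` and the output size are illustrative choices not specified by the patent.

```python
# Illustrative sketch of S308: extract the fourth feature map at an interest point.
import torch
from torchvision.ops import roi_align

def extract_fourth_feature_map(third_feature_map: torch.Tensor,
                               point_xy: tuple, r: float = 0.5,
                               output_size: int = 1) -> torch.Tensor:
    """third_feature_map: (1, C, H, W); returns (1, C, output_size, output_size)."""
    x, y = point_xy
    # RoIAlign expects rows of (batch_index, x1, y1, x2, y2).
    box = torch.tensor([[0.0, x - r, y - r, x + r, y + r]])
    return roi_align(third_feature_map, box, output_size=output_size, aligned=True)
```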
S309, inputting the fourth feature map into a first fully-connected convolutional layer to obtain a first control value of the first interest point, and inputting the fourth feature map into a second fully-connected convolutional layer to obtain a second control value of the first interest point, wherein the first control value is used for indicating the variance feature of the interest point, and the second control value is used for indicating the mean feature of the interest point.
After the fourth feature map corresponding to the first interest point is obtained, it may be input into the fully-connected convolutional layers to obtain the control values corresponding to the first interest point.
In a possible implementation manner, the first interest point may be mapped to the features at the corresponding position in the third feature map, and the data distribution represented by those features is obtained through learning, so that the variance feature information and the mean feature information are learned respectively. For example, the fourth feature map may be input into the first fully-connected convolutional layer to obtain the first control value of the first interest point, where the first control value indicates the variance feature of the interest point, and the fourth feature map may be input into the second fully-connected convolutional layer to obtain the second control value of the first interest point, where the second control value indicates the mean feature of the interest point.
After the variance feature and the mean feature of the first interest point are obtained, the data distribution features of the first interest point are known. Performing the above operations on each of the at least one interest point yields the variance feature and mean feature corresponding to each interest point.
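The two heads of S309 can be sketched as follows, assuming the pooled fourth feature map is flattened into a vector; the class name, layer sizes, and the use of plain linear layers are illustrative assumptions standing in for the patent's "fully-connected convolutional layers".

```python
# Illustrative sketch of S309: two heads producing the first and second control values.
import torch
import torch.nn as nn

class ControlValueHeads(nn.Module):
    def __init__(self, in_features: int, control_dim: int):
        super().__init__()
        self.variance_head = nn.Linear(in_features, control_dim)  # first control value
        self.mean_head = nn.Linear(in_features, control_dim)      # second control value

    def forward(self, fourth_feature_map: torch.Tensor):
        v = fourth_feature_map.flatten(start_dim=1)               # (N, C)
        return self.variance_head(v), self.mean_head(v)
```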
S310, inputting the first feature map into an instance selection network, so that the instance selection network outputs a fifth feature map, wherein the fifth feature map is used for indicating semantic information features of at least one object in the first picture.
In this embodiment, the convolutional layer portion of the instance selection network may be composed of 3 × 3 and 1 × 1 convolution kernels, and the instance selection network outputs a fifth feature map indicating the semantic information features of the at least one object in the first picture.
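A minimal sketch of such a convolutional portion is shown below; the patent specifies only the 3 × 3 and 1 × 1 kernels, so the channel counts and the activation are assumptions.

```python
# Illustrative sketch of the instance selection network's convolutional part.
import torch.nn as nn

def make_instance_selection_net(in_ch: int = 256, mid_ch: int = 128, out_ch: int = 64):
    return nn.Sequential(
        nn.Conv2d(in_ch, mid_ch, kernel_size=3, padding=1),  # 3 x 3 convolution
        nn.ReLU(inplace=True),
        nn.Conv2d(mid_ch, out_ch, kernel_size=1),            # 1 x 1 convolution
    )
```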
S311, performing a normalization operation on the fifth feature map, multiplying the normalized fifth feature map by the first control value of the first interest point, and adding the second control value of the first interest point to the result, so as to obtain a sixth feature map, wherein the data distribution of the sixth feature map is the same as the data distribution of the first interest point.
Through the above processing, the mean feature and the variance feature corresponding to each interest point are obtained in this embodiment. In order to obtain the mask map corresponding to each interest point, the data distribution of the fifth feature map, which indicates the semantic information features, may be changed into the data distribution characteristic of the corresponding interest point.
In a possible implementation manner, a normalization operation may first be performed on the fifth feature map: the mean is subtracted, and the result is then divided by the variance. By performing this normalization operation, the original data distribution of the fifth feature map is first removed.
Then, the normalized fifth feature map is multiplied by the variance feature of the first interest point, and the mean feature of the first interest point is added to the result of the multiplication, so as to obtain the sixth feature map. This changes the data distribution of the fifth feature map into the data distribution characteristic of the interest point, and the data distribution of the sixth feature map is the same as that of the first interest point.
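A sketch of S311 follows, assuming per-channel whitening followed by a scale (first control value) and shift (second control value), in the spirit of adaptive instance normalization; the division by the standard deviation and the epsilon guard are added assumptions for numerical stability.

```python
# Illustrative sketch of S311: impose an interest point's distribution on the fifth feature map.
import torch

def restyle_fifth_feature_map(fifth: torch.Tensor,
                              variance_ctrl: torch.Tensor,
                              mean_ctrl: torch.Tensor,
                              eps: float = 1e-5) -> torch.Tensor:
    """fifth: (N, C, H, W); variance_ctrl, mean_ctrl: (N, C)."""
    mu = fifth.mean(dim=(2, 3), keepdim=True)
    sigma = fifth.std(dim=(2, 3), keepdim=True)
    normalized = (fifth - mu) / (sigma + eps)          # remove the map's own distribution
    gamma = variance_ctrl.unsqueeze(-1).unsqueeze(-1)  # first control value (scale)
    beta = mean_ctrl.unsqueeze(-1).unsqueeze(-1)       # second control value (shift)
    return normalized * gamma + beta                   # sixth feature map
```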
This embodiment takes the first interest point as an example and obtains the sixth feature map corresponding to it; in an actual implementation, a corresponding sixth feature map is obtained for each interest point.
S312, inputting the sixth feature map into at least one convolutional layer module to obtain a first target mask map corresponding to the first interest point.
Taking the first interest point as an example, after the sixth feature map corresponding to it is obtained, the sixth feature map may be input into the at least one convolutional layer module. Because the data distribution of the sixth feature map is the same as that of the first interest point, and the sixth feature map is derived from the fifth feature map indicating the semantic information features of the at least one object in the first picture, a mask result map corresponding to the first interest point can be generated.
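The convolutional layer modules of S312 can be sketched as a small mask head; the layer count and the sigmoid output are assumptions, since the patent only says "at least one convolutional layer module".

```python
# Illustrative sketch of S312: convolutional modules turning the sixth feature map into a mask.
import torch.nn as nn

def make_mask_head(in_ch: int = 64):
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, kernel_size=3, padding=1),
        nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, 1, kernel_size=1),  # one channel: per-pixel mask probability
        nn.Sigmoid(),
    )
```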
Since each interest point corresponds to a respective object, generating a mask result map for each interest point in effect yields a mask result map for each object.
In a possible implementation manner, the mask result maps corresponding to the interest points may be, for example, as shown in fig. 9. Referring to fig. 9, assume that 5 interest points are currently determined, one interest point per sub-figure; the points shown at 901, 902, 903, 904 and 905 in the 5 sub-figures are the 5 interest points. Each interest point corresponds to its own mask result map, namely the black-and-white image on the left side of each of the 5 sub-figures.
In the mask result map corresponding to an interest point, the white area is the target mask map and the black area is the background mask map. As can be seen from fig. 9, only one object exists in each target mask map, thereby realizing the separation of the objects.
S313, determining the repeated area among the target mask images according to the target mask images corresponding to the at least one interest point.
And S314, deleting the second target mask image to obtain at least one residual target mask image, wherein the second target mask image is one of a plurality of target mask images with the repetition area larger than a second preset threshold, and one target mask image corresponds to one object.
As fig. 9 confirms, multiple interest points may be generated for the same object in this embodiment, so that multiple mask maps are generated for the same object; for example, only 2 rod-shaped objects exist in the first picture in fig. 9, but 5 mask maps are finally generated.
Therefore, in this embodiment, the redundant target mask maps are deleted: among the target mask maps whose repetition area is larger than the preset threshold, only one is retained as the target mask map of the corresponding object.
For example, the repetition area between each pair of target mask maps may be determined. Referring to fig. 9, the repetition area between target mask map 906 and target mask map 907 may be determined first; since no repetition area exists between them, no deletion is performed. Then the repetition area between target mask map 906 and target mask map 909 may be determined; if it is larger than the second preset threshold, target mask map 909 or target mask map 906 may be deleted. Performing the above operations on every target mask map finally yields the target mask maps corresponding to the two rod-shaped objects.
In this embodiment, for a plurality of target mask maps whose repetition area is larger than the preset threshold, which one is deleted is not limited; for example, one may be deleted at random, or the one generated earlier may be deleted.
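A hedged sketch of the deduplication in S313 and S314 is given below, assuming binary (bool) masks and an absolute overlap-area criterion in pixels; since the patent leaves unspecified which duplicate to delete, this sketch simply keeps the first mask seen.

```python
# Illustrative sketch of S313-S314: drop masks whose overlap exceeds the second preset threshold.
import torch

def deduplicate_masks(masks: list, overlap_threshold: float):
    """masks: list of (H, W) bool tensors; returns the retained masks."""
    kept = []
    for mask in masks:
        duplicate = any((mask & k).sum().item() > overlap_threshold for k in kept)
        if not duplicate:
            kept.append(mask)
    return kept
```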
In a possible implementation manner, the finally obtained target mask maps corresponding to the two rod-shaped objects may be two separate target mask maps; alternatively, they may be a merged target mask map that contains the respective target mask of each rod-shaped object, with a correspondence between each target mask and its rod-shaped object, so that the two rod-shaped objects can be distinguished.
And S315, respectively determining the position of each object in the first picture according to the remaining at least one target mask image so as to extract at least one object.
In this embodiment, after the repeated target mask maps are deleted, at least one target mask map remains. The non-zero portion of a target mask map is the position of the corresponding object in the first picture, so that the position of each object in the first picture can be determined; after the positions are determined, the extraction of the at least one object can be implemented.
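A sketch of reading a position off a retained mask in S315 follows; representing the position as the bounding box of the non-zero pixels is an assumption, since the patent only says the non-zero portion gives the position.

```python
# Illustrative sketch of S315: derive an object position from a binary target mask.
import torch

def object_position(mask: torch.Tensor):
    """Return (x_min, y_min, x_max, y_max) of a non-empty (H, W) binary mask."""
    ys, xs = mask.nonzero(as_tuple=True)
    return xs.min().item(), ys.min().item(), xs.max().item(), ys.max().item()
```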
In the object extraction method provided by the embodiment of the application, the scaling and normalization of the first picture effectively ensure the normalization of the input to the algorithm networks, thereby improving processing efficiency. The second feature map is processed through the layer of the object mask map and the first preset threshold, so that background pixel points and low-probability pixel points are screened out, and the pixel point with the maximum value is selected per region from the remaining pixel points to obtain at least one interest point; this effectively ensures the accuracy and validity of the obtained interest points. A mask map is then generated for each interest point, so that at least one mask map corresponding to each object is obtained; repeated mask maps are deleted according to the repetition area between them, so that the objects are distinguished and their positions in the first picture are determined, and each rod-shaped object in the first picture is extracted separately. Because a series of algorithm networks is used throughout this process, the accuracy of the extracted object results can be effectively ensured.
On the basis of the foregoing embodiments, before the first picture is processed by the algorithm networks described above, the algorithm networks may be trained to ensure the accuracy of the results output by each network. An implementation of training the algorithm networks in the present application is described below with reference to a specific embodiment. Fig. 10 is a flowchart of training the algorithm networks provided in an embodiment of the present application.
As shown in fig. 10, the method includes:
S1001, constructing an algorithm network, wherein the algorithm network comprises at least one of the following networks: the semantic segmentation feature network, the semantic segmentation classification network, the interest point generation network, the interest point control network, and the instance selection network.
S1002, training the algorithm network according to the multiple groups of sample data to obtain the trained algorithm network.
In this embodiment, the algorithm network may include the networks described in the above embodiments, and in this embodiment, the algorithm network may be trained by using multiple sets of sample data.
Each group of sample data may include a sample picture, a semantic tag, an interest point tag, and a mask tag, where the sample picture includes at least one sample object, the semantic tag is used to indicate a position of the at least one sample object in the sample picture, the interest point tag is used to indicate a position of a sample interest point corresponding to each sample object in the sample picture, and the mask tag is used to indicate a position of a sample target mask map corresponding to each sample interest point in the sample picture.
In one possible implementation, the semantic label may be a set of coordinate points, with the position of the sample object in the sample picture indicated by that set of coordinate points. The interest point label may be a coordinate point randomly chosen from the set of coordinate points of the semantic label; it can be understood that, since an interest point is a point inside an object, an interest point may be randomly selected from the set of coordinate points corresponding to the object. The mask label may likewise be a set of coordinate points: since the sample target mask map indicates the position of the sample object, the mask label in this embodiment may be the set of coordinate points of the sample object corresponding to the interest point label.
It can be understood that the sample data in this embodiment is determined in advance, so the accuracy of the data in it can be ensured; training the algorithm networks on this sample data therefore ensures the accuracy of their output results.
In one possible implementation, the sample picture may be input to an algorithm network, such that the algorithm network outputs a semantic result map, a point of interest result map, and a mask result map.
The implementation of the algorithm networks outputting the semantic result map, the interest point result map and the mask result map from the sample picture is similar to the processing flow of each algorithm network described in the above embodiments, and is not repeated here.
Then, cross entropy loss functions may be calculated from the semantic result map, the interest point result map and the mask result map against the semantic label, the interest point label and the mask label respectively, and the algorithm networks are trained using the back-propagation algorithm.
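A minimal sketch of one training step is shown below, assuming the networks end in sigmoid activations so that binary cross-entropy (one common instance of a cross entropy loss) applies to each result map, and assuming equal loss weights; all names are illustrative.

```python
# Illustrative sketch of S1002: cross-entropy losses on the three result maps plus back-propagation.
import torch
import torch.nn.functional as F

def training_step(network, optimizer, sample_picture, labels):
    semantic_map, point_map, mask_map = network(sample_picture)
    loss = (F.binary_cross_entropy(semantic_map, labels["semantic"])
            + F.binary_cross_entropy(point_map, labels["points"])
            + F.binary_cross_entropy(mask_map, labels["masks"]))
    optimizer.zero_grad()
    loss.backward()      # back-propagation through all sub-networks
    optimizer.step()
    return loss.item()
```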
According to the object extraction method provided by the embodiment of the application, the algorithm network is trained according to the multiple groups of sample data before the algorithm network processes the first picture, so that the accuracy of the output result of the algorithm network can be effectively ensured, and the accuracy of the object finally extracted from the first picture is further ensured.
On the basis of the foregoing embodiments, the following systematically introduces the implementation flow of the object extraction method provided by the present application with reference to a flowchart. Fig. 11 is a schematic flowchart of the object extraction method provided in an embodiment of the present application.
As shown in fig. 11:
First, the first picture is input into the semantic segmentation feature network for feature extraction; the semantic segmentation feature network may use a ResNet encoder and a UNet decoder, and outputs the first feature map.
Before the first picture is input into the semantic segmentation feature network, the first picture can be subjected to scaling processing and normalization processing.
Then, the first feature map is input into the semantic segmentation classification network, which may use a 1 × 1 convolution kernel to output a semantic segmentation result map containing two layers, namely the layer of the object mask map and the layer of the background mask map, so as to preliminarily realize the semantic segmentation of object and background.
Next, the first feature map may be input into the interest point generation network to generate the second feature map; the convolutional part of the interest point generation network may consist of 3 × 3 and 1 × 1 convolutions, and the second feature map in this embodiment may be a candidate point feature map. Then, according to the layer of the object mask map and the first preset threshold, the values of non-object regions and of regions below the first preset threshold in the second feature map are set to 0 to obtain the processed second feature map. Regional maximum point selection is then performed on the processed second feature map, screening out the point with the maximum value in each region, so as to obtain the N interest points.
The first feature map is also input into the interest point control network to generate the third feature map, which in this embodiment may be a control feature map; the interest point control network may consist of 3 × 3 and 1 × 1 convolutional layers, a RoIAlign layer, fully-connected convolutional layers, and the like.
Each interest point is then traversed: for each one, the third feature map and the coordinates of the interest point are input into the RoIAlign algorithm to extract the corresponding fourth feature map, and the fourth feature map is then input into the fully-connected convolutional layers to compute the first control value and the second control value, where the first control value may represent the variance feature of the interest point and the second control value its mean feature.
The first feature map is also input into the instance selection network to generate the fifth feature map; the convolutional part of the instance selection network may consist of 3 × 3 and 1 × 1 convolutions and the like. After the fifth feature map is normalized, it is multiplied by the first control value and the second control value is added, that is, the data distribution of the fifth feature map is changed into the data distribution learned from the interest point, so as to obtain the sixth feature map. Finally, the sixth feature map is input into a plurality of convolutional layer modules to generate the target mask map corresponding to the interest point.
The operations of obtaining the control values and generating the target mask map are performed for each interest point, so that a target mask map corresponding to each interest point is generated. The repetition area between the target mask maps is then determined, and target mask maps (and their corresponding interest points) whose repetition area exceeds the second preset threshold are discarded, yielding binary-classification mask results. Finally, the retained mask map results of the interest points are combined into the final rod segmentation result, i.e., the final object extraction result.
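To tie the flow of fig. 11 together, the following high-level sketch composes the illustrative helpers from the earlier sketches. The `nets` dictionary, every key, tensor shape, and the object-layer index are assumptions; the snippet is a schematic of the inference flow, not the disclosed implementation.

```python
# Illustrative end-to-end inference sketch of the flow in fig. 11.
import torch

def extract_objects(picture, nets, threshold=0.5, overlap_threshold=100.0):
    """`nets` maps stage names to callables with mutually compatible shapes."""
    first = nets["feature"](picture)               # first feature map
    seg = nets["classify"](first)                  # semantic segmentation result map (2, H, W)
    object_layer = (seg.argmax(dim=0) == 1)        # layer of the object mask map (index assumed)
    candidates = nets["point_gen"](first)          # second feature map (H, W)
    filtered = filter_candidate_map(candidates, object_layer, threshold)
    points = select_interest_points(filtered)      # regional maximum point selection
    third = nets["point_ctrl"](first)              # control feature map (1, C, H, W)
    fifth = nets["instance_sel"](first)            # semantic information features
    masks = []
    for (y, x) in points:
        fourth = extract_fourth_feature_map(third, (x, y))
        var_ctrl, mean_ctrl = nets["ctrl_heads"](fourth)
        sixth = restyle_fifth_feature_map(fifth, var_ctrl, mean_ctrl)
        masks.append(nets["mask_head"](sixth)[0, 0] > 0.5)
    kept = deduplicate_masks(masks, overlap_threshold)
    return [object_position(m) for m in kept]      # one position per extracted object
```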
It can be understood that, for rod-shaped objects, instance segmentation based on detection yields poor results because the aspect ratios and sizes of their bounding boxes are not uniform. Interest points within a rod-shaped object's region, however, are unaffected by the object's shape, proportion or size and are easily obtained through network training. The object extraction method provided by the embodiment of the application therefore generates at least one interest point, generates a corresponding mask map under the control and selection driven by the features of each interest point, and combines the mask maps generated by different interest points to complete the segmentation and extraction of each rod-shaped object, effectively ensuring the quality and accuracy of the extracted rod-shaped objects.
Fig. 12 is a schematic structural diagram of an object extraction apparatus according to an embodiment of the present application. As shown in fig. 12, the apparatus 120 includes: an input module 1201 and a processing module 1202.
An input module 1201, configured to input a first picture including at least one object to a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, where the first feature map is used to indicate an overall feature of the first picture;
a processing module 1202, configured to process the first feature map according to a semantic segmentation classification network to obtain a semantic segmentation result map, where the semantic segmentation result map includes a layer of an object mask map and a layer of a background mask map;
the processing module 1202 is further configured to process the image layers of the first feature map and the object mask map according to a point of interest generation network to obtain at least one point of interest, where one point of interest corresponds to one object in the first picture;
the processing module 1202 is further configured to process the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point, where the target mask map is used to indicate a position of an object corresponding to the interest point in the first picture;
the processing module 1202 is further configured to determine, according to a target mask map corresponding to each of the at least one interest point, a position of each object in the first picture, respectively, so as to extract the at least one object.
In one possible design, the processing module 1202 is specifically configured to:
inputting the first feature map into an interest point generation network, so that the interest point generation network outputs a second feature map, wherein the second feature map is used for indicating the probability that each pixel point in the first picture is taken as an interest point;
setting to 0, according to the layer of the object mask map and the first preset threshold, the values of non-object pixel points in the second feature map and of pixel points whose probability of being an interest point is smaller than the first preset threshold, so as to obtain a processed second feature map;
and traversing each pixel point in the processed second feature map, and if the value of the currently traversed pixel point is greater than the values of the other pixel points in the surrounding range of M × N, determining the currently traversed pixel point as an interest point, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
In one possible design, the processing module 1202 is specifically configured to:
processing the first feature map and the at least one interest point according to the interest point control network to obtain variance features and mean features corresponding to the interest points respectively;
and processing the first feature map, the variance feature and the mean feature according to the instance selection network to obtain a target mask map corresponding to each interest point.
In one possible design, the processing module 1202 is specifically configured to:
inputting the first feature map into a point of interest control network so that the point of interest control network outputs a third feature map, wherein the third feature map is used for indicating the data distribution feature of the point of interest;
processing the third feature map and the first interest point according to a first algorithm aiming at any first interest point in the N interest points to obtain a fourth feature map, wherein the fourth feature map comprises features of the coordinates of the first interest point in the third feature map;
inputting the fourth feature map into a first fully-connected convolutional layer to obtain a first control value of the first interest point, and inputting the fourth feature map into a second fully-connected convolutional layer to obtain a second control value of the first interest point, wherein the first control value is used for indicating the variance feature of the interest point, and the second control value is used for indicating the mean feature of the interest point.
In one possible design, the processing module 1202 is specifically configured to:
inputting the first feature map into an instance selection network, so that the instance selection network outputs a fifth feature map, wherein the fifth feature map is used for indicating semantic information features of the at least one object in the first image;
performing a normalization operation on the fifth feature map, multiplying the normalized fifth feature map by a first control value of a first interest point among the N interest points, and adding a second control value of the first interest point to the result of the multiplication, so as to obtain a sixth feature map, wherein the data distribution of the sixth feature map is the same as the data distribution of the first interest point;
and inputting the sixth feature map into at least one convolutional layer module to obtain a first target mask map corresponding to the first interest point.
In one possible design, the processing module 1202 is specifically configured to:
determining the repeated area among the target mask images according to the target mask images corresponding to the at least one interest point;
deleting a second target mask image to obtain at least one residual target mask image, wherein the second target mask image is one of a plurality of target mask images with the repeat area larger than a second preset threshold, and one target mask image corresponds to one object;
and respectively determining the position of each object in the first picture according to the remaining at least one target mask image so as to extract the at least one object.
In one possible design, the processing module 1202 is further configured to:
before the first picture comprising at least one object is input into the semantic segmentation feature network, carrying out scaling processing and normalization processing on the first picture.
In one possible design, the apparatus further includes: a training module 1203;
the training module 1203 is configured to construct an algorithm network, where the algorithm network includes at least one of the following networks: a semantic segmentation feature network, a semantic segmentation classification network, an interest point generation network, an interest point control network and an instance selection network;
training the algorithm network according to multiple groups of sample data to obtain a trained algorithm network, wherein each group of sample data comprises a sample picture, a semantic tag, an interest point tag and a mask tag, the sample picture comprises at least one sample object, the semantic tag is used for indicating the position of the at least one sample object in the sample picture, the interest point tag is used for indicating the position of a sample interest point corresponding to each sample object in the sample picture, and the mask tag is used for indicating the position of a sample target mask map corresponding to each sample interest point in the sample picture.
In one possible design, the training module 1203 is specifically configured to:
inputting the sample picture into the algorithm network so that the algorithm network outputs a semantic result graph, an interest point result graph and a mask result graph;
and calculating a cross entropy loss function according to the semantic result graph, the interest point result graph and the mask result graph, the semantic label, the interest point label and the mask label respectively, and training the algorithm network by utilizing a back propagation algorithm.
In one possible design, the object is a rod-shaped object.
The apparatus provided in this embodiment may be used to implement the technical solutions of the above method embodiments, and the implementation principles and technical effects are similar, which are not described herein again.
Fig. 13 is a schematic diagram of a hardware structure of an object extraction device according to an embodiment of the present application. As shown in fig. 13, the object extraction device 130 of this embodiment includes: a processor 1301 and a memory 1302, wherein
the memory 1302 is configured to store computer-executable instructions; and
the processor 1301 is configured to execute the computer-executable instructions stored in the memory to implement the steps performed by the object extraction method in the foregoing embodiments; reference may be made in particular to the related description of the foregoing method embodiments.
Alternatively, the memory 1302 may be separate or integrated with the processor 1301.
When the memory 1302 is independently configured, the object extracting apparatus further includes a bus 1303 for connecting the memory 1302 and the processor 1301.
An embodiment of the present application further provides a computer-readable storage medium, in which computer-executable instructions are stored; when a processor executes the computer-executable instructions, the object extraction method performed by the above object extraction device is implemented.
In the several embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described device embodiments are merely illustrative, and for example, the division of the modules is only one logical division, and other divisions may be realized in practice, for example, a plurality of modules may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.
The integrated module implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions for enabling a computer device (which may be a personal computer, a server, or a network device) or a processor (processor) to execute some steps of the methods according to the embodiments of the present application.
It should be understood that the Processor may be a Central Processing Unit (CPU), another general-purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or any conventional processor. The steps of the methods disclosed in connection with the present application may be executed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
The memory may comprise a high-speed RAM memory, and may further comprise a non-volatile storage NVM, such as at least one disk memory, and may also be a usb disk, a removable hard disk, a read-only memory, a magnetic or optical disk, etc.
The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, the buses in the figures of the present application are not limited to only one bus or one type of bus.
The storage medium may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks. A storage media may be any available media that can be accessed by a general purpose or special purpose computer.
Those of ordinary skill in the art will understand that: all or a portion of the steps of implementing the above-described method embodiments may be performed by hardware associated with program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs steps comprising the method embodiments described above; and the aforementioned storage medium includes: various media that can store program codes, such as ROM, RAM, magnetic or optical disks.
Finally, it should be noted that: the above embodiments are only used for illustrating the technical solutions of the present application, and not for limiting the same; although the present application has been described in detail with reference to the foregoing embodiments, it should be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present application.

Claims (13)

1. An object extraction method, comprising:
inputting a first picture comprising at least one object into a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, wherein the first feature map is used for indicating overall features of the first picture;
processing the first characteristic graph according to a semantic segmentation classification network to obtain a semantic segmentation result graph, wherein the semantic segmentation result graph comprises a graph layer of an object mask graph and a graph layer of a background mask graph;
processing the image layers of the first characteristic image and the object mask image according to an interest point generation network to obtain at least one interest point, wherein one interest point corresponds to one object in the first image;
processing the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point, wherein the target mask map is used for indicating the position of an object corresponding to the interest point in the first picture;
and respectively determining the position of each object in the first picture according to a target mask image corresponding to each interest point so as to extract the at least one object.
2. The method according to claim 1, wherein the processing the layers of the first feature map and the object mask map according to a point of interest generation network to obtain at least one point of interest comprises:
inputting the first feature map into an interest point generation network, so that the interest point generation network outputs a second feature map, wherein the second feature map is used for indicating the probability that each pixel point in the first picture is taken as an interest point;
setting to 0, according to the layer of the object mask map and a first preset threshold, the values of non-object pixel points in the second feature map and of pixel points whose probability of being an interest point is smaller than the first preset threshold, so as to obtain a processed second feature map;
and traversing each pixel point in the processed second feature map, and if the value of the currently traversed pixel point is greater than the values of the other pixel points in the surrounding range of M × N, determining the currently traversed pixel point as an interest point, wherein M is an integer greater than or equal to 1, and N is an integer greater than or equal to 1.
3. The method according to claim 2, wherein the processing the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point comprises:
processing the first feature map and the at least one interest point according to the interest point control network to obtain variance features and mean features corresponding to the interest points respectively;
and processing the first feature map, the variance feature and the mean feature according to the instance selection network to obtain a target mask map corresponding to each interest point.
4. The method of claim 3, wherein the processing the first feature map and the at least one interest point according to the interest point control network to obtain a variance feature and a mean feature corresponding to each interest point respectively comprises:
inputting the first feature map into a point of interest control network so that the point of interest control network outputs a third feature map, wherein the third feature map is used for indicating the data distribution feature of the point of interest;
processing the third feature map and the first interest point according to a first algorithm aiming at any first interest point in the N interest points to obtain a fourth feature map, wherein the fourth feature map comprises features of the coordinates of the first interest point in the third feature map;
inputting the fourth feature map into a first fully-connected convolutional layer to obtain a first control value of the first interest point, and inputting the fourth feature map into a second fully-connected convolutional layer to obtain a second control value of the first interest point, wherein the first control value is used for indicating the variance feature of the interest point, and the second control value is used for indicating the mean feature of the interest point.
5. The method according to claim 4, wherein the processing the first feature map, the variance feature, and the mean feature by the instance selection network to obtain a target mask map corresponding to each of the at least one interest point comprises:
inputting the first feature map into an instance selection network, so that the instance selection network outputs a fifth feature map, wherein the fifth feature map is used for indicating semantic information features of the at least one object in the first image;
performing a normalization operation on the fifth feature map, multiplying the normalized fifth feature map by a first control value of a first interest point among the N interest points, and adding a second control value of the first interest point to the result of the multiplication, so as to obtain a sixth feature map, wherein the data distribution of the sixth feature map is the same as the data distribution of the first interest point;
and inputting the sixth feature map into at least one convolutional layer module to obtain a first target mask map corresponding to the first interest point.
6. The method according to any one of claims 1 to 5, wherein the determining the position of each object in the first picture according to the target mask map corresponding to each of the at least one interest point to extract the at least one object comprises:
determining the repeated area among the target mask images according to the target mask images corresponding to the at least one interest point;
deleting a second target mask image to obtain at least one residual target mask image, wherein the second target mask image is one of a plurality of target mask images with the repeat area larger than a second preset threshold, and one target mask image corresponds to one object;
and respectively determining the position of each object in the first picture according to the remaining at least one target mask image so as to extract the at least one object.
7. The method of claim 1, wherein prior to inputting the first picture comprising the at least one object into the semantic segmentation feature network, the method further comprises:
and carrying out scaling processing and normalization processing on the first picture.
8. The method of claim 1, further comprising:
constructing an algorithm network, wherein the algorithm network comprises at least one of the following networks: a semantic segmentation feature network, a semantic segmentation classification network, an interest point generation network, an interest point control network and an instance selection network;
training the algorithm network according to multiple groups of sample data to obtain a trained algorithm network, wherein each group of sample data comprises a sample picture, a semantic tag, an interest point tag and a mask tag, the sample picture comprises at least one sample object, the semantic tag is used for indicating the position of the at least one sample object in the sample picture, the interest point tag is used for indicating the position of a sample interest point corresponding to each sample object in the sample picture, and the mask tag is used for indicating the position of a sample target mask map corresponding to each sample interest point in the sample picture.
9. The method of claim 8, wherein training the algorithm network according to multiple sets of sample data comprises:
inputting the sample picture into the algorithm network so that the algorithm network outputs a semantic result graph, an interest point result graph and a mask result graph;
and calculating a cross entropy loss function according to the semantic result graph, the interest point result graph and the mask result graph, the semantic label, the interest point label and the mask label respectively, and training the algorithm network by utilizing a back propagation algorithm.
10. The method of any one of claims 1-7, wherein the object is a rod-shaped object.
11. An object extraction apparatus, characterized by comprising:
an input module, configured to input a first picture including at least one object to a semantic segmentation feature network, so that the semantic segmentation feature network outputs a first feature map, where the first feature map is used to indicate an overall feature of the first picture;
the processing module is used for processing the first feature map according to a semantic segmentation classification network to obtain a semantic segmentation result map, wherein the semantic segmentation result map comprises a map layer of an object mask map and a map layer of a background mask map;
the processing module is further configured to process the image layers of the first feature map and the object mask map according to a point-of-interest generation network to obtain at least one point-of-interest, where one point-of-interest corresponds to one object in the first image;
the processing module is further configured to process the first feature map and the at least one interest point according to an interest point control network and an instance selection network to obtain a target mask map corresponding to each of the at least one interest point, where the target mask map is used to indicate a position of an object corresponding to the interest point in the first picture;
the processing module is further configured to determine, according to a target mask map corresponding to each of the at least one interest point, a position of each object in the first picture, respectively, so as to extract the at least one object.
12. An object extraction device, characterized by comprising:
a memory for storing a program;
a processor for executing the program stored by the memory, the processor being configured to perform the method of any of claims 1 to 10 when the program is executed.
13. A computer-readable storage medium comprising instructions which, when executed on a computer, cause the computer to perform the method of any one of claims 1 to 10.
CN202011238741.5A 2020-11-09 2020-11-09 Object extraction method and device Pending CN114494712A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011238741.5A CN114494712A (en) 2020-11-09 2020-11-09 Object extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011238741.5A CN114494712A (en) 2020-11-09 2020-11-09 Object extraction method and device

Publications (1)

Publication Number Publication Date
CN114494712A true CN114494712A (en) 2022-05-13

Family

ID=81489752

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011238741.5A Pending CN114494712A (en) 2020-11-09 2020-11-09 Object extraction method and device

Country Status (1)

Country Link
CN (1) CN114494712A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115294552A (en) * 2022-08-08 2022-11-04 腾讯科技(深圳)有限公司 Rod-shaped object identification method, device, equipment and storage medium
CN115922738A (en) * 2023-03-09 2023-04-07 季华实验室 Electronic component grabbing method, device, equipment and medium in stacking scene



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination