CN115223035A - Image detection method, device, equipment and storage medium


Info

Publication number
CN115223035A
Authority
CN
China
Prior art keywords
image
target object
feature
detected
target
Legal status
Pending
Application number
CN202110426584.9A
Other languages
Chinese (zh)
Inventor
周强
于超辉
Current Assignee
Alibaba Innovation Co
Original Assignee
Alibaba Singapore Holdings Pte Ltd
Application filed by Alibaba Singapore Holdings Pte Ltd
Priority to CN202110426584.9A
Publication of CN115223035A

Landscapes

  • Image Analysis (AREA)

Abstract

The embodiments of the invention provide an image detection method, apparatus, device, and storage medium. The method comprises: acquiring an image to be detected; performing instance segmentation on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance; and marking the target object instance in the image to be detected according to the mask image. The image segmentation model generates the mask image based on target feature points whose importance degrees with respect to the target object instances in the image to be detected meet a set requirement. For each target object instance contained in the image to be detected, only the target feature points that best represent that instance, namely those whose importance degrees meet the requirement, are selected from the feature points contained in the feature map, and segmentation of the instance is completed directly on the basis of those target feature points. This facilitates accurate and efficient segmentation of target object instances.

Description

Image detection method, device, equipment and storage medium
Technical Field
The present invention relates to the field of image processing technologies, and in particular, to an image detection method, an image detection apparatus, an image detection device, and a storage medium.
Background
Instance segmentation is a fundamental task in computer vision that combines target detection and semantic segmentation: it aims to predict the location of every target object instance of each category contained in an image. For example, if the target category is person, every person in the image needs to be detected and distinguished from the others.
In practical applications, the target object instances to be segmented may have regular or irregular shapes depending on the scene. When the instances have irregular shapes, conventional instance segmentation schemes are often inaccurate and inefficient.
Disclosure of Invention
Embodiments of the present invention provide an image detection method, apparatus, device, and storage medium that enable accurate and efficient detection of target object instances of arbitrary shape in an image.
In a first aspect, an embodiment of the present invention provides an image detection method, where the method includes:
acquiring an image to be detected;
carrying out instance segmentation processing on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance;
marking the target object instance in the image to be detected according to the mask image;
the image segmentation model generates the mask image based on target feature points whose importance degrees with respect to the target object instance in the image to be detected meet a set requirement.
In a second aspect, an embodiment of the present invention provides an image detection apparatus, including:
the acquisition module is used for acquiring an image to be detected;
and the segmentation module is used for carrying out instance segmentation processing on the image to be detected through an image segmentation model so as to obtain a mask image corresponding to a target object instance, and marking the target object instance in the image to be detected according to the mask image.
In a third aspect, an embodiment of the present invention provides an electronic device, including: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to implement at least the image detection method as described in the first aspect.
In a fourth aspect, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to implement at least the image detection method according to the first aspect.
In a fifth aspect, an embodiment of the present invention provides an image detection method, where the method includes:
receiving a request for calling an image detection service interface by user equipment, wherein the request comprises an image to be detected;
executing the following steps by utilizing the processing resource corresponding to the image detection service interface:
carrying out instance segmentation processing on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance;
marking the target object instance in the image to be detected according to the mask image;
sending the image to be detected with the marking result to the user equipment;
the image segmentation model generates the mask image based on target feature points whose importance degrees with respect to the target object instance in the image to be detected meet a set requirement.
In a sixth aspect, an embodiment of the present invention provides an image detection method, where the method includes:
acquiring a remote sensing image;
carrying out instance segmentation processing on the remote sensing image through an image segmentation model to obtain a mask image corresponding to a building;
marking the building in the remote sensing image according to the mask image;
the image segmentation model generates the mask image based on target feature points whose importance degrees with respect to the buildings in the remote sensing image meet a set requirement.
In the image detection scheme provided by the embodiments of the present invention, when target object instances (for example, each person in an image) need to be detected, instance segmentation is performed on the image to be detected through a preset image segmentation model to obtain a mask image corresponding to each target object instance, and each instance is then marked at its corresponding position in the image to be detected according to its mask image. The image segmentation model first extracts features from the image to be detected to obtain a feature map containing a plurality of feature points, then determines, from the feature map, the target feature points whose importance degrees with respect to a target object instance meet the set requirement, and generates the mask image of that instance based on the model parameters corresponding to those target feature points.
In the above scheme, for each target object instance contained in the image to be detected, only the target feature points that best represent the instance, namely those whose importance degrees meet the requirement, are selected from the feature points contained in the feature map, and the mask image of the instance is generated directly from the model parameters corresponding to those points, which is simple and efficient.
The conventional scheme first detects a rectangular candidate region (a rectangular region that may contain the target object instance) and then predicts the mask image; such rectangular candidate regions suit regularly shaped instances but are unfriendly to irregularly shaped ones. The scheme provided by the embodiments of the present invention needs no rectangular candidate regions: it directly selects the target feature points whose importance degrees with respect to the target object instance meet the requirement and accurately segments the instance using the model parameters corresponding to those points, which is friendlier to irregularly shaped target object instances.
Drawings
To illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings required for describing the embodiments are briefly introduced below. The drawings described below show some embodiments of the present invention; a person skilled in the art can derive other drawings from them without creative effort.
FIG. 1 is a diagram illustrating an example segmentation result according to an embodiment of the present invention;
FIG. 2 is a flowchart of an image detection method according to an embodiment of the present invention;
FIG. 3 is a flowchart of a mask image generation method according to an embodiment of the present invention;
FIG. 4 is a flowchart of another mask image generation method according to an embodiment of the present invention;
FIG. 5 is a schematic diagram illustrating an image segmentation model according to an embodiment of the present invention;
FIG. 6 is a flow chart of a model training process provided by an embodiment of the present invention;
FIG. 7 is a flowchart of another image detection method according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of a remote sensing image cutting effect provided by an embodiment of the invention;
FIG. 9 is a schematic diagram illustrating an application of an image detection method according to an embodiment of the present invention;
FIG. 10 is a flow chart of another image detection method according to an embodiment of the present invention;
FIG. 11 is a flowchart of another image detection method according to an embodiment of the present invention;
FIG. 12 is a flow chart of another image detection method according to an embodiment of the present invention;
FIG. 13 is a flow chart of another image detection method according to an embodiment of the present invention;
FIG. 14 is a flow chart of another image detection method according to an embodiment of the present invention;
FIG. 15 is a schematic structural diagram of an image detection apparatus according to an embodiment of the present invention;
FIG. 16 is a schematic structural diagram of an electronic device corresponding to the image detection apparatus provided in the embodiment shown in FIG. 15.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. The described embodiments are some, but not all, of the embodiments of the present invention. All other embodiments obtained by a person skilled in the art based on these embodiments without creative effort fall within the protection scope of the present invention.
In addition, the sequence of steps in the embodiments of the methods described below is merely an example, and is not strictly limited.
The image detection method provided by the embodiments of the present invention can be executed by an electronic device, which may be a terminal device such as a PC, a notebook computer, or a smartphone, or may be a server. The server may be a physical server with an independent host, a virtual server, a cloud server, or a server cluster.
The image detection method provided by the embodiments of the present invention mainly aims to perform instance segmentation on an image, that is, to detect the target object instances contained in the image.
In conventional schemes, neural network models such as Mask R-CNN can be used for the instance segmentation task. These models detect regularly shaped object instances well, but often perform poorly when the instances to be detected have irregular shapes. Their approach is to extract a number of rectangular candidate regions for a target object instance, select a better candidate among them with a Non-Maximum Suppression (NMS) algorithm, and complete the segmentation based on the finally selected candidate. This is time-consuming and computationally heavy, and the extracted rectangular candidates can hardly cover an irregular target object instance effectively and accurately, so the final detection accuracy is poor.
Herein, a target object instance refers to each individual object of the same target category. For example, if the category is person, each person contained in the image is an instance of that category; if the category is building, each building contained in the image is an instance.
For ease of understanding, the description is exemplified in conjunction with fig. 1.
In fig. 1, assume the original image to be detected contains two buildings, building A and building B. Instance segmentation of the image then yields the two mask images illustrated in fig. 1, one corresponding to building A and one to building B. Taking the mask image of building A as an example, the pixel value of every pixel inside the region where building A is located is 1 (white), and the pixel value of every pixel outside that region is 0 (black). The position of building A can then be marked in the image to be detected based on its mask image, and likewise for building B. Concretely, each pixel whose value is 1 in the mask image of building A is located in the image to be detected, and those pixels constitute the position of building A.
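As an illustration of that last step, the following sketch (Python with OpenCV and NumPy; the file names and the overlay color are assumptions for illustration, not part of the patented method) locates the pixels whose mask value is 1 and tints them in the image to be detected:

```python
# Illustrative sketch only; file names and overlay color are assumptions.
import cv2
import numpy as np

image  = cv2.imread("to_detect.png")                               # image to be detected
mask_a = cv2.imread("mask_building_a.png", cv2.IMREAD_GRAYSCALE)   # 0/255 mask

inside = mask_a > 127      # pixels with mask value 1: the position of building A

marked = image.copy()      # mark building A by blending red over its pixel region
marked[inside] = (0.5 * marked[inside] + 0.5 * np.array([0, 0, 255])).astype(np.uint8)
cv2.imwrite("marked.png", marked)
```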
The following describes an exemplary implementation of the image detection method provided herein with reference to the following embodiments.
Fig. 2 is a flowchart of an image detection method according to an embodiment of the present invention, and as shown in fig. 2, the method includes the following steps:
201. Acquire an image to be detected.
202. Perform instance segmentation on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance, where the image segmentation model generates the mask image based on target feature points whose importance degrees with respect to the target object instance in the image to be detected meet a set requirement.
203. Mark the target object instance in the image to be detected according to the mask image.
In this embodiment, the image to be detected is an image on which target object instance detection needs to be performed. It may contain one target object instance, several, or none. When it contains at least one target object instance, a mask image corresponding to each instance is output through the image segmentation model, and each instance is finally marked in the image to be detected based on its mask image.
Images to be detected differ across application scenarios, and so do the target object instances to be detected. For example, in an unmanned aerial vehicle inspection scenario, the image to be detected may be an image shot by the drone, and the target object instances may be the various buildings contained in the image. In an automatic driving scenario, the image to be detected may be an image captured by a camera on a vehicle, and the target object instances may be the persons and vehicles contained in the image.
The image segmentation model is a model for implementing an image segmentation task in the embodiment of the present invention, and a specific working process of the model will be described in detail in the following embodiments.
The marking of the target object instance in the image to be detected may be only marking a corresponding position (e.g., a contour) of the target object instance in the image to be detected, or may further mark related attribute information of the target object instance, such as: category name, coverage area, etc.
In the embodiments of the present invention, the image segmentation model generates the mask image of a target object instance based on target feature points whose importance degrees with respect to that instance in the image to be detected meet the set requirement. In brief, the feature map corresponding to the image to be detected contains a plurality of feature points; from these, the target feature points whose importance degrees with respect to the target object instance meet the set requirement are determined, and the mask image of the instance is then generated based on the model parameters corresponding to those target feature points. The process of generating the mask image is described in detail in the following embodiments.
Fig. 3 is a flowchart of a mask image generation method according to an embodiment of the present invention. As shown in fig. 3, the method includes the following steps:
301. Perform feature extraction on the image to be detected to obtain a feature map.
302. Determine target feature points in the feature map and the model parameters corresponding to them, where the target feature points are feature points that correspond to the target object instance and whose importance degrees meet the set requirement.
303. Generate a mask image corresponding to the target object instance according to the feature map and the model parameters corresponding to the target feature points.
To complete the detection of target object instances, the image segmentation model first performs feature extraction on the image to be detected to obtain a corresponding feature map. The feature map contains a plurality of feature points; if its spatial resolution is H × W, it contains H × W feature points. The feature points have a mapping relationship with the pixels of the image to be detected.
In the field of image processing, a neural network model is commonly used to extract image features. In this embodiment, at least one convolutional layer in the image segmentation model may be used to extract the features of the image to be detected, although the extraction is not limited to this structure.
When the main structure of the image segmentation model consists of convolutional layers, the model parameters in the embodiments of the present invention are convolution kernel parameters. For convenience, convolution kernel parameters are used as the example in the following description.
Optionally, to obtain richer image features at different scales, feature extraction at multiple scales may be performed on the image to be detected to obtain multiple feature maps, one per scale. In practice, extraction at these scales can be implemented with multiple convolutional layers.
The network model used for feature extraction is called the backbone network model; in practice it may be, for example, VGG16, VGG19, GoogLeNet, ResNet50, or ResNet101.
Lower convolutional layers have smaller receptive fields: their feature maps have higher spatial resolution and less semantic information, but more detail (such as edge features). Higher layers have larger receptive fields: their feature maps have lower spatial resolution and rich semantic information, but less detail.
After the N feature maps (N ≥ 1) corresponding to the image to be detected are obtained, two things happen. On one hand, the target feature points whose importance degrees with respect to the target object instance meet the set requirement are determined from the feature points contained in the feature maps. On the other hand, the convolution kernel parameters of each feature map are predicted, that is, a set of convolution kernel parameters is predicted for every feature point of every feature map, one set per feature point. In this way, once a target feature point is determined, its convolution kernel parameters can be looked up among the parameters predicted for the feature map to which it belongs.
For example, if N = 1, that is, only one feature map is extracted, and it contains K1 feature points, the feature map may be input to a convolution kernel parameter prediction model (a part of the image segmentation model), which outputs the convolution kernel parameters corresponding to the K1 feature points. After the target feature point is determined among the K1 feature points, its convolution kernel parameters are selected from the K1 sets of parameters.
If N = 3, that is, three feature maps are extracted, containing K1, K2, and K3 feature points respectively, the three feature maps are input into three corresponding convolution kernel parameter prediction models, which output the convolution kernel parameters for the K1, K2, and K3 feature points respectively. After the target feature point is determined among the K1 + K2 + K3 feature points, if it belongs to the K2 feature points, its convolution kernel parameters are selected from the corresponding K2 sets of parameters. The structure of the convolution kernel parameter prediction model is exemplified in subsequent embodiments.
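As a minimal sketch of such a convolution kernel parameter prediction model (assuming, for illustration, a 256-channel feature map and a 169-dimensional parameter vector per feature point; these sizes are not fixed by the description above), a single convolutional layer can predict a parameter vector at every feature point, from which the vector of a chosen target feature point is read out:

```python
# Sketch of per-feature-point kernel parameter prediction; sizes are assumptions.
import torch
import torch.nn as nn

in_channels = 256     # channels of the feature map Pi
num_params  = 169     # length of the parameter vector predicted per feature point

controller  = nn.Conv2d(in_channels, num_params, kernel_size=3, padding=1)

feature_map = torch.randn(1, in_channels, 32, 32)   # Pi with K = 32*32 points
all_params  = controller(feature_map)               # (1, num_params, 32, 32)

y, x = 17, 5                                        # hypothetical target feature point
target_params = all_params[0, :, y, x]              # its 169 kernel parameters
print(target_params.shape)                          # torch.Size([169])
```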
The meaning of the target feature point in this embodiment is explained below. As described above, a target feature point is a feature point in the feature map whose importance degree with respect to the target object instance meets the set requirement. When multiple feature maps are extracted, the target feature points are those, among all feature points of all the feature maps, whose importance degrees meet the requirement.
Two points need explaining. First, saying that a target feature point corresponds to the target object instance means the feature points are first classified into two categories by whether they correspond to the instance; a feature point that does not correspond to the target object instance cannot be a target feature point. A feature point corresponds to the instance when, mapped back to the image to be detected, its pixel lies inside the image region where the target object instance is located. Second, among the feature points corresponding to the target object instance, only those whose importance degrees meet the set requirement can become target feature points.
The manner in which the importance of a feature point is defined may be various.
For example, the feature points corresponding to the target object instance may be mapped back to the image to be detected to determine their corresponding pixels, a "central pixel" may be determined from the positions of those pixels, and the feature point corresponding to the central pixel may be regarded as a feature point whose importance degree meets the requirement. The central pixel may be a cluster center obtained by clustering the mapped pixels; in practice there may be more than one, for example when the pixel distribution presents two centers, two central pixels are determined. In fact, the number of central pixels, that is, the number of target feature points, usually equals the number of target object instances in the image to be detected.
As another example, an "important feature point" (or important positive sample) may be determined among the feature points corresponding to the target object instance; such a point is a feature point whose importance degree meets the requirement. The important feature point is the one further determined, among the feature points corresponding to the instance, to best represent the target object instance; it serves as the positive sample while the other feature points serve as negative samples. In other words, the feature points corresponding to the target object instance undergo a further binary classification: is this the feature point that best represents the instance, that is, more important than the others? This method is described in detail in the following embodiments.
After the target feature points and the convolution kernel parameters corresponding to the target feature points are obtained, the mask image corresponding to the target object instance can be generated according to the extracted feature map and the convolution kernel parameters corresponding to the target feature points.
It should be noted that the number of target feature points reflects the number of target object instances contained in the image to be detected. When there are multiple target feature points, the mask image of each corresponding instance is generated in turn from that point's convolution kernel parameters and the feature map. In addition, when multiple feature maps are extracted, the feature map used for generating the mask images may be the one whose scale meets a set requirement, such as the map with the largest scale (that is, the highest spatial resolution).
Suppose the feature map whose scale meets the set requirement is denoted X, that there are two target feature points, and that their convolution kernel parameters are denoted W1 and W2. Then the mask image of one target object instance is obtained from W1 and X, and the mask image of the other instance from W2 and X. Specifically, after some preprocessing of the feature map X, a convolution operation is performed with the convolution kernel parameters of a target feature point to obtain the corresponding mask image; the preprocessing is exemplified in a subsequent embodiment.
As this shows, the convolution kernel parameters used to generate the mask images change dynamically from one image to the next, because the target object instances they contain differ. Segmenting target object instances by this dynamic convolution abandons the scheme of first detecting rectangular candidate regions and then predicting mask images, which is unfriendly to irregular shapes, and is therefore friendlier to irregularly shaped target object instances. Moreover, for each target object instance in the image, only the target feature points that best represent it, namely those whose importance degrees meet the requirement, are selected from the feature points of the feature map, and the mask image is generated directly from their convolution kernel parameters without further post-processing, which helps detect target object instances accurately and improves processing efficiency.
Dynamic convolution (dynamic conv) means that the convolution kernel parameters finally used are not fixed but are generated online and depend on the input image to be detected.
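A minimal sketch of such a dynamic convolution follows. The three-layer 1x1 structure and the channel sizes (10 -> 8 -> 8 -> 1, giving 169 parameters, matching the sketch above) follow common dynamic-convolution designs such as CondInst and are assumptions here, not the patent's prescribed layout:

```python
# Dynamic convolution sketch: reshape a target point's predicted parameters
# into small conv kernels and apply them to the preprocessed feature map X.
import torch
import torch.nn.functional as F

def dynamic_mask_head(X, params):
    """X: (1, 10, H, W) preprocessed feature map; params: flat vector of 169."""
    sizes = [(8, 10), (8, 8), (1, 8)]      # (out_ch, in_ch) of three 1x1 convs
    out, i = X, 0
    for n, (co, ci) in enumerate(sizes):
        w = params[i:i + co * ci].reshape(co, ci, 1, 1); i += co * ci
        b = params[i:i + co];                            i += co
        out = F.conv2d(out, w, b)
        if n < len(sizes) - 1:
            out = F.relu(out)
    return out.sigmoid()                   # (1, 1, H, W) mask image

X  = torch.randn(1, 10, 128, 128)          # e.g. the feature map P3' described below
W1 = torch.randn(169)                      # parameters of one target feature point
print(dynamic_mask_head(X, W1).shape)      # one instance's mask image
```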
Fig. 4 is a flowchart of another mask image generation method according to an embodiment of the present invention, and as shown in fig. 4, the method includes the following steps:
401. Perform feature extraction at multiple scales on the image to be detected to obtain multiple feature maps, one per scale.
402. Perform classification prediction on each feature map to determine the first-category confidence and second-category confidence of every feature point, where the first category indicates whether a feature point corresponds to the target object instance and the second category indicates whether it is important among the feature points corresponding to the instance.
403. Determine a first total confidence for each feature point from its first-category and second-category confidences, and take the feature points whose first total confidence exceeds a set threshold as the target feature points.
404. Perform convolution kernel parameter prediction on each feature map to determine the convolution kernel parameters of every feature point, and look up the parameters of each target feature point among those predicted for the feature map it belongs to.
405. Generate the mask image corresponding to the target object instance from the feature map whose scale meets the set requirement and the convolution kernel parameters of the target feature point.
For ease of understanding, the implementation of the scheme provided in this embodiment is described with reference to the image segmentation model architecture shown in fig. 5.
As shown in fig. 5, the system includes a feature extraction network model for extracting feature maps at multiple scales. Optionally, it may be implemented as a convolutional network plus a Feature Pyramid Network (FPN), with the convolutional network serving as the backbone. The convolutional network comprises several convolutional layers; after the image to be detected is input, the feature maps of the three scales C3, C4, and C5 shown in the figure are extracted in turn. If the convolutional network is regarded as consisting of five convolutional layers, the two feature maps C1 and C2 extracted by the two lower layers are discarded because their spatial resolution is too high, and only the feature maps C3, C4, and C5 extracted by the three upper layers are kept. Multi-scale feature fusion is then performed by the FPN, producing the feature maps of five scales P3 to P7 illustrated in fig. 5 (these scales are merely an example, not a limitation). The FPN itself can be implemented per the related art and is not described here. The spatial resolutions of P3 to P7 decrease in order.
The multi-scale feature maps referred to in step 401 may be the feature maps P3 to P7 illustrated in fig. 5.
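The following toy sketch (plain PyTorch; the stage structure and channel count are simplifications for illustration, not the patent's exact network) shows how backbone outputs standing in for C3 to C5 are fused top-down into P3 to P5 and extended with coarser levels P6 and P7:

```python
# Toy backbone + FPN sketch producing five feature maps of decreasing resolution.
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyFPN(nn.Module):
    def __init__(self, ch=256):
        super().__init__()
        self.c3 = nn.Conv2d(3,  ch, 3, stride=8, padding=1)  # stands in for lower stages
        self.c4 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.c5 = nn.Conv2d(ch, ch, 3, stride=2, padding=1)
        self.lat3, self.lat4, self.lat5 = (nn.Conv2d(ch, ch, 1) for _ in range(3))

    def forward(self, x):
        c3 = self.c3(x); c4 = self.c4(c3); c5 = self.c5(c4)
        p5 = self.lat5(c5)                                   # top-down fusion
        p4 = self.lat4(c4) + F.interpolate(p5, size=c4.shape[-2:])
        p3 = self.lat3(c3) + F.interpolate(p4, size=c3.shape[-2:])
        p6 = F.max_pool2d(p5, 1, stride=2)                   # extra coarser levels
        p7 = F.max_pool2d(p6, 1, stride=2)
        return [p3, p4, p5, p6, p7]

feats = TinyFPN()(torch.randn(1, 3, 256, 256))
print([tuple(f.shape) for f in feats])       # resolution decreases from P3 to P7
```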
In addition, each scale has a corresponding branch network model, denoted Head, which comprises a first classifier, a second classifier, and a convolution kernel parameter prediction model.
The first classifier may be composed of several convolutional layers plus a fully connected layer that produces a classification output such as softmax.
As illustrated in fig. 5, the second classifier and the convolution kernel parameter prediction model may optionally share part of the model structure, such as the 4 convolution layers illustrated in the figure.
For the feature map Pi of a given scale, the corresponding convolution kernel parameter prediction model predicts the convolution kernel parameters of each feature point in Pi; the corresponding first classifier determines whether each feature point in Pi corresponds to a target object instance; and the corresponding second classifier determines whether each feature point in Pi is important among the feature points corresponding to the instance.
In practice, the first classifier outputs a first-category confidence for each feature point in Pi, the first category indicating whether the feature point corresponds to the target object instance. The second classifier outputs a second-category confidence for each feature point in Pi, the second category indicating whether the feature point is important among the feature points corresponding to the instance.
As shown in fig. 5, the system further includes a mask prediction branch (mask branch) corresponding to the feature map P3, which generates the mask image of each target object instance. The mask prediction branch is attached to P3 because P3 has the highest spatial resolution of the multi-scale feature maps and contains the richest detail, such as edge and shape information; using P3 as the feature map for mask generation therefore yields better quality.
Of course, in practical applications, the mask prediction branch can also be implemented based on feature maps of other scales, such as the feature map P4. The choice of which scale feature map to use may be based on a trade-off between computational effort and mask image quality.
As shown in fig. 5, the mask prediction branch may include a mask generator comprising several convolutional layers. The mask generator takes two inputs: the convolution kernel parameters of the target feature point, and a feature map P3' obtained by processing the feature map P3. Suppose P3 has spatial resolution H3 × W3 and 256 feature channels. The processing includes the feature dimension reduction illustrated in fig. 5, for example producing an H3 × W3 × 8 feature map, followed by computing each feature point's position offsets relative to the target feature point (one offset in x and one in y) and appending these two offsets to each feature point, yielding an H3 × W3 × 10 feature map. This H3 × W3 × 10 map is the feature map P3'.
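A sketch of constructing P3' per the description above (the offset normalization is an assumption for illustration):

```python
# Build P3': reduce P3 to 8 channels, then append each point's (x, y) offsets
# relative to the target feature point as two extra channels -> H3 x W3 x 10.
import torch
import torch.nn as nn

H3, W3 = 128, 128
p3          = torch.randn(1, 256, H3, W3)            # feature map P3
reduce_conv = nn.Conv2d(256, 8, kernel_size=1)       # feature dimension reduction
p3_small    = reduce_conv(p3)                        # (1, 8, H3, W3)

ty, tx = 40, 90                                      # hypothetical target point
ys = torch.arange(H3, dtype=torch.float32).view(1, 1, H3, 1).expand(1, 1, H3, W3)
xs = torch.arange(W3, dtype=torch.float32).view(1, 1, 1, W3).expand(1, 1, H3, W3)
offsets = torch.cat([(xs - tx) / W3, (ys - ty) / H3], dim=1)   # (1, 2, H3, W3)

p3_prime = torch.cat([p3_small, offsets], dim=1)     # (1, 10, H3, W3) = P3'
print(p3_prime.shape)
```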
Based on the framework illustrated in fig. 5, after multi-scale feature extraction yields the feature maps P3 to P7, each scale's feature map is processed by its first classifier, second classifier, and convolution kernel parameter prediction model, finally producing the first-category confidence, second-category confidence, and convolution kernel parameters of every feature point in the feature maps.
For any feature point, its first-category confidence reflects the probability, as judged by the first classifier, that the point corresponds to a target object instance, and its second-category confidence reflects the probability, as judged by the second classifier, that the point is the important positive sample for that instance. "Important positive sample" means that, among the feature points corresponding to the target object instance, this one is the important feature point; equivalently, the feature points corresponding to the instance are divided into two categories, the important positive sample and the rest.
After the first-category and second-category confidences of every feature point in every feature map are obtained, a total confidence is determined for each feature point from the two (called the first total confidence, to distinguish it from the second total confidence below). Optionally, the first total confidence may be the product of the first-category and second-category confidences. Finally, the feature points whose first total confidence exceeds the set threshold are taken as the target feature points. In practice more than one feature point may exceed the threshold; the number of target feature points reflects the number of target object instances contained in the image to be detected.
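A sketch of this selection step (the threshold value and toy sizes are assumptions):

```python
# Select target feature points: per point, multiply the two confidences and
# keep points whose first total confidence exceeds a threshold, across scales.
import torch

def select_target_points(cls_confs, imp_confs, threshold=0.5):
    """cls_confs / imp_confs: per-scale flat tensors of shape (H_i * W_i,)."""
    picked = []
    for scale, (c1, c2) in enumerate(zip(cls_confs, imp_confs)):
        total = c1 * c2                                  # first total confidence
        for idx in torch.nonzero(total > threshold).flatten().tolist():
            picked.append((scale, idx, float(total[idx])))
    return picked          # one entry per detected target object instance

cls_confs = [torch.rand(32 * 32), torch.rand(16 * 16)]  # toy two-scale example
imp_confs = [torch.rand(32 * 32), torch.rand(16 * 16)]
print(select_target_points(cls_confs, imp_confs, threshold=0.9))
```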
Because the convolution kernel parameters of every feature point in every feature map have already been predicted, once a target feature point is determined its convolution kernel parameters can be obtained directly.
The convolution kernel parameters of each target feature point are input to the mask generator in the mask prediction branch, which generates the corresponding mask image based on the feature map P3'. Note that when there are several target feature points, say Z1 and Z2, the parameters of each must be fed to the mask generator in turn: given the parameters of Z1 it outputs one mask image, and given the parameters of Z2 it outputs another.
In the scheme provided by this embodiment, feature maps at multiple scales are extracted, and the two classification predictions and the convolution kernel parameter prediction are performed on them in parallel; the feature point that best represents each target object instance is then selected among all the feature points of all the maps. This ensures that exactly one target feature point, the one with the highest confidence, is finally selected per instance, and a good-quality mask image can be obtained directly from it, which is computationally simple and efficient.
In selecting the target feature points, the selection is not made per scale; rather, the feature maps of all scales are treated as a whole and the qualifying points are chosen from all of them. In practice, when there are several target feature points they tend to fall in the feature map of a single scale. Viewed from the result, it is as if the model first decided which scale's feature map suits the current image to be detected and then selected the target feature points from that map, although that is not what actually happens.
In summary, the scheme of the embodiments of the present invention directly predicts the mask image of each target object instance on the whole feature map through dynamic convolution, completely abandoning the conventional idea of first predicting a rectangular box region (bbox) containing the instance and then predicting the mask image, and is friendlier to irregular target object instances. Therefore, when the instances to be segmented include irregularly shaped objects, the scheme achieves a better segmentation effect. Moreover, by selecting important feature points, one representative feature point is output per target object instance, and the instance's mask image is generated directly from that point's convolution kernel parameters, so the bbox NMS processing of conventional schemes is unnecessary and efficiency improves.
After the mask image of each target object instance contained in the image to be detected is determined through the scheme provided by each embodiment, each target object instance can be marked in the image to be detected, so that a user can confirm or modify the detection result.
In the embodiments of the present invention, target object instances of the same target category may further be divided into different types (sub-types) as needed, and instances of the same type may be marked in the same marking manner. For example, if the target category is person and it is further divided into the two types man and woman, then when one man and two women are detected in the image to be detected, the pixel regions of the two women may both be filled with red and the pixel region of the man with blue.
In summary, when the number of target object instances in the image to be detected is multiple, multiple target object instances may be marked in the image to be detected according to the mask images corresponding to the multiple target object instances, where the marking modes corresponding to the target object instances of the same type in the multiple target object instances are the same.
Specifically, the mask image of a target object instance describes the pixel region (that is, the position) of the instance in the image to be detected, and hence its contour. The contour of the instance can be located in the image to be detected according to its mask image, and then marked in a set manner: for example, the region enclosed by the contour is filled with a certain color, or the contour is drawn with a line of a certain style, and so on.
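A sketch of this contour marking using OpenCV (the colors and line style are arbitrary illustration choices):

```python
# Locate an instance's contour from its mask image and mark it.
import cv2

image = cv2.imread("to_detect.png")
mask  = cv2.imread("mask_instance.png", cv2.IMREAD_GRAYSCALE)    # 0/255 mask

contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)

# Draw the contour with a line of a certain style ...
cv2.drawContours(image, contours, -1, color=(0, 255, 0), thickness=2)
# ... or fill the region enclosed by the contour with a certain color:
# cv2.drawContours(image, contours, -1, color=(0, 0, 255), thickness=cv2.FILLED)

cv2.imwrite("marked_contour.png", image)
```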
It will be appreciated that the target categories above may be considered primary categories and the sub-types secondary categories. Both are essentially classification tasks, so the meaning of the first and second categories above is essentially unchanged: the first classification decides whether a feature point corresponds to a given category, and the second decides whether it is an important feature point for that category.
Optionally, in addition to marking the different target object instances in the image to be detected with different colors and lines as in the examples above, the category names of the instances, such as man and woman, may also be marked.
Based on the display of the marking result of each target object instance in the image to be detected, the user can confirm or modify the marking result and the like. For example, when the user finds that the marking result of a certain target object instance is inaccurate, such as the position of the contour boundary is inaccurate, the user may trigger a corresponding adjustment operation to adjust the position of the contour boundary to an accurate position. At this time, the adjustment behavior of the user may be collected as a basis for optimization of the image segmentation model.
In an optional embodiment, if the image to be detected contains many target object instances, displaying the marking results of all of them on the interface may be confusing; in that case only part of the instances may be marked. Specifically, a target object instance is marked in the image to be detected if the confidence corresponding to its mask image is lower than a set threshold. In this way the user can judge the accuracy of that instance's detection result from the marking and correct it when it is inaccurate.
It will be appreciated that in computing the mask image of a target object instance from the convolution kernel parameters of its target feature point and a feature map of a certain scale, a confidence for the mask image may be output at the same time. In an optional embodiment this confidence may simply be the first total confidence of the target feature point, but it is not limited to this; for example, it may also be output automatically by the mask generator described below.
The following describes a training process of the image segmentation model illustrated in fig. 5.
Fig. 6 is a flowchart of a model training process according to an embodiment of the present invention, and as shown in fig. 6, the training process includes the following steps:
601. Perform feature extraction at multiple scales on a training sample image to obtain multiple feature maps, one per scale.
602. For any feature map, determine the first-category confidence of each of its feature points through a first classifier, the second-category confidence of each through a second classifier, and the convolution kernel parameters of each through a convolution kernel parameter prediction model.
603. Generate, through a mask generator, the predicted mask image of each feature point of the feature map from that point's convolution kernel parameters and the feature map whose scale meets the set requirement.
604. Determine the mask confidence of each feature point of the feature map from its reference mask image and predicted mask image.
605. Determine the second total confidence of each feature point of the feature map from its first-category confidence and mask confidence.
606. From the second total confidences, determine the following supervision information for the second classifier: the feature point with the highest second total confidence is the important feature point and the others are non-important; then train the second classifier with this supervision information and the second-category confidences.
This embodiment mainly describes the training of the second classifier, in particular how its supervision information is generated. In summary, the supervision information of the second classifier is determined from the output of the first classifier and the output of the mask generator rather than being manually labeled in advance.
The training process for the first classifier and the mask generator is simpler than for the second classifier.
Wherein, the supervision information for training the first classifier is as follows: and training pixel labeling results corresponding to the object instances and not corresponding to the object instances in the sample images. Still, if two persons are included in a training sample image, the pixels included in the pixel areas corresponding to the two persons in the training sample image correspond to the category label T1, and the pixels included in the other pixel areas correspond to the category label T2. Wherein, T1 represents that the pixel is the pixel on the object instance, and T2 represents that the pixel is not the pixel on the object instance.
The first classifier is trained according to its supervision information and the first-category confidences it outputs. When multi-scale feature extraction is performed on the training sample image, each scale's feature map has its own first classifier, and the training of the different first classifiers is similar, so only the training of the first classifier of one scale is described below as an example.
Taking the feature map Pi of a certain scale as an example, the loss function of the corresponding first classifier is determined from the category label of each feature point in Pi and its predicted first-category confidence, and the classifier's model parameters are adjusted based on this loss, which trains the first classifier. It can be understood that the feature points of Pi have a mapping relationship with the pixels of the original training sample image; from this mapping and the category labels of the pixels, the category label of each feature point in Pi is obtained. The first-category confidence reflects the probability of a feature point carrying label T1 versus T2.
The supervision information for training the mask generator is the reference mask images corresponding to the object instances in the training sample image. Supposing the object instances are persons and a training sample image contains two persons, the image has two reference mask images, one per person. In practice, a user can mark the pixel region of each object instance, and the corresponding reference mask image is generated from the marking results.
The mask generator is trained based on the corresponding supervisory information (i.e., reference mask image) for the mask generator and its output mask image (referred to as a predicted mask image for distinction).
Still taking the above characteristic diagram Pi as an example, the training process of the second classifier and the mask generator is explained.
And inputting the feature map Pi into a corresponding first classifier, wherein the first classifier outputs a first class confidence coefficient corresponding to each feature point in the feature map Pi, and the first class confidence coefficient reflects the probability of the feature point corresponding to the class label T1 and the class label T2.
And inputting the feature map Pi into a corresponding second classifier, and outputting a second class confidence corresponding to each feature point in the feature map Pi by the second classifier. The second classifier is also used for implementing a classification task, and two class labels corresponding to the classification task are represented as a class label T3 and a class label T4, where the class label T3 represents that a feature point is an important feature point (or an important positive sample), and the class label T4 represents that the feature point is not an important feature point. As described above, the meaning of the significant positive example means that the feature point is a significant one of a plurality of feature points corresponding to the object instance. Based on this, the second category confidence reflects the probability magnitude that the feature point corresponds to the category label T3, T4.
Then, after the convolution kernel parameters of each feature point in Pi are obtained through the convolution kernel parameter prediction model, they are input into the mask generator in turn, so that the mask generator outputs the predicted mask image of each feature point in turn. If Pi contains M feature points, M predicted mask images are output.
For any feature point Qi, let its predicted mask image be denoted Mi. After the reference mask image X corresponding to Qi is determined, the region overlapping degree (i.e., Intersection over Union, IoU) of the predicted mask image Mi and the reference mask image X can be calculated and used as the mask confidence of Qi.
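By way of illustration, the mask confidence computation can be sketched as follows, assuming the predicted and reference mask images are binary numpy arrays; the function name and array layout are illustrative assumptions, not part of the patent:

```python
import numpy as np

def mask_iou(pred_mask: np.ndarray, ref_mask: np.ndarray) -> float:
    """Region overlapping degree (IoU) between a predicted mask image and a
    reference mask image, used as the mask confidence of a feature point."""
    pred = pred_mask.astype(bool)
    ref = ref_mask.astype(bool)
    intersection = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    if union == 0:
        return 0.0  # both masks empty; define the overlap as zero
    return float(intersection) / float(union)
```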
When the reference mask images are generated, the correspondence between the pixel positions of each object instance and its reference mask image is stored. Since the feature points in a feature map have a set mapping relationship with the pixels in the training sample image, the reference mask image X corresponding to the feature point Qi can be obtained based on this mapping relationship and the stored correspondence.
The mask confidence and the first class confidence of the feature point Qi are thus both available, and the second total confidence of Qi is determined from them. Optionally, the second total confidence may be the product of the first class confidence and the mask confidence of Qi.
Through the above process, the second total confidence of each feature point in Pi is obtained. Performing the same processing on the other feature maps yields the second total confidence of every feature point in all the feature maps.
Then, the supervision information of the second classifier is determined according to the second total confidences of all feature points in all the feature maps: the feature point with the highest second total confidence is an important feature point, and the remaining feature points are non-important feature points. That is, the feature point with the highest second total confidence is associated with class label T3, and the others with class label T4.
Therefore, based on the supervision information of the second classifier and the second class confidence it actually outputs for each feature point, the loss function of the second classifier can be calculated and the second classifier trained. For the mask generator, the loss function can be calculated from the region overlapping degrees between the predicted mask images it outputs and the corresponding reference mask images, so that the mask generator is trained with the goal of maximizing the region overlapping degree between the predicted mask image of the feature point with the highest second total confidence and its reference mask image.
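A minimal sketch of how this supervision information might be assembled, assuming the optional product form of the second total confidence described above and per-instance grouping of feature points (as suggested by the per-instance reference masks); the function name and array shapes are illustrative assumptions:

```python
import numpy as np

def second_classifier_labels(first_cls_conf: np.ndarray,
                             mask_conf: np.ndarray) -> np.ndarray:
    """Supervision for the second classifier over the M feature points
    (concatenated across all feature maps) of one object instance.

    Returns a 0/1 vector: 1 (label T3, important feature point) for the
    feature point with the highest second total confidence, 0 (label T4,
    non-important feature point) for all the others.
    """
    second_total_conf = first_cls_conf * mask_conf  # product form (optional)
    labels = np.zeros(second_total_conf.shape[0], dtype=np.int64)
    labels[int(np.argmax(second_total_conf))] = 1
    return labels
```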
Fig. 7 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 7, the method may include the following steps:
701. Acquiring a remote sensing image to be detected.
702. Cutting the remote sensing image based on an image cutting operation of the user or based on a preset image size to obtain a plurality of images.
703. Performing instance segmentation processing on the plurality of images respectively through an image segmentation model to obtain a mask image corresponding to a target object instance.
704. Marking the target object instance in the remote sensing image according to the mask image.
In this embodiment, it is assumed that the image to be subjected to instance segmentation is a remote sensing image captured by a remote sensing satellite. The shooting range of a remote sensing image is often wide; for example, it may cover an entire city, so its size is relatively large. To enable efficient instance segmentation of such a large image, the remote sensing image can be cut in advance into a plurality of images of smaller size.
Optionally, the cutting of the remote sensing image may be performed automatically according to a preset image size, such as 1024 × 1024.
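A minimal sketch of such automatic cutting, assuming the remote sensing image is an H × W × C numpy array; border tiles may be smaller than the preset size unless padding is added, which the patent does not specify:

```python
def cut_into_tiles(image, tile_size=1024):
    """Cut a large remote sensing image into tiles of a preset size,
    recording each tile's top-left offset so instance segmentation results
    can later be mapped back into the complete image."""
    tiles = []
    height, width = image.shape[:2]
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            tiles.append(((top, left), image[top:top + tile_size,
                                             left:left + tile_size]))
    return tiles
```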
Alternatively, the cutting of the remote sensing image may be the result of manual cutting by the user. As shown in fig. 8, the user loads and displays the remote sensing image to be detected on a computing device, and the display interface may present a plurality of image cutting operation items, such as the cropping, line, frame selection, and rotation items illustrated in fig. 8, through which the user cuts the remote sensing image. It can be understood that, because the user tries to ensure that the same target object instance is completely contained in one image rather than split across different images, manual cutting does not necessarily produce multiple images of the same size.
Moreover, the user can autonomously cut out, according to the instance segmentation task, a plurality of suitably sized images that may contain target object instances. For example, if the remote sensing image corresponds to a city and the task is to detect each building it contains, the user can first filter out image areas unlikely to contain buildings, such as rivers and mountains, according to the visual features of buildings, and cut only the remaining image areas, thereby reducing the computation of the subsequent image segmentation model.
As shown in fig. 8, for the building segmentation task, the user may skip a tree-covered area in the remote sensing image and cut only the other image areas that may contain buildings, obtaining an image F1 and an image F2.
Then, the plurality of cut images are input into the image segmentation model. For any image i among them, the image segmentation model generates the mask image of each target object instance contained in image i based on the target feature points in image i whose importance degrees with respect to that target object instance meet the set requirement. Based on the mask image, the corresponding target object instance can then be marked in image i; further, based on the position of image i in the complete remote sensing image, the marked image i can be restored into the complete remote sensing image, so that the marking result of the corresponding target object instance is obtained in the remote sensing image.
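The restoration step can be sketched as follows, again under the assumption of numpy arrays and recorded tile offsets (both illustrative):

```python
import numpy as np

def merge_tile_masks(full_shape, tile_results):
    """Restore per-tile instance masks into the coordinate frame of the
    complete remote sensing image.

    full_shape: (H, W) of the complete image.
    tile_results: iterable of ((top, left), mask) pairs, one binary mask per
    detected target object instance within a tile.
    Returns one full-size binary mask per instance.
    """
    full_masks = []
    for (top, left), mask in tile_results:
        canvas = np.zeros(full_shape, dtype=bool)
        th, tw = mask.shape
        canvas[top:top + th, left:left + tw] = mask.astype(bool)
        full_masks.append(canvas)
    return full_masks
```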
As described above, the image detection method provided by the present invention can be executed in the cloud, where a plurality of computing nodes may be deployed, each with processing resources such as computation and storage. In the cloud, several computing nodes may be organized to provide a service, and one computing node may also provide one or more services. The cloud may provide a service by exposing a service interface that users call to use the corresponding service. The service interface may take the form of a Software Development Kit (SDK), an Application Programming Interface (API), or other forms.
In the scheme provided by the embodiment of the present invention, the cloud may provide a service interface of the image detection service; the user calls this interface through user equipment to trigger a request to the cloud for invoking it. The cloud determines the computing node that responds to the request and performs the following steps using the processing resources of that computing node:
receiving a request for calling an image detection service interface by user equipment, wherein the request comprises an image to be detected;
executing the following steps by utilizing the processing resource corresponding to the image detection service interface:
carrying out instance segmentation processing on an image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance, wherein the image segmentation model generates the mask image based on target feature points, of which the importance degrees corresponding to the target object instance in the image to be detected meet the set requirements;
marking a target object example in an image to be detected according to the mask image;
and sending the image to be detected with the marking result to the user equipment.
For the detailed process of the image detection service interface using the processing resource to execute the image detection processing, reference may be made to the related description in the foregoing other embodiments, which is not described herein again.
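Purely as an illustration of the request/response flow, a minimal sketch of such a service interface is given below, assuming a Flask-style HTTP service; the endpoint name and the segment_instances / mark_instances helpers are hypothetical stand-ins for the image segmentation model and the marking step, not APIs defined by the patent:

```python
import base64
from flask import Flask, request, jsonify

app = Flask(__name__)

@app.route("/image-detection", methods=["POST"])  # hypothetical endpoint
def image_detection_service():
    # The call request carries the image to be detected and, optionally,
    # the category information of the target object instance to detect.
    image_bytes = request.files["image"].read()
    category = request.form.get("category")

    # Hypothetical helpers: instance segmentation and marking.
    masks = segment_instances(image_bytes, category)
    marked_png = mark_instances(image_bytes, masks)

    # Send the image to be detected with the marking result back to the
    # user equipment.
    return jsonify({
        "num_instances": len(masks),
        "marked_image": base64.b64encode(marked_png).decode("ascii"),
    })
```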
For ease of understanding, an illustrative description is given with reference to fig. 9. In fig. 9, when a user wants to perform instance segmentation on an image to be detected so as to detect a certain target object instance it contains, the user calls the image detection service interface on user equipment E1 to send a call request to a cloud computing node E2. The call request includes the image to be detected and may also include the category information of the target object instance to be detected. The calling method illustrated in fig. 9 is as follows: the user uses a specific APP that provides an upload button on one of its interfaces; the user loads the image to be detected on that interface, and clicking the upload button triggers the call request. That is, the APP is a client program of the cloud image detection service, and the upload button is an application program interface for calling that service. After the original image to be detected is loaded, the user can also edit it, for example with the cutting operation described above, using the image editing tools provided under the "image editing" menu.
In this embodiment, after receiving the call request, the cloud computing node E2 learns from the category information which category of object instance in the image to be detected needs to be detected, and then executes the detection process described in the foregoing embodiments, which is not repeated here. The computing node E2 then sends the final detection result, i.e., the marking result of the target object instance in the image to be detected, to the user equipment E1, and the user equipment E1 displays the marked image on its interface.
When a user finds that a marking result corresponding to a certain target object instance in the image to be detected is wrong, the marking result can be manually adjusted, and the adjustment information can be fed back to the computing node E2 through the user equipment E1, so that the computing node E2 can optimize the image segmentation model according to the adjustment information.
In addition, one or more operation items for processing the marking result may optionally be displayed on the user equipment E1, such as the operation item for the adjustment operation described above, as well as operation items related to the application purpose, such as the quantity statistics and area calculation items illustrated in fig. 9. Quantity statistics counts the number of target object instances of the same target category contained in the image to be detected; area calculation computes the coverage area (in physical space) of each target object instance.

After the user selects an operation item as required, the user equipment E1 may complete the operation locally and display the result, or may send a corresponding operation request to the computing node E2, so that E2 completes the operation and feeds the result back to E1 for display.
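These two operation items can be sketched as follows, assuming one binary mask per target object instance and a known ground sample distance (the physical pixel size); how the pixel scale is obtained is an assumption outside the patent:

```python
import numpy as np

def count_instances(masks):
    """Quantity statistics: the number of target object instances of one
    target category equals the number of mask images generated for it."""
    return len(masks)

def coverage_area(mask, ground_sample_distance):
    """Area calculation: physical coverage area of one instance, as the
    number of mask pixels times the area of a single pixel.

    ground_sample_distance: edge length of one pixel in meters (assumed).
    Returns the covered area in square meters.
    """
    return float(np.count_nonzero(mask)) * ground_sample_distance ** 2
```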
In practical application, instance segmentation arises in many application fields, in which the technical solution of the embodiments of the present invention may be used.
For example, to ensure residents' safety, buildings may not be erected beside a river channel. To check whether any building stands beside a river channel, the environment around the channel can be photographed by an unmanned aerial vehicle, or a corresponding remote sensing image can be obtained by remote sensing, and used as the image to be detected; instance segmentation of buildings is then performed on it to obtain the detection result.
Fig. 10 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 10, the method may include the following steps:
1001. Acquiring a remote sensing image.
1002. Performing instance segmentation processing on the remote sensing image through an image segmentation model to obtain a mask image corresponding to a building, wherein the image segmentation model generates the mask image based on target feature points in the remote sensing image whose importance degrees corresponding to the building meet the set requirement.
1003. Marking the building in the remote sensing image according to the mask image.
1004. Generating building record information.
In this embodiment, the remote sensing image may be an image acquired by a remote sensing satellite, such as an image of the surroundings of a river. For the process of detecting each building contained in the remote sensing image and generating the corresponding mask images, reference may be made to the related description in the foregoing embodiments, which is not repeated here.
It will be appreciated that if the remote sensing image contains no buildings, no mask image is generated (the result is empty); if it contains N buildings, N mask images are generated, where N is greater than or equal to 1.
Based on the generated mask images, after the corresponding buildings are marked in the remote sensing image, information such as the height and coverage area of each building can be further calculated, and the position of each building can be determined from the position information associated with the acquired remote sensing image. The building record information may include the coverage area, height, position, and other information of each building.
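One possible shape of such a record, assuming a geo-referenced image with a known top-left coordinate and ground sample distance (both illustrative assumptions; the rough meters-to-degrees conversion is for illustration only):

```python
import numpy as np

def building_record(mask, geo_origin, ground_sample_distance, height_m=None):
    """Assemble building record information from one building's full-size
    binary mask. geo_origin is the (lat, lon) of the image's top-left pixel;
    the mask is assumed to be non-empty."""
    rows, cols = np.nonzero(mask)
    # Approximate position: mask centroid converted to geographic coordinates
    # (about 111 km per degree; a crude illustration, not a geodesy library).
    lat = geo_origin[0] - rows.mean() * ground_sample_distance / 111_000.0
    lon = geo_origin[1] + cols.mean() * ground_sample_distance / 111_000.0
    return {
        "coverage_area_m2": float(rows.size) * ground_sample_distance ** 2,
        "position": (lat, lon),
        "height_m": height_m,  # height estimation is outside this sketch
    }
```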
In this embodiment, since the remote sensing image may be relatively large, it may be cut in advance, as described in the foregoing embodiment. After instance segmentation of buildings is performed on each cut image, the segmentation results of all the cut images are aggregated back into the complete remote sensing image.
Fig. 11 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 11, the method may include the following steps:
1101. Acquiring a commodity image.
1102. Performing instance segmentation processing on the commodity image through an image segmentation model to obtain a mask image corresponding to a commodity, wherein the image segmentation model generates the mask image based on target feature points in the commodity image whose importance degrees corresponding to the commodity meet the set requirement.
1103. Marking the commodity in the commodity image according to the mask image.
1104. Counting the number of commodities.
In an e-commerce scene, vending machines are commonly placed in areas such as shopping malls and stations; out-of-stock conditions need to be discovered in time so that the machines can be restocked promptly. With the scheme provided by the embodiment of the present invention, the remaining quantity of each kind of commodity in a vending machine can be counted.
Assuming that commodities of the same kind are placed on the same shelf layer of the vending machine, a camera that can photograph that layer can be arranged above each shelf layer; an image captured by such a camera is called a commodity image. A commodity image contains multiple commodities, which are the multiple target object instances to be segmented. Instance segmentation of the commodity image yields the mask images of the commodities, and after the corresponding commodities are marked in the commodity image according to the mask images, the remaining quantity of the commodities is obtained.
Fig. 12 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 12, the method may include the following steps:
1201. Acquiring a road image.
1202. Performing instance segmentation processing on the road image through an image segmentation model to obtain a mask image corresponding to a traffic object instance, wherein the image segmentation model generates the mask image based on target feature points in the road image whose importance degrees corresponding to the traffic object instance meet the set requirement.
1203. Marking the traffic object instance in the road image according to the mask image.
1204. Generating travel control information.
The scheme provided by this embodiment can be applied to an autonomous driving scene. Since a vehicle must adjust its travel according to the surrounding vehicles and pedestrians while driving, each person, vehicle, and similar object around it needs to be detected.
A camera can be arranged at the front of the vehicle to collect the road image ahead as the image to be detected. Instance segmentation of traffic objects is performed on this image to generate the mask image of each traffic object instance, and the corresponding traffic object instances are marked in the road image according to the mask images. Traffic object instances include objects such as persons and vehicles.
Based on the detection result, when a vehicle is found at a short distance ahead, the host vehicle can be controlled to travel at an appropriate speed; when a pedestrian is found at a short distance ahead, it can be controlled to decelerate or stop to avoid the pedestrian.
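A minimal sketch of such travel control logic; the categories, distances, and thresholds are illustrative assumptions, and estimating distance from a mask is outside this sketch:

```python
def travel_control(detections, safe_distance_m=30.0):
    """Generate simple travel control information from segmented traffic
    object instances, given as (category, distance_m) pairs."""
    for category, distance_m in detections:
        if category == "pedestrian" and distance_m < safe_distance_m:
            return "decelerate_or_stop"    # avoid the pedestrian
        if category == "vehicle" and distance_m < safe_distance_m:
            return "adapt_speed"           # keep an appropriate speed
    return "maintain_speed"
```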
Fig. 13 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 13, the method may include the following steps:
1301. Acquiring a teaching image.
1302. Performing instance segmentation processing on the teaching image through an image segmentation model to obtain a mask image corresponding to a teaching object instance, wherein the image segmentation model generates the mask image based on target feature points in the teaching image whose importance degrees corresponding to the teaching object instance meet the set requirement.
1303. Marking the teaching object instance in the teaching image according to the mask image.
In an education scene, a teacher may use presentation aids such as blackboard writing and slides (PPT) during lessons, and students may use papers, textbooks, and the like. These presentation aids, papers, and textbooks can be photographed to obtain teaching images. When students capture a large number of teaching images, they subsequently need to classify, organize, and search them as required.
A teaching image may contain various knowledge points arranged in a tree-like relationship, for example two subheadings in parallel under one main heading. For instance, the main heading is "trigonometric function" and the subheadings include "sine", "cosine", and "tangent". In this embodiment, the teaching object instances may correspond to the different knowledge point headings: by performing instance segmentation on the knowledge point headings contained in a teaching image, the position of the required heading is detected, reflected by the corresponding mask image, and then marked in the teaching image, so that teachers and students can conveniently retrieve the teaching images containing a required knowledge point heading.
Fig. 14 is a flowchart of another image detection method according to an embodiment of the present invention, and as shown in fig. 14, the method may include the following steps:
1401. Acquiring a medical image.
1402. Performing instance segmentation processing on the medical image through an image segmentation model to obtain a mask image corresponding to a treatment object instance, wherein the image segmentation model generates the mask image based on target feature points in the medical image whose importance degrees corresponding to the treatment object instance meet the set requirement.
1403. Marking the treatment object instance in the medical image according to the mask image.
In the medical field, a large number of medical images are produced, for example by various examination instruments. When an organization needs to gather statistics on and analyze a certain disease, segmentation of the treatment object instance corresponding to that disease can be performed on each collected medical image to identify the medical images related to the disease. The treatment object instance can also be marked in the medical image so that users can conveniently locate and observe it. In this embodiment, the treatment object instance may be the lesion site corresponding to the disease.
The above examples of application fields merely illustrate scenarios to which the image detection scheme provided by the embodiments of the present invention can be applied; the invention is not limited thereto.
An image detection apparatus according to one or more embodiments of the present invention will be described in detail below. Those skilled in the art will appreciate that these image detection devices can be constructed using commercially available hardware components configured by the steps taught in the present scheme.
Fig. 15 is a schematic structural diagram of an image detection apparatus according to an embodiment of the present invention, and as shown in fig. 15, the apparatus includes: the device comprises an acquisition module 11 and a segmentation module 12.
And the acquisition module 11 is used for acquiring an image to be detected.
And the segmentation module 12 is configured to perform instance segmentation processing on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance, and mark the target object instance in the image to be detected according to the mask image.
Optionally, the image to be detected is a remote sensing image, and the segmentation module 12 may be specifically configured to: cutting the remote sensing image based on image cutting operation of a user or based on a preset image size to obtain a plurality of images; and respectively carrying out instance segmentation processing on the plurality of images through the image segmentation model to obtain a mask image corresponding to the target object instance.
Optionally, the target object instance comprises an object instance having an irregular shape.
Optionally, the apparatus further comprises: and the optimization module is used for responding to the adjustment operation of the marking result by the user and optimizing the image segmentation model according to the adjustment operation.
Optionally, the segmentation module 12 may be specifically configured to: extracting the features of the image to be detected to obtain a feature map; determining the target feature points in the feature map; determining model parameters corresponding to the target feature points; and generating a mask image corresponding to the target object instance according to the feature map and the model parameters.
Optionally, the segmentation module 12 may be specifically configured to: carrying out feature extraction on an image to be detected in multiple scales to obtain multiple feature maps corresponding to the multiple scales; respectively carrying out classification prediction on the plurality of feature maps to determine a first class confidence coefficient and a second class confidence coefficient corresponding to each feature point in the plurality of feature maps, wherein the first class is used for indicating whether the feature point corresponds to the target object example, and the second class is used for indicating whether the feature point is important in the plurality of feature points corresponding to the target object example; respectively determining the total confidence corresponding to each feature point according to the first category confidence and the second category confidence corresponding to each feature point; determining the characteristic points with the total confidence degrees larger than a set threshold value in all the characteristic points as target characteristic points; determining model parameters corresponding to the target feature points; and generating a mask image corresponding to the target object instance according to the feature maps with the scales meeting the set requirements in the feature maps and the model parameters.
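For illustration, the selection of target feature points at inference time might look as follows, assuming the total confidence takes the same product form as the second total confidence used during training (the threshold value is illustrative):

```python
import numpy as np

def select_target_feature_points(first_conf, second_conf, threshold=0.5):
    """first_conf: (M,) confidence that each feature point corresponds to the
    target object instance; second_conf: (M,) confidence that it is an
    important feature point. Returns indices of the target feature points
    whose total confidence exceeds the set threshold."""
    total_conf = first_conf * second_conf
    return np.nonzero(total_conf > threshold)[0]
```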
The multiple scales respectively correspond to a first classifier and a second classifier, and the first class confidence coefficient and the second class confidence coefficient are respectively output by the first classifier and the second classifier; the mask image is output by a mask generator.
Wherein, the supervision information for training the mask generator is as follows: a reference mask image corresponding to an object instance in the training sample image. The supervisory information for training the first classifier is: and marking the pixel in the training sample image corresponding to the object instance and the pixel not corresponding to the object instance. Supervisory information to train the second classifier is determined from the output of the first classifier and the output of the mask generator.
Optionally, the apparatus further comprises: and a training module. In the training process of the second classifier, the training module is specifically configured to:
carrying out feature extraction on the training sample image in multiple scales to obtain multiple feature maps corresponding to the multiple scales;
for any one feature map, determining a first class confidence corresponding to each feature point in the feature map through the first classifier, determining a second class confidence corresponding to each feature point in the feature map through the second classifier, and determining a convolution kernel parameter corresponding to each feature point in the feature map;
respectively generating a prediction mask image corresponding to each feature point in any feature map according to the convolution kernel parameters corresponding to each feature point in any feature map and the feature maps with the scales meeting the set requirements in the feature maps;
determining the mask confidence corresponding to each feature point in any feature map according to the reference mask image and the prediction mask image corresponding to each feature point in any feature map;
determining a second total confidence corresponding to each feature point in any feature map according to the first category confidence and the mask confidence corresponding to each feature point in any feature map;
determining the following supervision information according to the second total confidence corresponding to each feature point in the plurality of feature maps: the feature point with the highest second total confidence coefficient is an important feature point, and other feature points are non-important feature points;
and training the second classifier according to the supervision information and the second class confidence.
The apparatus shown in fig. 15 can execute the image detection method provided in the foregoing embodiments; for the detailed execution process and technical effects, reference may be made to the description in the foregoing embodiments, which is not repeated here.
In one possible design, the structure of the image detection apparatus shown in fig. 15 may be implemented as an electronic device, which may include: a processor 21 and a memory 22, wherein the memory 22 stores executable code which, when executed by the processor 21, causes the processor 21 to implement at least the image detection method provided in the foregoing embodiments.
Optionally, the electronic device may further include a communication interface 23 for communicating with other devices.
In addition, an embodiment of the present invention provides a non-transitory machine-readable storage medium having stored thereon executable code, which, when executed by a processor of an electronic device, causes the processor to implement at least the image detection method as provided in the foregoing embodiments.
The above-described apparatus embodiments are merely illustrative; the units described as separate components may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, which those of ordinary skill in the art can understand and implement without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by adding a necessary general hardware platform, or by a combination of hardware and software. Based on this understanding, the above technical solutions, in essence or in the part contributing to the prior art, may be embodied in the form of a computer program product, which may be carried on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) containing computer-usable program code.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (12)

1. An image detection method, comprising:
acquiring an image to be detected;
carrying out instance segmentation processing on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance;
marking the target object instance in the image to be detected according to the mask image;
the image segmentation model generates the mask image based on the target feature points, of which the importance degrees corresponding to the target object examples in the image to be detected meet the set requirements.
2. The method of claim 1, wherein the target object instance comprises an object instance having an irregular shape.
3. The method of claim 1, further comprising:
and responding to the adjustment operation of the user on the marking result, and optimizing the image segmentation model according to the adjustment operation.
4. The method according to claim 1, wherein the number of the target object instances is multiple, and the marking modes corresponding to the target object instances of the same type in the multiple target object instances are the same.
5. The method of claim 1, wherein the instance segmentation processing comprises the following steps:
extracting the features of the image to be detected to obtain a feature map;
determining the target feature points in the feature map;
determining model parameters corresponding to the target feature points;
and generating a mask image corresponding to the target object instance according to the feature map and the model parameters.
6. The method of claim 5, wherein the instance segmentation processing comprises the following steps:
carrying out feature extraction on an image to be detected in multiple scales to obtain multiple feature maps corresponding to the multiple scales;
respectively carrying out classification prediction on the plurality of feature maps to determine a first class confidence coefficient and a second class confidence coefficient corresponding to each feature point in the plurality of feature maps, wherein the first class is used for indicating whether the feature point corresponds to the target object example, and the second class is used for indicating whether the feature point is important in the plurality of feature points corresponding to the target object example;
respectively determining the total confidence corresponding to each feature point according to the first category confidence and the second category confidence corresponding to each feature point;
determining the characteristic points with the total confidence degrees larger than a set threshold value in all the characteristic points as target characteristic points;
determining model parameters corresponding to the target feature points;
and generating a mask image corresponding to the target object instance according to the feature maps with the scales meeting the set requirements in the feature maps and the model parameters.
7. An image detection method, comprising:
receiving a request for calling an image detection service interface by user equipment, wherein the request comprises an image to be detected;
executing the following steps by utilizing the processing resource corresponding to the image detection service interface:
carrying out instance segmentation processing on the image to be detected through an image segmentation model to obtain a mask image corresponding to a target object instance;
marking the target object instance in the image to be detected according to the mask image;
sending the image to be detected with the marking result to the user equipment;
the image segmentation model generates the mask image based on the target feature points, of which the importance degrees corresponding to the target object examples in the image to be detected meet the set requirements.
8. An image detection method, comprising:
acquiring a remote sensing image;
carrying out instance segmentation processing on the remote sensing image through an image segmentation model to obtain a mask image corresponding to a building;
marking the building in the remote sensing image according to the mask image;
the image segmentation model generates the mask image based on target feature points, of which the importance degrees corresponding to buildings in the remote sensing image meet set requirements.
9. The method of claim 8, wherein the performing instance segmentation processing on the remote sensing image through an image segmentation model comprises:
cutting the remote sensing image based on image cutting operation of a user or based on a preset image size to obtain a plurality of images;
and respectively carrying out instance segmentation processing on the plurality of images through the image segmentation model to obtain a mask image corresponding to the target object instance.
10. An image detection apparatus, characterized by comprising:
the acquisition module is used for acquiring an image to be detected;
and the segmentation module is used for carrying out instance segmentation processing on the image to be detected through an image segmentation model so as to obtain a mask image corresponding to a target object instance, and marking the target object instance in the image to be detected according to the mask image.
11. An electronic device, comprising: a memory, a processor; wherein the memory has stored thereon executable code which, when executed by the processor, causes the processor to perform the image detection method of any one of claims 1 to 6.
12. A non-transitory machine-readable storage medium having stored thereon executable code, which when executed by a processor of an electronic device, causes the processor to perform the image detection method of any one of claims 1 to 6.
Priority Applications (1)

Application CN202110426584.9A, priority date 2021-04-20, filing date 2021-04-20, title: Image detection method, device, equipment and storage medium.

Publications (1)

Publication number CN115223035A, publication date 2022-10-21. Family ID: 83604449.


Legal Events

PB01: Publication
SE01: Entry into force of request for substantive examination
TA01: Transfer of patent application right (effective date of registration: 2024-03-11)
Applicant before: Alibaba Singapore Holdings Ltd., Room 01, 45th Floor, AXA Building, 8 Shanton Road, Singapore
Applicant after: Alibaba Innovation Co., # 03-06, Lai Zan Da Building 1, 51 Belarusian Road, Singapore