WO2021051601A1

WO2021051601A1 - Method and system for selecting detection box using mask r-cnn, and electronic device and storage medium

Info

Publication number: WO2021051601A1
Application number: PCT/CN2019/118279
Authority: WO
Inventors: 陈欣
Original assignee: 平安科技（深圳）有限公司
Priority date: 2019-09-19
Filing date: 2019-11-14
Publication date: 2021-03-25
Also published as: CN110738125B; CN110738125A

Abstract

A method and system for selecting a detection box using a Mask R-CNN, and an electronic device and a storage medium, relating to the technical field of image recognition. The method comprises: performing instance segmentation on a target image using a Mask R-CNN, and obtaining a rectangular candidate detection box and a polygonal contour corresponding to the candidate detection box (S110); respectively calculating IOU values of the candidate detection box and the polygonal contour, and when the IOU value of the candidate detection box is greater than a first preset threshold IOU₁ and the IOU value of the polygonal contour is greater than a second preset threshold IOU₂, screening out the candidate detection box as a target detection box, wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁ (S120). By means of IOU secondary screening of the polygonal contour, the detection precision of the detection box is improved.

Description

Method and system for selecting detection frame using Mask R-CNN, electronic device and storage medium

This application claims priority to the patent application No. 201910885674.7, filed on September 19, 2019, and entitled "Method, device and storage medium for selecting a detection frame using Mask R-CNN".

Technical field

This application relates to the field of image recognition technology, and in particular to a method and system for selecting a detection frame using Mask R-CNN, an electronic device, and a storage medium.

Background

Video-based detection and tracking of moving human bodies is widely used in the surveillance of crowded places with high safety requirements, such as banks and railway stations. Human tracking in real-time scenes is complicated by interfering factors such as background changes and occlusions, so it is difficult to meet the requirements of detection accuracy, robustness, and real-time performance at the same time.

The applicant has recognized that current human body detection and tracking methods rely on a rectangular search box, which has the following disadvantages:

1. The search box evaluates detection results by IOU alone; even when the search box meets the IOU criterion, it may still contain interfering image content;

2. The target classification of the search box is currently limited to broad categories, such as humans or animals; finer distinctions, such as male versus female or old versus young, cannot be made;

3. Human body detection against a complex background is strongly affected by the surroundings; for example, when the color of a pedestrian's clothes is similar to the background or the background lighting changes sharply, it is difficult to segment the moving human body from the background;

4. "Shadows" and "mirror images" in the scene increase the complexity of the features in the search box and interfere with detection, causing misjudgments such as "the person in the mirror is a person" or "the shadow area is a person"; likewise, moving objects in the scene, such as cars, swaying trees, or rippling water surfaces, increase the complexity of the features in the search box and make detection more difficult.

In view of the above problems, there is an urgent need for a target detection method that better eliminates interference, distinguishes false targets, and performs more detailed classification.

Summary of the invention

This application provides a method and system for selecting a detection frame using Mask R-CNN, an electronic device, and a computer-readable storage medium. It obtains a rectangular frame and a set of polygon contour points for a target through instance segmentation; after the rectangular frame passes an initial screening by IOU value, the polygon contour point set is used for a secondary screening by IOU value, and a rectangular frame that passes both screenings is used as the target detection frame for subsequent target detection.

To achieve the above objective, this application also provides a method for selecting a detection frame using Mask R-CNN, which is applied to an electronic device, and the method includes:

S110. Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and its polygon contour; S120. Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is selected as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.

To achieve the above objective, a system for selecting a detection frame using Mask R-CNN includes an instance segmentation module and a target detection frame screening module. The instance segmentation module is used to perform instance segmentation on the target image using Mask R-CNN to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame. The target detection frame screening module is configured to calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is selected as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.

To achieve the above objective, the present application provides an electronic device that includes a memory and a processor. The memory includes a detection frame selection program, and when the detection frame selection program is executed by the processor, the following steps are implemented: S110. Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and its polygon contour; S120. Calculate the IOU values of the candidate detection frame and the polygon contour respectively, and compare each with its preset threshold; wherein the preset threshold of the candidate detection frame is IOU₁, the preset threshold of the polygon contour is IOU₂, and IOU₂ is greater than IOU₁; S130. Select, as the target detection frame, a candidate detection frame whose IOU value is greater than IOU₁ and whose polygon contour IOU value is greater than IOU₂.

In addition, to achieve the above objective, the present application also provides a computer-readable storage medium that stores a computer program. The computer program includes a detection frame selection program, and when the detection frame selection program is executed by a processor, it implements the steps of the above method for selecting a detection frame using Mask R-CNN.

The method and system for selecting a detection frame using Mask R-CNN, the electronic device, and the computer-readable storage medium proposed in this application use a Mask R-CNN (Mask Region-based Convolutional Neural Network). The deep neural network repeatedly applies convolution and pooling to the monitored image, extracts the key features of the image, and outputs the detection results and categories (that is, the rectangular frame of each object in the image). A preliminary IOU screening is performed on the overlap between the obtained rectangular frame and the real target; then the polygon point set obtained by the Mask branch (that is, the polygon contour obtained by instance segmentation) is used for a secondary IOU screening between the polygon point set and the real target; finally, a frame that meets the set thresholds is used as the detection frame. The beneficial effects are as follows:

(1) The polygon point set of the target is obtained through the Mask branch of Mask R-CNN, and the pixel range is reduced (that is, the bounding box range is narrowed) on the basis of the rectangular candidate frame, so as to achieve more detailed target classification;

(2) Based on the characteristics of shadows and mirror images, two-dimensional array coding is used to form an analysis method for judging whether a mirror image exists, so as to eliminate false targets such as shadows;

(3) The IOU of the polygon contour is calculated using the two-dimensional array coding method, which is accurate and fast;

(4) Candidate frames are first screened by the IOU of the candidate frame and then screened a second time by the IOU of the polygon point set, so that further regression yields a more accurate target detection frame.

Description of the drawings

Fig. 1 is a flowchart of a preferred embodiment of a method for selecting a detection frame by using Mask R-CNN in this application;

FIG. 2 is a flowchart of a preferred embodiment of a method for calculating an IOU value using a two-dimensional array mapping coding method according to this application;

FIG. 3 is a schematic diagram of a preferred embodiment of the two-dimensional array mapping coding method of this application;

FIG. 4 is a schematic structural diagram of a preferred embodiment of a system for selecting a detection frame by using Mask R-CNN in this application;

FIG. 5 is a schematic structural diagram of a preferred embodiment of the electronic device of this application;

The realization, functional characteristics, and advantages of the purpose of this application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.

Detailed description

It should be understood that the specific embodiments described here are only used to explain the application, and not used to limit the application.

It should be noted that in this document the words "first" and "second" are only used to distinguish items that share the same name, and do not imply any relationship or order between them.

The purpose of target detection is to identify and locate a specific category of objects in a picture or video. The detection process can be regarded as a classification process to distinguish between the target and the background. In the detection process, the selection of the detection frame affects the elimination of interference in the detection and the level of detail in the classification in the detection.

This application provides a method for selecting a detection frame using Mask R-CNN. Referring to FIG. 1, it is a flowchart of a preferred embodiment of a method for selecting a detection frame by using Mask R-CNN in this application. The method can be executed by a device, and the device can be implemented by software and/or hardware.

Here, Mask R-CNN (Mask Region-based Convolutional Neural Network) predicts the category of each detected object in the image, fine-tunes its bounding box, and uses a mask to segment the polygon contour of the detected object; the bounding box is the smallest rectangular frame that can contain the object in the image.

In this embodiment, the method of using Mask R-CNN to select a detection frame includes: step S110-step S130.

S110. Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and its polygonal contour.

The instance segmentation of Mask R-CNN is divided into two steps. The first step is selection: the position and type of the candidate frame are determined (that is, the category of the object in the image is predicted and the frame is fine-tuned), and the selected region is a rectangle. The second step is segmentation: the segmented region is the polygon contour (obtained by the Mask branch).

S120. Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is selected as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁. It should be noted that IOU (Intersection over Union) can be understood as the degree of overlap between the prediction frame and the candidate detection frame.
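As an illustration of the IOU measure for the rectangular case, the following is a minimal sketch rather than the application's own implementation. It assumes axis-aligned frames given as (x1, y1, x2, y2) tuples; the function name `box_iou` and the tuple layout are hypothetical.

```python
def box_iou(box_a, box_b):
    """IOU of two axis-aligned frames given as (x1, y1, x2, y2).

    The coordinate layout is an assumption made for this sketch.
    """
    # Corners of the overlapping rectangle (if any).
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint frames give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0
```

For two 2x2 frames offset by one unit in each direction, the intersection is 1 and the union is 7, so the IOU is 1/7.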

In a specific embodiment, the first preset threshold IOU₁ and the second preset threshold IOU₂ can be set according to different scenarios; and, in order to improve the detection accuracy of the rectangular detection frame, the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.

First, the candidate detection frame is matched against the predicted target and the result of this first matching is screened; that is, only candidate detection frames whose IOU value is greater than IOU₁ are retained.

Then, the polygon contour is matched against the predicted target and the result of this second matching is screened; that is, only candidates whose polygon contour IOU value is greater than IOU₂ are retained.

A candidate detection frame that passes both screenings is used as the final target detection frame.
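The two screenings can be sketched as follows. This is a minimal illustration, assuming the two IOU values of each candidate have already been computed and stored under the hypothetical keys `box_iou` and `contour_iou`; the default thresholds 0.5 and 0.7 are merely one choice for which IOU₂ is greater than IOU₁, not values fixed by the text.

```python
def select_target_frames(candidates, iou1=0.5, iou2=0.7):
    """Keep candidates that pass both IOU screenings.

    candidates: list of dicts with precomputed 'box_iou' and
    'contour_iou' values (hypothetical field names).
    """
    if not iou2 > iou1:
        raise ValueError("the second threshold must be greater than the first")

    selected = []
    for cand in candidates:
        # Preliminary screening on the rectangular candidate frame.
        if cand["box_iou"] <= iou1:
            continue
        # Secondary screening on the polygon contour.
        if cand["contour_iou"] <= iou2:
            continue
        selected.append(cand)
    return selected
```

A candidate with a frame IOU of 0.9 but a contour IOU of 0.6 is rejected by the second screening under these defaults, which is the case the secondary screening is meant to catch.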

In a specific embodiment, the value ranges of the first preset threshold IOU₁ and the second preset threshold IOU₂ are both 0.5-0.7.

In summary, instance segmentation with Mask R-CNN yields two branch results, the candidate detection frame and the polygon contour, and this application establishes a new judgment relationship between these two parallel, non-intersecting branch results: the candidate detection frame is used for a preliminary IOU screening and the polygon contour is used for a secondary IOU screening, so that a target detection frame with higher detection accuracy is obtained.

FIG. 2 shows a flowchart of a preferred embodiment of the method for calculating an IOU value using the two-dimensional array mapping coding method of the present application. As shown in FIG. 2, the method includes steps S210-S230.

S210. Map the polygon contour and its prediction frame to a plane template pre-divided by a line segment combination, wherein the line segment combination divides the plane template into equal-sized segmented blocks;

Referring to FIG. 3, a schematic diagram of a preferred embodiment of the two-dimensional array mapping encoding method of the present application; FIG. 3 shows the encoding process of the two-dimensional array mapping encoding method.

On the right of FIG. 3 is the detected target object, whose outer boundary is the polygon contour. The polygon contour is mapped onto the binary image; as shown in FIG. 3, the binary image is divided into equal-sized segmentation blocks by the selected combination of line segments, and the segmentation blocks in the binary image consist of blocks coded as 1 and blocks coded as 0.

S220. Map the results for the polygon contour and its prediction frame onto a binary image of the same size as the plane template, and represent each segmentation block as a two-dimensional array code of the form (A, B), where A is the coding state of the segmentation block with respect to the polygon contour and B is its coding state with respect to the prediction frame: A=1 when the segmentation block is located inside the polygon contour and A=0 when it is located outside the polygon contour; B=1 when the segmentation block is located inside the prediction frame and B=0 when it is located outside the prediction frame.

As shown in FIG. 3, the humanoid contour on the right is mapped onto the binary image on the left. A segmentation block located inside the polygon contour is assigned the value 1, and a segmentation block located outside the polygon contour is assigned the value 0. The binary image after assignment is shown in FIG. 3.

Specifically, because the polygon contour and its prediction frame differ, a segmentation block may be assigned different values with respect to the polygon contour and with respect to the prediction frame. If a segmentation block is inside both the polygon contour and its prediction frame, the code of the block is (1, 1); if it is inside the polygon contour but not inside the prediction frame, the code is (1, 0); if it is not inside the polygon contour but is inside the prediction frame, the code is (0, 1); if it is inside neither the polygon contour nor the prediction frame, the code is (0, 0). Therefore, a segmentation block has one of four codes: (1, 1), (1, 0), (0, 1), or (0, 0).
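The (A, B) coding can be sketched as below. This is an illustrative sketch under stated assumptions, not the application's implementation: a segmentation block is treated as inside a shape when its center point is inside (the text does not say how partially covered blocks are handled), both shapes are given as vertex lists, and the inside test is a standard ray-casting check. The names `point_in_polygon` and `encode_blocks` are hypothetical.

```python
def point_in_polygon(x, y, poly):
    """Ray-casting test: True if (x, y) lies inside the polygon."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        # Consider only edges that straddle the horizontal line through y.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def encode_blocks(contour, prediction, cols, rows, cell=1.0):
    """Return a rows x cols grid of (A, B) codes.

    A refers to the polygon contour and B to the prediction frame.
    Assumption: a block is 'inside' when its center is inside.
    """
    grid = []
    for r in range(rows):
        row = []
        for c in range(cols):
            cx, cy = (c + 0.5) * cell, (r + 0.5) * cell
            a = 1 if point_in_polygon(cx, cy, contour) else 0
            b = 1 if point_in_polygon(cx, cy, prediction) else 0
            row.append((a, b))
        grid.append(row)
    return grid
```

For example, with a contour covering columns 0-1 and a prediction frame covering columns 1-2 of a 3x2 template, each row is coded (1, 0), (1, 1), (0, 1). A finer template reduces the error introduced by the center-point assumption at the cost of more blocks.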

S230. Obtain the IOU value by counting the codes of the segmentation blocks, where IOU = (number of blocks coded (1, 1)) / [(number of blocks coded (1, 0)) + (number of blocks coded (0, 1)) + (number of blocks coded (1, 1))].

By definition, IOU = area of the intersection polygon / (area of the polygon contour + area of the prediction frame - area of the intersection polygon).

Here, the area of the intersection polygon is the area of the intersection between the polygon contour and its prediction frame, and the area of the union polygon = area of the polygon contour + area of the prediction frame - area of the intersection polygon. The area of the intersection is the total area of all blocks coded (1, 1), and the area of the union is the area of the blocks coded (1, 0) + the area of the blocks coded (0, 1) + the area of the blocks coded (1, 1). Because all blocks are of equal size, area of the intersection polygon / area of the union polygon = IOU = (number of blocks coded (1, 1)) / [(number of blocks coded (1, 0)) + (number of blocks coded (0, 1)) + (number of blocks coded (1, 1))].
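The counting in S230 can be sketched as follows. This is a minimal illustration, assuming the codes are held in a nested list of (A, B) tuples; the function name `iou_from_codes` is hypothetical.

```python
def iou_from_codes(grid):
    """IOU from a grid of (A, B) block codes.

    IOU = N(1,1) / (N(1,0) + N(0,1) + N(1,1)); blocks coded (0, 0)
    belong to neither shape and are ignored.
    """
    n11 = n10 = n01 = 0
    for row in grid:
        for a, b in row:
            if a == 1 and b == 1:
                n11 += 1      # block in the intersection
            elif a == 1:
                n10 += 1      # block only in the polygon contour
            elif b == 1:
                n01 += 1      # block only in the prediction frame
    union = n11 + n10 + n01
    return n11 / union if union else 0.0
```

A grid with two blocks of each non-zero code gives 2 / (2 + 2 + 2) = 1/3. Because only counts are needed, no polygon intersection has to be constructed, which is consistent with the speed advantage claimed for this method.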

In a specific embodiment, when there is a "shadow" or "mirror image" in the detected scene, detection frames are generated both for the detection target and for its mirror image (or shadow), which can easily lead to the misjudgment that two detection targets exist. To address this, two-dimensional array mapping coding is performed on all obtained candidate detection frames, and the coincidence degree of the coded candidate detection frames is compared; when the coincidence degree of two candidate detection frames is greater than the coincidence threshold, it is determined that a mirror image exists among the targets detected by the two candidate detection frames.

The coincidence threshold here is set to 75%; that is, if the coding coincidence degree of two candidate detection frames reaches 75%, it is determined that interference such as a mirror image or shadow is present, so that the interference can be eliminated.
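The text does not define how the coincidence degree is computed, so the following is only one possible reading, offered as a hedged sketch. It assumes each candidate has already been coded as a binary grid of the same size, takes the coincidence degree to be the fraction of blocks with equal codes, and also compares against a horizontally flipped copy on the assumption that a mirror image is left-right reversed. All function names and the flip step are assumptions, not part of the original description.

```python
def coincidence_degree(code_a, code_b):
    """Fraction of blocks whose codes agree (assumed definition)."""
    if len(code_a) != len(code_b) or any(
        len(ra) != len(rb) for ra, rb in zip(code_a, code_b)
    ):
        raise ValueError("both code grids must have the same shape")
    total = sum(len(row) for row in code_a)
    same = sum(
        1
        for ra, rb in zip(code_a, code_b)
        for x, y in zip(ra, rb)
        if x == y
    )
    return same / total if total else 0.0


def is_mirror_pair(code_a, code_b, threshold=0.75):
    """True if two coded candidates coincide beyond the threshold.

    The flipped comparison is an assumption about mirror geometry.
    """
    flipped = [list(reversed(row)) for row in code_b]
    best = max(
        coincidence_degree(code_a, code_b),
        coincidence_degree(code_a, flipped),
    )
    return best > threshold
```

The strict comparison follows the "greater than the coincidence threshold" wording; if "reaches 75%" is taken literally, `>=` would be used instead.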

In a specific embodiment, calculating the IOU value of the polygon contour includes calculating it by an intersection area method, which includes the following steps:

S310. Obtain the key points of the polygon contour and its prediction frame and label them, where the key points include the vertices of the polygon contour and of its prediction frame, and the intersection points of the polygon contour with the prediction frame;

S320. Sort the intersection points and the internal points to form the point set of the intersection polygon;

S330. Calculate the area of the polygon contour, the area of the prediction frame, and the area of the intersection polygon, and from these calculate the IOU value of the polygon contour: IOU = area of the intersection polygon / (area of the polygon contour + area of the prediction frame - area of the intersection polygon).
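The intersection area method can be sketched as below. Note that this sketch substitutes Sutherland-Hodgman clipping for the key-point labeling and sorting of S310-S320; it is not the procedure described in the text, but it produces an ordered intersection polygon when the prediction frame is an axis-aligned rectangle, which is assumed here. Areas use the shoelace formula. The function names and the (x_min, y_min, x_max, y_max) layout are hypothetical.

```python
def shoelace_area(poly):
    """Area of a simple polygon given as a list of (x, y) vertices."""
    total = 0.0
    for i in range(len(poly)):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % len(poly)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_to_rect(poly, rect):
    """Sutherland-Hodgman clipping of a polygon to an axis-aligned rect."""
    x_min, y_min, x_max, y_max = rect

    def cross_x(x):
        # Intersection of segment p-q with the vertical line at x.
        def f(p, q):
            t = (x - p[0]) / (q[0] - p[0])
            return (x, p[1] + t * (q[1] - p[1]))
        return f

    def cross_y(y):
        # Intersection of segment p-q with the horizontal line at y.
        def f(p, q):
            t = (y - p[1]) / (q[1] - p[1])
            return (p[0] + t * (q[0] - p[0]), y)
        return f

    def clip(points, inside, intersect):
        out = []
        for i in range(len(points)):
            cur, prev = points[i], points[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    edges = [
        (lambda p: p[0] >= x_min, cross_x(x_min)),
        (lambda p: p[0] <= x_max, cross_x(x_max)),
        (lambda p: p[1] >= y_min, cross_y(y_min)),
        (lambda p: p[1] <= y_max, cross_y(y_max)),
    ]
    for inside, intersect in edges:
        if not poly:
            break
        poly = clip(poly, inside, intersect)
    return poly


def contour_iou(contour, rect):
    """IOU = intersection / (contour area + frame area - intersection)."""
    inter = shoelace_area(clip_to_rect(list(contour), rect))
    rect_area = (rect[2] - rect[0]) * (rect[3] - rect[1])
    union = shoelace_area(contour) + rect_area - inter
    return inter / union if union > 0 else 0.0
```

For a right triangle with vertices (0, 0), (4, 0), (0, 4) and a frame (0, 0, 2, 2), the frame lies entirely inside the triangle, so the intersection is 4, the union is 8 + 4 - 4 = 8, and the IOU is 0.5.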

The above method of using Mask R-CNN to select a detection frame is implemented by a Mask R-CNN detection frame selection model. The neural network structure of this model includes the convolutional layers of Mask R-CNN and a RoI Align layer after the convolutional layers. The structure further includes a mask layer, a classifier, and a fully connected layer, and the fully connected layer is used for RoI frame correction training.

Specifically, the neural network structure of the Mask R-CNN detection frame selection model is as follows.

In general, Mask R-CNN segments the target pixels while performing target detection; in other words, it adds a Mask branch network to the basic frame recognition architecture, where the Mask branch network performs pixel segmentation of the target so as to obtain the polygon contour point set of the target.

After the CNN convolutional layer, there is the RoI Align layer, followed by the mask layer, classifier and RoI frame correction training (fully connected layer). Among them, Mask R-CNN inherits the RPN part of Faster R-CNN.

The task is performed as follows: shared convolutional layers extract features from the image to be detected, and the resulting feature maps are sent to the RPN, which generates the frames to be detected (specifying the positions of the RoIs) and performs the first correction of the RoI bounding frames. What follows is the Fast R-CNN architecture: based on the output of the RPN, RoIAlign selects the features corresponding to each RoI on the feature map and sets their dimension to a fixed value. Finally, the fully connected layer (FC layer) classifies the frames and performs the second correction of the target bounding frame, yielding the candidate detection frames (box regression) and the classification.

The other branch is the head part. Mask R-CNN finally expands the output dimension of RoIAlign and predicts a Mask; that is, the result obtained by Mask branch is the point set of the polygon outline.

For Mask R-CNN, mask prediction and classification (together with the candidate detection frame) each have their own training parameters. Before training the Mask R-CNN model, the hyperparameters of the Mask R-CNN model are set to the parameter values of the Faster R-CNN model, and the ResNet50, ResNet101, and FPN networks are used to pre-train the hyperparameters; a large number of samples are then used to train the Mask R-CNN model. After training, test samples are used to test the Mask R-CNN model and verify its accuracy.

In a specific embodiment, the training data set is COCO trainval35k, which has 80 object categories and 1.5 million object instances.

In a specific embodiment, the results obtained from the detection of the trained Mask R-CNN model are stored in a distributed database, so as to update the trained Mask R-CNN model using the distributed database.

In summary, multi-angle images of the target are input to form a sample library; the samples are fed to the Mask R-CNN detection and recognition model for training, image features are extracted in the convolutional layers, and the classification frame of the target, the corresponding target state, and the polygon point set from instance segmentation are finally obtained.

To achieve the above objective, a system 400 for selecting a detection frame using Mask R-CNN includes an instance segmentation module 410 and a target detection frame screening module 420. The instance segmentation module 410 is used to perform instance segmentation on the target image using Mask R-CNN to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame. The target detection frame screening module 420 is configured to calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is selected as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.

The target detection frame screening module 420 includes a first sub-module for obtaining the IOU value of the polygon contour, and the first sub-module for obtaining the IOU value of the polygon contour is used to calculate the IOU value of the polygon contour by a two-dimensional array mapping coding method .

Specifically, the first sub-module for obtaining the IOU value of the polygon contour includes a two-dimensional array mapping unit and a first acquiring unit for the IOU value of the polygon contour. The two-dimensional array mapping unit is used to map the polygon contour and its prediction frame respectively onto a plane template pre-divided by a line segment combination, wherein the line segment combination divides the plane template into equal-sized segmentation blocks; and to map the results for the polygon contour and its prediction frame onto a binary image of the same size as the plane template, representing each segmentation block as a two-dimensional array code of the form (A, B), where A is the coding state of the segmentation block with respect to the polygon contour and B is its coding state with respect to the prediction frame: A=1 when the segmentation block is located inside the polygon contour and A=0 when it is located outside the polygon contour; B=1 when the segmentation block is located inside the prediction frame and B=0 when it is located outside the prediction frame. The first acquiring unit for the IOU value of the polygon contour is used to calculate the IOU value by counting the codes of the segmentation blocks, where IOU = (number of blocks coded (1, 1)) / [(number of blocks coded (1, 0)) + (number of blocks coded (0, 1)) + (number of blocks coded (1, 1))].

In a specific embodiment, the target detection frame screening module 420 includes a second sub-module for obtaining the IOU value of the polygon contour, which is used to calculate the IOU value of the polygon contour by the intersection area method. The second sub-module includes a point set acquiring unit and a second acquiring unit for the IOU value of the polygon contour. The point set acquiring unit is configured to obtain the key points of the polygon contour and its prediction frame and label them, where the key points include the vertices of the polygon contour and of its prediction frame, and the intersection points of the polygon contour with its prediction frame; and to sort the intersection points and the internal points to form the point set of the intersection polygon. The second acquiring unit is used to calculate the area of the polygon contour, the area of its prediction frame, and the area of the intersection polygon, and from these to calculate the IOU value of the polygon contour: IOU = area of the intersection polygon / (area of the polygon contour + area of the prediction frame - area of the intersection polygon).

In a specific embodiment, the value ranges of the first preset threshold IOU₁ and the second preset threshold IOU₂ in the target detection frame screening module are both 0.5-0.7.

In a specific embodiment, the system also includes a mirror screening module 430, which is used to perform two-dimensional array mapping coding on all the selected candidate detection frames and to compare the coincidence degree of the coded candidate detection frames; when the coincidence degree of two candidate detection frames is greater than the coincidence threshold, it is determined that a mirror image exists among the targets detected by the two candidate detection frames.

In a specific embodiment, the aforementioned system for selecting a detection frame using Mask R-CNN is implemented by a Mask R-CNN detection frame selection model. The neural network structure of this model includes the convolutional layers of Mask R-CNN and a RoI Align layer after the convolutional layers.

In a specific embodiment, the neural network structure of the Mask R-CNN selection detection frame model further includes a mask layer, a classifier, and a fully connected layer, and the fully connected layer is used for RoI frame correction training.

This application provides a method for selecting a detection frame using Mask R-CNN, which is applied to an electronic device 5. Referring to FIG. 5, it is a schematic diagram of an application environment of a preferred embodiment of the method for selecting a detection frame by using Mask R-CNN in this application.

In this embodiment, the electronic device 5 may be a terminal device with arithmetic function, such as a server, a smart phone, a tablet computer, a portable computer, a desktop computer, and the like.

The electronic device 5 includes a processor 52, a memory 51, a communication bus 53 and a network interface 54.

The memory 51 includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, or a card-type memory. In some embodiments, the readable storage medium may be an internal storage unit of the electronic device 5, such as a hard disk of the electronic device 5. In other embodiments, the readable storage medium may also be an external storage device of the electronic device 5, such as a plug-in hard disk equipped on the electronic device 5, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card.

In this embodiment, the readable storage medium of the memory 51 is generally used to store the selection program 50 of the detection frame installed in the electronic device 5 and the like. The memory 51 can also be used to temporarily store data that has been output or will be output.

In some embodiments, the processor 52 may be a central processing unit (CPU), a microprocessor or another data processing chip, and is used to run the program code or process the data stored in the memory 51, for example, to execute the detection frame selection program 50.

The communication bus 53 is used to realize the connection and communication between these components.

The network interface 54 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is generally used to establish a communication connection between the electronic device 5 and other electronic devices.

FIG. 5 only shows the electronic device 5 with the components 51-54, but it should be understood that it is not required to implement all the illustrated components, and more or fewer components may be implemented instead.

Optionally, the electronic device 5 may also include a user interface. The user interface may include an input unit such as a keyboard, a voice input device such as a microphone or another device with a voice recognition function, and a voice output device such as a speaker or earphones. Optionally, the user interface may also include a standard wired interface and a wireless interface.

Optionally, the electronic device 5 may also include a display, which may also be referred to as a display screen or a display unit. In some embodiments, it may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an organic light-emitting diode (OLED) touch device, etc. The display is used to display the information processed in the electronic device 5 and to display a visualized user interface.

Optionally, the electronic device 5 may also include a radio frequency (RF) circuit, a sensor, an audio circuit, etc., which will not be repeated here.

In the device embodiment shown in FIG. 5, the memory 51, as a computer storage medium, may include an operating system and a detection frame selection program 50. When the processor 52 executes the detection frame selection program 50 stored in the memory 51, the following steps are implemented: S110. Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame; S120. Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is screened out as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.
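The screening decision of step S120 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the function names `box_iou` and `passes_screening`, the `(x_min, y_min, x_max, y_max)` box layout, and the default threshold values are assumptions chosen for illustration, and the polygon contour IOU is taken as an already computed input.

```python
from typing import Tuple

# Assumed box layout: (x_min, y_min, x_max, y_max)
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """IOU of two axis-aligned rectangles."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def passes_screening(frame_iou: float, contour_iou: float,
                     iou_1: float = 0.5, iou_2: float = 0.7) -> bool:
    """Two-stage screening: the rectangular frame IOU must exceed iou_1
    and the polygon contour IOU must exceed iou_2, with iou_2 > iou_1."""
    if not iou_2 > iou_1:
        raise ValueError("the second threshold must be greater than the first")
    return frame_iou > iou_1 and contour_iou > iou_2


if __name__ == "__main__":
    frame_iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
    print(frame_iou, passes_screening(frame_iou, contour_iou=0.8))
```

A candidate whose rectangular frame passes the first threshold is still rejected if its polygon contour IOU does not exceed the stricter second threshold.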

In other embodiments, the detection frame selection program 50 may also be divided into one or more modules, and the one or more modules are stored in the memory 51 and executed by the processor 52 to implement this application. A module as referred to in this application is a series of computer program instruction segments that can complete a specific function.

In addition, an embodiment of the present application also proposes a computer-readable storage medium. The computer-readable storage medium includes a detection frame selection program, and when the detection frame selection program is executed by a processor, the following operations are implemented: S110. Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and its polygon contour; S120. Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is screened out as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.

The computer-readable storage medium described in this application may be a non-volatile computer-readable storage medium. The specific implementation of the computer-readable storage medium of the present application is substantially the same as the specific implementation of the above-mentioned method and system for selecting a detection frame using Mask R-CNN, and the electronic device, and will not be repeated here.

In general, this application uses the Mask R-CNN neural network as follows: the monitored image is repeatedly convolved and pooled in the deep neural network, and the key features of the image are extracted and processed to obtain the rectangular frames of the objects in the image; the rectangular frames are preliminarily screened by the IOU value of their overlap with the real target; the polygon contour obtained from the mask is then used to calculate the polygon IOU value between the polygon point set and the real target for a second screening; and the frame that finally meets the set thresholds is used as the detection frame.

It should be noted that in this document, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, device, article or method including a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to the process, device, article, or method. Without further restrictions, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.

The serial numbers of the foregoing embodiments of the present application are only for description, and do not represent the advantages and disadvantages of the embodiments. Through the description of the above implementation manners, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus the necessary general hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part that contributes to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, magnetic disks, optical disks) and includes a number of instructions to enable a terminal device (which can be a mobile phone, a computer, a server, a network device, etc.) to execute the method described in each embodiment of the present application.

The above are only preferred embodiments of this application and do not limit the scope of its patent. Any equivalent structure or equivalent process transformation made using the content of the description and drawings of this application, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of this application.

Claims

A method for selecting a detection frame using Mask R-CNN, applied to an electronic device, is characterized in that the method includes:

Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame;

Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is screened out as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.
The method for selecting a detection frame using Mask R-CNN according to claim 1, wherein calculating the IOU value of the polygon contour comprises calculating the IOU value of the polygon contour by a two-dimensional array mapping coding method;

The two-dimensional array mapping coding method includes:

Mapping the polygon contour and its prediction frame to a plane template previously divided by a line segment combination, wherein the line segment combination divides the plane template into equal-sized segmented blocks;

The mapping results of the polygon contour and its prediction frame are respectively mapped to a binary image of the same size as the plane template, and each segmented block is represented by a two-dimensional mapping code of the form (A, B), where A is the coding state of the segmented block with respect to the polygon contour and B is the coding state of the segmented block with respect to the prediction frame;

When the segmented block is located inside the polygon contour, A=1, and when it is located outside the polygon contour, A=0; when the segmented block is located inside the prediction frame, B=1, and when it is located outside the prediction frame, B=0;

Obtain the IOU value by counting the codes of the segmented blocks, where IOU = number of segmented blocks coded as (1, 1) / [number of segmented blocks coded as (1, 0) + number of segmented blocks coded as (0, 1) + number of segmented blocks coded as (1, 1)].
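The two-dimensional array mapping coding described in this claim can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the claimed implementation: the function names are hypothetical, the prediction frame is assumed to be an axis-aligned rectangle `(x_min, y_min, x_max, y_max)`, the polygon contour is assumed to be a list of vertices, and a segmented block is treated as inside a shape when its center lies inside that shape.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def point_in_polygon(x: float, y: float, polygon: Sequence[Point]) -> bool:
    """Even-odd ray casting test for a point against a simple polygon."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # the edge straddles the horizontal ray
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def grid_coded_iou(contour: Sequence[Point], frame: Box,
                   width: float, height: float, cell: float = 1.0) -> float:
    """Code every block of the plane template as (A, B) and count the codes."""
    counts = {(1, 1): 0, (1, 0): 0, (0, 1): 0}
    for row in range(int(height / cell)):
        for col in range(int(width / cell)):
            cx, cy = (col + 0.5) * cell, (row + 0.5) * cell  # block center
            a = int(point_in_polygon(cx, cy, contour))
            b = int(frame[0] <= cx <= frame[2] and frame[1] <= cy <= frame[3])
            if (a, b) in counts:  # blocks coded (0, 0) are ignored
                counts[(a, b)] += 1
    denominator = sum(counts.values())
    return counts[(1, 1)] / denominator if denominator else 0.0


if __name__ == "__main__":
    square = [(0, 0), (4, 0), (4, 4), (0, 4)]
    print(grid_coded_iou(square, (2, 0, 6, 4), width=8, height=4))
```

A smaller `cell` gives a finer plane template and a closer approximation of the true area ratio, at the cost of more blocks to code.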
The method for selecting a detection frame using Mask R-CNN according to claim 1, wherein calculating the IOU value of the polygon contour comprises calculating the IOU value of the polygon contour by an intersection area method;

The method of intersection area includes:

Obtaining key points of the polygon contour and its prediction frame, and labeling the key points, where the key points include the vertices of the polygon contour and its prediction frame and the intersections of the polygon contour and its prediction frame;

The intersection points and the vertices located inside the intersection region are sorted to form the point set of the intersection polygon;

Calculate the area of the polygon contour, the area of its prediction frame, and the area of the intersection polygon, and calculate the IOU value of the polygon contour from these areas: IOU = area of the intersection polygon / (area of the polygon contour + area of the prediction frame - area of the intersection polygon).
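The area-based IOU formula above can be illustrated with the following Python sketch. Note the substitution: instead of labeling and sorting key points as recited in the claim, the sketch obtains the intersection polygon with Sutherland-Hodgman clipping, and it assumes the prediction frame is an axis-aligned rectangle `(x_min, y_min, x_max, y_max)`. All function names are hypothetical.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def polygon_area(points: Sequence[Point]) -> float:
    """Absolute polygon area by the shoelace formula."""
    total = 0.0
    for i in range(len(points)):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % len(points)]
        total += x1 * y2 - x2 * y1
    return abs(total) / 2.0


def clip_to_rectangle(polygon: Sequence[Point], rect: Box) -> List[Point]:
    """Sutherland-Hodgman clipping of a polygon against a rectangle."""
    x_min, y_min, x_max, y_max = rect

    def cross_x(x):  # intersection of segment p-q with the vertical line at x
        return lambda p, q: (x, p[1] + (q[1] - p[1]) * (x - p[0]) / (q[0] - p[0]))

    def cross_y(y):  # intersection of segment p-q with the horizontal line at y
        return lambda p, q: (p[0] + (q[0] - p[0]) * (y - p[1]) / (q[1] - p[1]), y)

    def clip(points, inside, intersect):
        out = []
        for i in range(len(points)):
            cur, prev = points[i], points[i - 1]
            if inside(cur):
                if not inside(prev):
                    out.append(intersect(prev, cur))
                out.append(cur)
            elif inside(prev):
                out.append(intersect(prev, cur))
        return out

    pts = list(polygon)
    for inside, intersect in (
        (lambda p: p[0] >= x_min, cross_x(x_min)),
        (lambda p: p[0] <= x_max, cross_x(x_max)),
        (lambda p: p[1] >= y_min, cross_y(y_min)),
        (lambda p: p[1] <= y_max, cross_y(y_max)),
    ):
        pts = clip(pts, inside, intersect)
        if not pts:
            break
    return pts


def polygon_frame_iou(contour: Sequence[Point], frame: Box) -> float:
    """IOU = intersection / (contour area + frame area - intersection)."""
    inter = polygon_area(clip_to_rectangle(contour, frame))
    frame_area = (frame[2] - frame[0]) * (frame[3] - frame[1])
    union = polygon_area(contour) + frame_area - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    print(polygon_frame_iou([(0, 0), (4, 0), (0, 4)], (0, 0, 2, 2)))
```

Unlike the block-counting approach, this computes exact areas from the vertex coordinates, so its accuracy does not depend on a grid resolution.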
The method for selecting a detection frame using Mask R-CNN according to claim 1, wherein the first preset threshold IOU₁ and the second preset threshold IOU₂ are both in the range of 0.5-0.7.
The method for selecting a detection frame using Mask R-CNN according to claim 2, characterized in that, after the screening of the candidate detection frame as the target detection frame, the method further comprises:

Perform two-dimensional array mapping coding on all candidate detection frames selected;

Compare the coincidence degrees of the encoded candidate detection frames;

When the coincidence degree of the two candidate detection frames is greater than the coincidence threshold, it is determined that there is a mirror image in the target detected by the two candidate detection frames.
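The mirror check in this claim can be sketched in Python as below. The claim does not define how the coincidence degree is computed, so the sketch assumes it is the ratio of shared blocks to all blocks covered by either encoded frame; that definition, the function names, the `(x_min, y_min, x_max, y_max)` box layout, and the threshold value 0.8 are all assumptions for illustration.

```python
import math
from itertools import combinations
from typing import List, Sequence, Set, Tuple

Box = Tuple[float, float, float, float]  # assumed (x_min, y_min, x_max, y_max)


def encode_box(box: Box, cell: float = 1.0) -> Set[Tuple[int, int]]:
    """Encode a frame as the set of (row, col) blocks whose centers it covers."""
    x_min, y_min, x_max, y_max = box
    cols = range(math.ceil(x_min / cell - 0.5), math.floor(x_max / cell - 0.5) + 1)
    rows = range(math.ceil(y_min / cell - 0.5), math.floor(y_max / cell - 0.5) + 1)
    return {(r, c) for r in rows for c in cols}


def coincidence(a: Set[Tuple[int, int]], b: Set[Tuple[int, int]]) -> float:
    """Assumed coincidence degree: shared blocks / blocks covered by either."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_mirror_pairs(boxes: Sequence[Box], threshold: float = 0.8,
                      cell: float = 1.0) -> List[Tuple[int, int]]:
    """Index pairs of frames whose coincidence degree exceeds the threshold."""
    codes = [encode_box(b, cell) for b in boxes]
    return [(i, j) for i, j in combinations(range(len(boxes)), 2)
            if coincidence(codes[i], codes[j]) > threshold]


if __name__ == "__main__":
    frames = [(0, 0, 10, 10), (0, 0, 10, 9), (20, 20, 30, 30)]
    print(find_mirror_pairs(frames))
```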
The method for selecting a detection frame using Mask R-CNN according to claim 1, wherein the method for selecting a detection frame using Mask R-CNN is implemented by a Mask R-CNN selection detection frame model, and the Mask R-CNN The neural network structure for selecting the detection frame model includes a convolutional layer Mask R-CNN and a RoI Align layer behind the Mask R-CNN.
The method for selecting a detection frame using Mask R-CNN according to claim 6, wherein the neural network structure of the Mask R-CNN selection detection frame model further includes a mask layer, a classifier, and a fully connected layer, and the fully connected layer is used for RoI frame correction training.
A system for selecting a detection frame using Mask R-CNN is characterized in that it includes an instance segmentation module and a target detection frame screening module; wherein,

The instance segmentation module is configured to use Mask R-CNN to perform instance segmentation on a target image to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame;

The target detection frame screening module is configured to calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is screened out as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.
The system for selecting a detection frame using Mask R-CNN according to claim 8, wherein the target detection frame screening module comprises a first sub-module for obtaining the IOU value of the polygon contour, and the first sub-module for obtaining the IOU value of the polygon contour is used to calculate the IOU value of the polygon contour through a two-dimensional array mapping coding method;

The first sub-module for acquiring the IOU value of the polygon contour includes a two-dimensional array mapping unit and a first acquiring unit of the IOU value of the polygon contour; wherein,

The two-dimensional array mapping unit is configured to map the polygon contour and its prediction frame to a plane template pre-divided by a line segment combination, wherein the line segment combination divides the plane template into equal-sized segmented blocks; the mapping results of the polygon contour and its prediction frame are respectively mapped to a binary image of the same size as the plane template, and each segmented block is represented by a two-dimensional mapping code of the form (A, B), where A is the coding state of the segmented block with respect to the polygon contour and B is the coding state of the segmented block with respect to the prediction frame; when the segmented block is located inside the polygon contour, A=1, and when it is located outside the polygon contour, A=0; when the segmented block is located inside the prediction frame, B=1, and when it is located outside the prediction frame, B=0;

The first acquiring unit of the IOU value of the polygon contour is used to obtain the IOU value by counting the codes of the segmented blocks, where IOU = number of segmented blocks coded as (1, 1) / [number of segmented blocks coded as (1, 0) + number of segmented blocks coded as (0, 1) + number of segmented blocks coded as (1, 1)].
The system for selecting a detection frame using Mask R-CNN according to claim 8, wherein the target detection frame screening module includes a second sub-module for obtaining the IOU value of the polygon contour, and the second sub-module for obtaining the IOU value of the polygon contour is used to calculate the IOU value of the polygon contour by the intersection and union area method;

The second sub-module for acquiring the IOU value of the polygon contour includes a point set acquiring unit and a second acquiring unit of the IOU value of the polygon contour; wherein,

The point set acquisition unit is configured to obtain key points of the polygon contour and its prediction frame and to label the key points, wherein the key points include the vertices of the polygon contour and of its prediction frame and each intersection point of the polygon contour with its prediction frame; the intersection points and the vertices located inside the intersection region are sorted to form the point set of the intersection polygon;

The second acquiring unit of the IOU value of the polygon contour is used to calculate the area of the polygon contour, the area of its prediction frame, and the area of the intersection polygon, and to calculate the IOU value of the polygon contour from these areas: IOU = area of the intersection polygon / (area of the polygon contour + area of the prediction frame - area of the intersection polygon).
The system for selecting a detection frame using Mask R-CNN according to claim 8, wherein the first preset threshold IOU₁ and the second preset threshold IOU₂ in the target detection frame screening module are both in the range of 0.5-0.7.
The system for selecting a detection frame using Mask R-CNN according to claim 8, characterized in that it further comprises a mirror screening module for performing two-dimensional array mapping coding on all selected candidate detection frames, comparing the coincidence degrees of the encoded candidate detection frames, and, when the coincidence degree of two candidate detection frames is greater than the coincidence threshold, determining that there is a mirror image in the target detected by the two candidate detection frames.
The system for selecting a detection frame using Mask R-CNN according to claim 8, wherein the system for selecting a detection frame using Mask R-CNN is implemented by a Mask R-CNN selection detection frame model, and the Mask R-CNN The neural network structure for selecting the detection frame model includes a convolutional layer Mask R-CNN and a RoI Align layer behind the Mask R-CNN.
The system for selecting a detection frame using Mask R-CNN according to claim 8, wherein the neural network structure of the Mask R-CNN selection detection frame model further includes a mask layer, a classifier, and a fully connected layer, and the fully connected layer is used for RoI frame correction training.
An electronic device, characterized in that the electronic device includes a memory and a processor, the memory includes a detection frame selection program, and the following steps are implemented when the detection frame selection program is executed by the processor:

Use Mask R-CNN to perform instance segmentation on the target image to obtain a rectangular candidate detection frame and a polygon contour corresponding to the candidate detection frame;

Calculate the IOU values of the candidate detection frame and the polygon contour respectively; when the IOU value of the candidate detection frame is greater than the first preset threshold IOU₁ and the IOU value of the polygon contour is greater than the second preset threshold IOU₂, the candidate detection frame is screened out as the target detection frame; wherein the second preset threshold IOU₂ is greater than the first preset threshold IOU₁.
The electronic device according to claim 15, wherein calculating the IOU value of the polygon contour comprises calculating the IOU value of the polygon contour by a two-dimensional array mapping coding method;

Mapping the polygon contour and its prediction frame to a plane template previously divided by a line segment combination, wherein the line segment combination divides the plane template into equal-sized segmented blocks;

The mapping results of the polygon contour and its prediction frame are respectively mapped to a binary image of the same size as the plane template, and each segmented block is represented by a two-dimensional mapping code of the form (A, B), where A is the coding state of the segmented block with respect to the polygon contour and B is the coding state of the segmented block with respect to the prediction frame;

When the segmented block is located inside the polygon contour, A=1, and when it is located outside the polygon contour, A=0; when the segmented block is located inside the prediction frame, B=1, and when it is located outside the prediction frame, B=0;

Obtain the IOU value by counting the codes of the segmented blocks, where IOU = number of segmented blocks coded as (1, 1) / [number of segmented blocks coded as (1, 0) + number of segmented blocks coded as (0, 1) + number of segmented blocks coded as (1, 1)].
The electronic device according to claim 15, wherein:

The value ranges of the first preset threshold IOU₁ and the second preset threshold IOU₂ are both 0.5-0.7.
The electronic device according to claim 15, wherein after the screening of the candidate detection frame as the target detection frame, the steps further comprise:

Perform two-dimensional array mapping coding on all candidate detection frames selected;

Compare the coincidence degree of the encoded candidate detection frame;

When the coincidence degree of the two candidate detection frames is greater than the coincidence threshold, it is determined that there is a mirror image in the target detected by the two candidate detection frames.
The electronic device according to claim 15, characterized in that the detection frame selection program included in the memory is implemented by a Mask R-CNN selection detection frame model, and the neural network structure of the Mask R-CNN selection detection frame model includes a convolutional layer Mask R-CNN and a RoI Align layer behind the Mask R-CNN.
A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, the computer program includes a detection frame selection program, and when the detection frame selection program is executed by a processor, the steps of the method for selecting a detection frame using Mask R-CNN according to any one of claims 1 to 7 are implemented.