CN110619350A - Image detection method, device and storage medium - Google Patents


Publication number
CN110619350A
Authority
CN
China
Prior art keywords
image
target
detection
model
detection frame
Prior art date
Legal status
Granted
Application number
CN201910741273.4A
Other languages
Chinese (zh)
Other versions
CN110619350B (en)
Inventor
张水发
李岩
王思博
刘畅
Current Assignee
Beijing Dajia Internet Information Technology Co Ltd
Original Assignee
Beijing Dajia Internet Information Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Dajia Internet Information Technology Co Ltd
Priority to CN201910741273.4A
Publication of CN110619350A
Application granted
Publication of CN110619350B
Legal status: Active

Classifications

    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/24 Classification techniques
    • G06V2201/07 Target detection


Abstract

The present disclosure relates to an image detection method, apparatus, and storage medium. The method includes: acquiring a feature map of a target image; obtaining, according to the feature map, a first detection result of the target image through an image detection model, where the first detection result includes a detection frame contained in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category; in response to the confidence of the first prediction category satisfying a first threshold condition, acquiring a target detection frame corresponding to the first prediction category; and extracting image features of the target detection frame through a preset classification model and obtaining a final prediction category of the target detection frame based on those image features, where the classification model is independent of the image detection model and is cascaded with it. The scheme improves both the accuracy of the image detection result and the applicability of the image detection method.

Description

Image detection method, device and storage medium
Technical Field
The present disclosure relates to the field of image processing technologies, and in particular, to an image detection method and apparatus, and a storage medium.
Background
With the rapid development of artificial intelligence, network models are applied in an ever-wider range of scenarios. For example, image recognition, text recognition, and similar tasks may all be performed by such models.
In the related art, deep-learning-based image detection methods fall into two main categories: one-stage detection and two-stage detection. Because category regression requires translation invariance while bounding-box (bbox) regression requires translation variance, both kinds of methods place category regression and bbox regression in a single network as a compromise between these conflicting requirements.
However, for some application scenarios the accuracy of such detection methods is still poor. For example, in OCR (Optical Character Recognition) the detected text regions often have extreme aspect ratios (a large width with a small height, or the reverse), so the score of a detection frame is strongly affected by inaccurate bbox regression. As a result, the accuracy of the detection result is poor and the applicability of the image detection method is limited.
Disclosure of Invention
The present disclosure provides an image detection method, an image detection device and a storage medium, which at least solve the problems of poor applicability and poor accuracy of detection results of image detection methods in the related art. The technical scheme of the disclosure is as follows:
according to a first aspect of the embodiments of the present disclosure, there is provided an image detection method, including:
acquiring a feature map of a target image;
obtaining, according to the feature map, a first detection result of the target image through an image detection model, wherein the first detection result comprises a detection frame contained in the target image, a first prediction category of the detection frame and a confidence of the first prediction category;
in response to the confidence of the first prediction category satisfying a first threshold condition, acquiring a target detection frame corresponding to the first prediction category;
extracting image features of the target detection frame through a preset classification model, and obtaining a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
Optionally, the classification model includes N cascaded feature extraction models, where N is a positive integer, and the step of extracting the image features of the target detection frame through a preset classification model and obtaining the final prediction category of the target detection frame based on the image features includes:
A1, according to the cascade order of the feature extraction models, extracting the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model;
A2, obtaining a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
A3, in response to the first confidence satisfying a second threshold condition, taking the next feature extraction model after the target feature extraction model as the new target feature extraction model, extracting the image features of the target detection frame through it, and then returning to A2, until the target feature extraction model is the Nth feature extraction model;
A4, in response to the first confidence not satisfying the second threshold condition, taking the current prediction category as the final prediction category of the target detection frame.
Optionally, before the step of extracting, by a preset classification model, an image feature of the target detection frame, and obtaining a final prediction category of the target detection frame based on the image feature, the method further includes:
acquiring a plurality of training sample images which are manually calibrated, and performing data enhancement on the training sample images to obtain a training sample image set;
and training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely consistent.
Optionally, the step of obtaining the feature map of the target image includes:
and scaling the target image to a first image of a preset size, and acquiring the feature map of the first image through a preset feature extraction network.
Optionally, in a case that the image detection model is a two-stage Faster R-CNN model, the step of obtaining a first detection result of the target image through the image detection model according to the feature map includes:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image;
intercepting, from the feature map, a feature region corresponding to the suggested detection frame;
and inputting the characteristic region into a pooling layer of the image detection model to perform category regression and detection frame regression on the characteristic region to obtain a first detection result of the target image.
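The "intercept a feature region and pool it" step above can be sketched concretely. The following is a deliberately simplified stand-in for RoI pooling (one box, max pooling onto a fixed grid, no spatial-scale handling); function and variable names are illustrative, not from the disclosure:

```python
import numpy as np

def crop_and_pool(feature_map, box, out_size=2):
    """Cut the feature region for one suggested detection frame out of a
    (C, H, W) feature map and max-pool it to (C, out_size, out_size).
    box is (x1, y1, x2, y2) in feature-map coordinates; the region is
    assumed to be at least out_size pixels in each dimension."""
    x1, y1, x2, y2 = box
    region = feature_map[:, y1:y2, x1:x2]          # intercept the region
    c, h, w = region.shape
    pooled = np.zeros((c, out_size, out_size))
    # Split the region into an out_size x out_size grid, max over each cell.
    ys = np.linspace(0, h, out_size + 1).astype(int)
    xs = np.linspace(0, w, out_size + 1).astype(int)
    for i in range(out_size):
        for j in range(out_size):
            pooled[:, i, j] = region[:, ys[i]:ys[i+1], xs[j]:xs[j+1]].max(axis=(1, 2))
    return pooled
```

Real two-stage detectors use RoI pooling or RoI align layers that also map image coordinates to feature-map coordinates; this sketch only shows the crop-then-pool idea.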
Optionally, the step of inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image includes:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and carrying out non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
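Non-maximum suppression as used here keeps the highest-scoring detection frame and discards overlapping lower-scoring ones. A minimal NumPy sketch of the standard greedy procedure (the disclosure does not specify a particular variant):

```python
import numpy as np

def nms(boxes, scores, iou_threshold=0.5):
    """Greedy non-maximum suppression over axis-aligned boxes.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,).
    Returns indices of the kept boxes, highest score first."""
    x1, y1, x2, y2 = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]            # candidates, best first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        # Intersection of the kept box with every remaining candidate.
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        # Drop candidates that overlap the kept box too much.
        order = order[1:][iou <= iou_threshold]
    return keep
```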
According to a second aspect of the embodiments of the present disclosure, there is provided an image detection apparatus including:
a feature map acquisition module configured to perform acquisition of a feature map of a target image;
a first image detection module configured to perform obtaining, according to the feature map, a first detection result of the target image through an image detection model, where the first detection result includes a detection frame included in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category;
a target detection frame acquisition module configured to execute, in response to the confidence of the first prediction category satisfying a first threshold condition, acquiring a target detection frame corresponding to the first prediction category;
the second image detection module is configured to extract image features of the target detection frame through a preset classification model and obtain a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
Optionally, the classification model includes N cascaded feature extraction models, where N is a positive integer, and the second image detection module includes:
the first image feature extraction sub-module is configured to extract, according to the cascade order of the feature extraction models, the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model;
a detection frame classification sub-module configured to obtain a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
the first image feature extraction sub-module is further configured to, in response to the first confidence satisfying a second threshold condition, take the next feature extraction model after the target feature extraction model as the new target feature extraction model, extract the image features of the target detection frame through it, and then trigger the detection frame classification sub-module again, until the target feature extraction model is the Nth feature extraction model;
a final prediction category validation sub-module configured to, in response to the first confidence not satisfying the second threshold condition, take the current prediction category as the final prediction category of the target detection frame.
Optionally, the apparatus further comprises:
the training sample image acquisition module is configured to acquire a plurality of artificially calibrated training sample images and perform data enhancement on the training sample images to obtain a training sample image set;
and the classification model training module is configured to execute training of the classification model according to the training sample image set, and training sample images corresponding to any two feature extraction models are not completely consistent.
Optionally, the feature map obtaining module is further configured to scale the target image into a first image with a preset size, and obtain the feature map of the first image through a preset feature extraction network.
Optionally, in a case that the image detection model is a two-stage Faster R-CNN model, the first image detection module includes:
a suggested detection frame obtaining sub-module configured to perform a region generation network for inputting the feature map into the image detection model, so as to perform category regression and detection frame regression on the feature map, thereby obtaining a suggested detection frame of the target image;
a feature region interception sub-module configured to intercept, from the feature map, a feature region corresponding to the suggested detection frame;
the detection result acquisition submodule is configured to input the feature region into a pooling layer of the image detection model so as to perform category regression and detection frame regression on the feature region to obtain a first detection result of the target image.
Optionally, the suggestion detection frame obtaining sub-module includes:
an initial detection frame obtaining unit configured to perform a region generation network for inputting the feature map into the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and the suggestion detection frame acquisition unit is configured to perform non-maximum suppression on the initial detection frame to obtain the suggestion detection frame.
According to a third aspect of the embodiments of the present disclosure, there is provided an image detection apparatus, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement any one of the image detection methods as described above.
According to a fourth aspect of the embodiments of the present disclosure, there is provided a storage medium having instructions that, when executed by a processor of an image detection apparatus, enable the image detection apparatus to perform any one of the image detection methods as described above.
According to a fifth aspect of the embodiments of the present disclosure, there is provided a computer program product, wherein instructions that, when executed by a processor of an image detection apparatus, enable the image detection apparatus to perform any one of the image detection methods as described above.
The technical scheme provided by the embodiment of the disclosure at least brings the following beneficial effects:
In the embodiment of the disclosure, a feature map of a target image is acquired; a first detection result of the target image is obtained through an image detection model according to the feature map, where the first detection result includes a detection frame contained in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category; in response to the confidence of the first prediction category satisfying a first threshold condition, a target detection frame corresponding to the first prediction category is acquired; image features of the target detection frame are extracted through a preset classification model, and a final prediction category of the target detection frame is obtained based on those image features, where the classification model is independent of the image detection model and is cascaded with it. The scheme thus improves both the accuracy of the image detection result and the applicability of the image detection scheme.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present disclosure and, together with the description, serve to explain the principles of the disclosure and are not to be construed as limiting the disclosure.
FIG. 1 is one of the flow charts illustrating one method of image detection according to one exemplary embodiment.
Fig. 2 is a second flowchart illustrating an image detection method according to an exemplary embodiment.
Fig. 3 is one of block diagrams illustrating an image detection apparatus according to an exemplary embodiment.
Fig. 4 is a second block diagram of an image detection apparatus according to an exemplary embodiment.
Fig. 5 is a third block diagram of an image detection apparatus according to an exemplary embodiment.
Fig. 6 is a fourth block diagram of an image detection apparatus according to an exemplary embodiment.
Detailed Description
In order to make the technical solutions of the present disclosure better understood by those of ordinary skill in the art, the technical solutions in the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the above-described drawings are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the disclosure described herein are capable of operation in sequences other than those illustrated or otherwise described herein. The implementations described in the exemplary embodiments below are not intended to represent all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with certain aspects of the present disclosure, as detailed in the appended claims.
Fig. 1 is a flowchart illustrating an image detection method according to an exemplary embodiment, where the image detection method includes the following steps, as shown in fig. 1:
in step S11, a feature map of the target image is acquired.
In the embodiment of the present disclosure, in order to detect a target image, first, features of the target image need to be extracted, and a feature map of the target image is obtained. The features of the target image may be extracted in any available manner, and the embodiment of the present disclosure is not limited thereto.
For example, the features of the target image may be extracted through a feature extraction network such as VGG16 (Visual Geometry Group), Inception v1, Inception v2, ResNet (Residual Neural Network), Inception-ResNet, and the like, so as to obtain a feature map of the target image.
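The networks named above are large pretrained backbones. As a self-contained illustration of what "extracting a feature map" means at its core, here is a single hand-written convolution over a grayscale image; real backbones stack many such filters with nonlinearities and pooling, so this is only the basic operation, not any of the named networks:

```python
import numpy as np

def conv_feature_map(image, kernel):
    """Valid 2-D convolution (actually cross-correlation, as in deep
    learning frameworks) of a grayscale image with one kernel, yielding
    a single-channel feature map."""
    kh, kw = kernel.shape
    h, w = image.shape
    out = np.zeros((h - kh + 1, w - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Each output value summarizes one kh x kw patch of the image.
            out[i, j] = (image[i:i+kh, j:j+kw] * kernel).sum()
    return out
```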
In step S12, a first detection result of the target image is obtained through an image detection model according to the feature map, where the first detection result includes a detection frame contained in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category.
After the feature map of the target image is obtained, a first detection result of the target image can be obtained through an image detection model according to the feature map. The image detection model may be any model that can be used for image detection, such as a one-stage YOLO (You Only Look Once) model or SSD model, or a two-stage Faster R-CNN model, R-FCN model, Cascade R-CNN model, and the like.
In addition, the target image to be detected may generally be divided into a background region and a foreground region, where the foreground region may also include images of a plurality of different objects, and therefore, in the embodiment of the present disclosure, in order to facilitate a relevant user to correspondingly know a detection region in the target image corresponding to each prediction category and the accuracy of each prediction category, a detection result obtained by using the image detection model may include, but is not limited to, a detection frame included in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category.
In step S13, in response to that the confidence of the first prediction category satisfies a first threshold condition, a target detection box corresponding to the first prediction category is acquired.
As described above, the adaptability of the image detection model is limited: in some scenes, inaccurate bbox regression in the image detection model strongly affects the detection result, so the confidence of the prediction category of some detection frames is low.
Therefore, in the embodiment of the present disclosure, in order to improve the accuracy of the finally obtained prediction category, a threshold condition may be set in advance, and the detection frames whose confidence satisfies it are classified again. For example, the threshold condition may be the interval [0.3, 0.9].
Specifically, in response to that the confidence of the first prediction category satisfies the first threshold condition, the target detection frame corresponding to the corresponding first prediction category may be acquired. And for the first prediction category of which the confidence coefficient does not meet the first threshold condition, the corresponding first prediction category can be directly used as the final prediction category of the corresponding detection frame.
For example, assume that the first detection result of the target image obtained by the image detection model from its feature map includes: the first prediction category b1 and confidence c1 of detection frame a1, and the first prediction category b2 and confidence c2 of detection frame a2. If the first threshold condition is [0.3, 0.9], c1 is greater than 0.3 and less than 0.9, and c2 is greater than 0.9, then detection frame a1 is obtained as the target detection frame, while for detection frame a2 the first prediction category b2 may be taken directly as its final prediction category.
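The routing in this example can be sketched as follows. The interval [0.3, 0.9] mirrors the example above, and all names are illustrative rather than part of the disclosure:

```python
def route_detections(detections, low=0.3, high=0.9):
    """Split detector outputs per the first threshold condition: a frame
    whose confidence lies inside [low, high] is sent to the cascaded
    classification model; otherwise its first prediction category is kept
    as the final category. detections: list of (box_id, category, conf)."""
    to_reclassify, final = [], {}
    for box_id, category, conf in detections:
        if low <= conf <= high:           # uncertain: re-classify
            to_reclassify.append(box_id)
        else:                             # confident either way: keep as-is
            final[box_id] = category
    return to_reclassify, final
```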
In step S14, extracting image features of the target detection frame through a preset classification model, and obtaining a final prediction category of the target detection frame based on the image features; wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
For the target detection frame with the confidence coefficient meeting the first threshold condition, the image features of the target detection frame can be further extracted through a preset classification model, and the final prediction category of the corresponding target detection frame is obtained based on the image features of the target detection frame.
Moreover, in the embodiment of the present disclosure, in order to improve the accuracy of the classification model, the classification model may be set independently of the image detection model and cascaded with it. That is, the input data of the classification model comprises the target detection frames output by the image detection model whose confidence satisfies the first threshold condition. After receiving a target detection frame, the classification model extracts image features of the region of the target image corresponding to that detection frame, and identifies the target detection frame based on those features to obtain its final prediction category. Because the classification model is completely independent of the detection frame regression task of the image detection model, it has good translation invariance, which benefits the final classification result and can greatly improve the detection accuracy.
The type of the model for performing feature extraction in the classification model may be the same as or different from the type of the feature extraction network when the feature map of the target image is obtained in step S11, and the embodiment of the present disclosure is not limited thereto. However, in order to ensure that the image features extracted by the classification model do not completely match the image features corresponding to the corresponding detection frames in the feature map, when the model type for performing feature extraction in the classification model matches the type of the feature extraction network used when the feature map of the target image is acquired in step S11, the model parameters for performing feature extraction in the classification model do not completely match the parameters of the feature extraction network used when the feature map of the target image is acquired in step S11.
In addition, in the embodiment of the present disclosure, when the final prediction category of the target detection frame is obtained through the preset classification model, the final prediction category of the target detection frame may also be obtained by combining the image feature of the target detection frame in the feature map extracted in step S11 with the image feature of the target detection frame extracted based on the classification model, which is not limited in the embodiment of the present disclosure.
In the embodiment of the disclosure, a feature map of a target image is obtained; according to the feature map, obtaining a first detection result of the target image through an image detection model, wherein the first detection result comprises a detection frame contained in the target image, a first prediction type of the detection frame and a confidence coefficient of the first prediction type; responding to the confidence coefficient of the first prediction category meeting a first threshold condition, and acquiring a target detection frame corresponding to the first prediction category; extracting image features of the target detection frame through a preset classification model, and acquiring a final prediction category of the target detection frame based on the image features; wherein the classification model is independent of the image detection model and is cascaded with the image detection model. Therefore, the method has the advantages of improving the accuracy of the image detection result and the applicability of the image detection scheme.
Referring to fig. 2, in the embodiment of the present disclosure, the classification model includes N cascaded feature extraction models, where N is a positive integer. The step S14 may further include:
A1, according to the cascade order of the feature extraction models, extracting the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model;
A2, obtaining a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
A3, in response to the first confidence satisfying a second threshold condition, taking the next feature extraction model after the target feature extraction model as the new target feature extraction model, extracting the image features of the target detection frame through it, and then returning to A2, until the target feature extraction model is the Nth feature extraction model;
A4, in response to the first confidence not satisfying the second threshold condition, taking the current prediction category as the final prediction category of the target detection frame.
In the embodiment of the present disclosure, in order to further improve the accuracy of the image detection result, the emphasis is on improving the accuracy of the extracted image features. Therefore, in the embodiment of the present disclosure, in order to improve the accuracy of the classification model, N cascaded feature extraction models may be included in the classification model, where N is a positive integer.
Then, when the final prediction category of each target detection frame is obtained based on the classification model, the first feature extraction model is first taken as the target feature extraction model to extract the image features of the target detection frame. Next, a current prediction category of the target detection frame and a first confidence of that category are obtained based on all image features of the target detection frame extracted so far. If the first confidence satisfies the second threshold condition, the next feature extraction model in the cascade becomes the target feature extraction model, its image features are extracted, and the procedure returns to A2, until the target feature extraction model is the Nth feature extraction model. If the first confidence does not satisfy the second threshold condition, the current prediction category is taken as the final prediction category of the target detection frame.
For example, assuming that the classification model includes 3 cascaded feature extraction models, for the target detection frame a1, the image feature of the target detection frame a1 corresponding to the region in the target image, that is, the image feature of the target detection frame a1, may be extracted by the first feature extraction model, and assumed as the image feature set Q1, and then the current prediction class d1 of the target detection frame a1 and the first confidence e1 of the current prediction class d1 may be obtained based on all the image features of the currently extracted target detection frame a1, that is, the image feature set Q1; if e1 satisfies the second threshold condition, the image features of the target detection box a1 may be continuously extracted by the second feature extraction model, which is assumed to be the image feature set Q2, and then the current prediction category d2 of the target detection box a1 and the first confidence e2 of the current prediction category d2 may be obtained based on all the image features of the currently extracted target detection box a1, that is, the union of the image feature set Q1 and the image feature set Q2; and if e1 does not satisfy the second threshold condition, the current prediction class d1 may be used as the final prediction class for the target detection box a 1.
If the target feature extraction model is already the Nth feature extraction model of the classification model, the image features of the target detection frame a1 that it extracts may be assumed to form the image feature set Qn. The current prediction category dn of the target detection frame a1 and the first confidence en of the current prediction category dn may then be obtained based on all image features of the target detection frame a1 extracted so far, that is, the union of the image feature sets Q1, Q2, …, Qn, and the current prediction category dn may be used as the final prediction category of the target detection frame a1 regardless of whether the first confidence en satisfies the second threshold condition.
Alternatively, in the embodiment of the present disclosure, if the prediction confidence of the target detection frame obtained by the classification model always satisfies the second threshold condition, the magnitudes of the respective confidences may also be compared, so as to take the prediction category with the highest first confidence as the final prediction category of the corresponding target detection frame. For example, for the above-mentioned target detection frame a1, if the first confidences e1, e2, …, and en all satisfy the second threshold condition, the maximum value e_max of e1, e2, …, and en may be obtained, and the prediction category corresponding to e_max may be taken as the final prediction category of the target detection frame a1.
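The early-exit procedure described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the extractor callables, the classifier, and the threshold-condition predicate are hypothetical stand-ins for the N cascaded feature extraction models, the classification head, and the second threshold condition.

```python
def cascade_classify(box_image, extractors, classify, satisfies):
    """Early-exit cascade over N feature extraction models (illustrative).

    Each element of `extractors` returns one stage's partial features for
    the detection frame, `classify` maps ALL features gathered so far to a
    (category, confidence) pair, and `satisfies` models the second
    threshold condition: while it holds, the next cascaded model is
    consulted; otherwise the current prediction becomes the final one.
    """
    features = []
    category, confidence = None, 0.0
    for i, extract in enumerate(extractors):
        features.extend(extract(box_image))        # add this stage's partial features
        category, confidence = classify(features)  # re-classify on the union so far
        if i == len(extractors) - 1 or not satisfies(confidence):
            break  # Nth model reached, or confidence no longer satisfies the condition
    return category, confidence
```

With a predicate that stops being satisfied at the second stage, only two of the three extractors run; with one that is always satisfied, all N stages run and the Nth prediction is returned, matching the behavior described for en above.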
The second threshold condition may be preset according to a requirement, and the second threshold condition may be the same as or different from the first threshold condition, which is not limited in this embodiment of the disclosure.
Referring to fig. 2, in the embodiment of the present disclosure, before step S14, the method may further include:
step S15, acquiring a plurality of training sample images which are manually calibrated, and performing data enhancement on the training sample images to obtain a training sample image set;
and step S16, training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely consistent.
In the embodiment of the present disclosure, in order to improve the accuracy of the classification model, the classification model may be trained through a plurality of training sample images that are manually calibrated, and in order to improve the diversity of the training sample images and reduce the workload of manual calibration, data enhancement may be performed on the obtained plurality of training sample images that are manually calibrated, so as to obtain a final training sample image set.
The manually calibrated training sample images may be subjected to data enhancement in any available manner, and the embodiment of the present disclosure is not limited thereto. For example, data enhancement of the training sample images may be performed by geometric transformations (e.g., flipping, translation, rotation, scaling, etc.), adding noise, generative adversarial networks, randomly adjusting brightness and/or contrast, cropping, padding, blurring, etc.
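A few of the geometric transformations named above can be sketched on plain nested-list image grids. This is an illustrative toy, not the disclosure's pipeline; real systems would typically operate on tensors or image objects, and the function names here are invented for the example.

```python
def hflip(img):
    """Horizontal flip: mirror each row of the image grid."""
    return [row[::-1] for row in img]

def vflip(img):
    """Vertical flip: reverse the row order."""
    return img[::-1]

def rotate90(img):
    """Rotate the image grid 90 degrees clockwise."""
    return [list(row) for row in zip(*img[::-1])]

def augment(samples):
    """Expand (image, label) pairs with transformed copies; the manually
    calibrated label is reused unchanged for every derived image, which is
    how augmentation reduces the manual calibration workload."""
    transforms = (lambda x: x, hflip, vflip, rotate90)
    return [(t(img), label) for img, label in samples for t in transforms]
```

Each calibrated sample yields four training images here, so the training sample image set grows without any additional manual labeling.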
Then, the classification model is trained according to the training sample image set. In order to ensure that the feature extraction effects of the feature extraction models in the classification model are not completely consistent, so as to improve the diversity and comprehensiveness of the image features extracted by the classification model, the training sample images corresponding to any two feature extraction models in the classification model may be set to be not completely consistent.
Referring to fig. 2, in an embodiment of the present disclosure, the step S11 may further include: scaling the target image into a first image with a preset size, and acquiring a feature map of the first image through a preset feature extraction network.
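The scaling step can be illustrated with a minimal nearest-neighbor resize on a 2-D grid. This is an assumption for illustration only: the disclosure does not specify an interpolation method, and practical systems more often use bilinear resampling.

```python
def resize_nearest(img, out_h, out_w):
    """Scale a 2-D image grid to a preset (out_h, out_w) size with
    nearest-neighbor sampling, so every target image reaches the preset
    feature extraction network at the same resolution."""
    h, w = len(img), len(img[0])
    return [[img[i * h // out_h][j * w // out_w] for j in range(out_w)]
            for i in range(out_h)]
```

Fixing the input size this way trades a small amount of detail on large images for a predictable, bounded detection cost, which is the speed/accuracy balance mentioned later in the text.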
Referring to fig. 2, in the embodiment of the present disclosure, in the case that the image detection model is a two-stage fast rcnn model, step S12 may further include:
step S121, inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image;
step S122, cropping a feature region corresponding to the suggested detection frame from the feature map;
step S123, inputting the feature region into a pooling layer of the image detection model, so as to perform category regression and detection frame regression on the feature region, and obtain a first detection result of the target image.
Taking a two-stage fast rcnn model as an example, the features of the target image may first be extracted to obtain a feature map of the target image. After the feature map of the target image is obtained, the feature map may be input into a region generation network (Region Proposal Network, RPN) of the image detection model, so as to perform category regression and detection frame regression on the feature map, thereby obtaining a suggested detection frame of the target image.
In turn, the feature regions corresponding to the suggested detection frames may be cropped from the feature map and input into a pooling (RoI pooling) layer, so as to perform category regression and detection frame regression on the respective feature regions.
The region generation network can be understood as belonging to the first-stage model in the two-stage fast rcnn model, and the pooling layer can be understood as belonging to the second-stage model in the two-stage fast rcnn model.
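The role of the pooling layer — turning variable-size feature regions into fixed-size features for the second-stage regressors — can be sketched as a single-channel max pool. This is a deliberately simplified illustration with integer box coordinates; real RoI pooling operates on multi-channel tensors with sub-pixel bin boundaries.

```python
def roi_max_pool(feature_map, box, out_h, out_w):
    """Crop box = (y0, x0, y1, x1) from a single-channel 2-D feature map
    and max-pool it onto a fixed (out_h, out_w) grid, so suggested
    detection frames of different sizes all yield same-size features for
    the second-stage category regression and detection frame regression."""
    y0, x0, y1, x1 = box
    region = [row[x0:x1] for row in feature_map[y0:y1]]
    h, w = len(region), len(region[0])
    pooled = []
    for i in range(out_h):
        # each output cell covers one bin of the region; bins are at least 1 wide
        ys, ye = i * h // out_h, max((i + 1) * h // out_h, i * h // out_h + 1)
        row = []
        for j in range(out_w):
            xs, xe = j * w // out_w, max((j + 1) * w // out_w, j * w // out_w + 1)
            row.append(max(region[y][x] for y in range(ys, ye) for x in range(xs, xe)))
        pooled.append(row)
    return pooled
```

Because the output grid size is fixed regardless of the box size, every suggested detection frame feeds the same downstream regression heads.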
Optionally, in an embodiment of the present disclosure, the step S121 further includes:
step T1, inputting the feature map into a region generation network of the image detection model, so as to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and T2, performing non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
In addition, in practical application, some inaccurate detection frames may be included among the suggested detection frames obtained after the category regression and detection frame regression performed based on the RPN, which affects the accuracy of the image detection result. Therefore, in the embodiment of the present disclosure, in order to further improve the accuracy of the image detection result, Non-Maximum Suppression (NMS) may be performed on the initial detection frames of the target image obtained after performing category regression and detection frame regression through the region generation network, so as to obtain the final suggested detection frames. Furthermore, non-maximum suppression may be performed in any available manner, and the embodiment of the present disclosure is not limited thereto.
In the embodiment of the present disclosure, the classification model includes N cascaded feature extraction models, where N is a positive integer. When the final prediction category of the target detection frame is acquired, partial image features of the target detection frame may be extracted by adding one feature extraction model each time, and the current prediction category of the target detection frame and the first confidence of the current prediction category may be acquired based on all image features of the target detection frame extracted so far, until the target feature extraction model is the Nth feature extraction model or the current confidence does not satisfy the second threshold condition. Thereby, the accuracy of the image detection result can be further improved.
Moreover, in the embodiment of the disclosure, a plurality of training sample images which are manually calibrated can be obtained, and data enhancement is performed on the training sample images to obtain a training sample image set; and training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely consistent. Thereby improving the accuracy of the classification model.
In addition, in the embodiment of the present disclosure, the target image may be scaled to a first image with a preset size, and a feature map of the first image is acquired through a preset feature extraction network, so as to seek a balance between a detection speed and a detection accuracy.
Further, in the embodiment of the present disclosure, the feature map may be input into a region generation network of the image detection model, so as to perform category regression and detection frame regression on the feature map and obtain a suggested detection frame of the target image; a feature region corresponding to the suggested detection frame is cropped from the feature map; and the feature region is input into a pooling layer of the image detection model to perform category regression and detection frame regression on the feature region, so as to obtain a first detection result of the target image. Moreover, the feature map may be input into the region generation network of the image detection model to perform category regression and detection frame regression on the feature map and obtain an initial detection frame of the target image, and non-maximum suppression may be performed on the initial detection frame to obtain the suggested detection frame. In this way, the accuracy of the output result of the image detection model can be further improved, and the accuracy of the image detection result can be improved while the workload of the classification model is reduced.
Fig. 3 is a block diagram illustrating an image detection apparatus according to an exemplary embodiment. Referring to fig. 3, the apparatus includes a feature map acquisition module 21, a first image detection module 22, an object detection frame acquisition module 23, and a second image detection module 24.
A feature map acquisition module 21 configured to perform acquiring a feature map of the target image.
A first image detection module 22 configured to perform obtaining, according to the feature map, a first detection result of the target image through an image detection model, where the first detection result includes a detection frame included in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category.
And the target detection frame acquisition module 23 is configured to execute, in response to that the confidence of the first prediction category satisfies a first threshold condition, acquiring a target detection frame corresponding to the first prediction category.
The second image detection module 24 is configured to extract image features of the target detection frame through a preset classification model, and obtain a final prediction category of the target detection frame based on the image features; wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
In the embodiment of the disclosure, a feature map of a target image is obtained; according to the feature map, obtaining a first detection result of the target image through an image detection model, wherein the first detection result comprises a detection frame contained in the target image, a first prediction type of the detection frame and a confidence coefficient of the first prediction type; responding to the confidence coefficient of the first prediction category meeting a first threshold condition, and acquiring a target detection frame corresponding to the first prediction category; extracting image features of the target detection frame through a preset classification model, and acquiring a final prediction category of the target detection frame based on the image features; wherein the classification model is independent of the image detection model and is cascaded with the image detection model. Therefore, the method has the advantages of improving the accuracy of the image detection result and the applicability of the image detection scheme.
Referring to fig. 4, in the embodiment of the present disclosure, the classification model includes N cascaded feature extraction models, where N is a positive integer, and the second image detection module 24 further includes:
a first image feature extraction sub-module 241, configured to extract the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model according to the cascade order of the feature extraction models;
a detection frame classification sub-module 242 configured to perform obtaining a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted currently;
a second image feature extraction sub-module 243, configured to perform, in response to the first confidence satisfying a second threshold condition, taking the next feature extraction model after the target feature extraction model as the new target feature extraction model, extracting image features of the target detection frame through the target feature extraction model, and then entering the detection frame classification sub-module 242 until the target feature extraction model is the Nth feature extraction model;
a final prediction category validation sub-module 244 configured to perform, in response to the first confidence not satisfying a second threshold condition, taking the current prediction category as a final prediction category of the target detection box.
Referring to fig. 4, in an embodiment of the present disclosure, the image detection apparatus may further include:
the training sample image obtaining module 25 is configured to obtain a plurality of training sample images that are manually calibrated, and perform data enhancement on the training sample images to obtain a training sample image set.
And the classification model training module 26 is configured to execute training of the classification model according to the training sample image set, and training sample images corresponding to any two feature extraction models are not completely consistent.
Optionally, in this embodiment of the present disclosure, the feature map obtaining module may be further configured to scale the target image into a first image with a preset size, and obtain the feature map of the first image through a preset feature extraction network.
Referring to fig. 4, in the embodiment of the present disclosure, in the case that the image detection model is a two-stage fast rcnn model, the first image detection module 22 may further include:
a suggested detection frame obtaining sub-module 221, configured to input the feature map into the region generation network of the image detection model, so as to perform category regression and detection frame regression on the feature map and obtain a suggested detection frame of the target image;
a feature region cropping sub-module 222, configured to crop a feature region corresponding to the suggested detection frame from the feature map;
a detection result obtaining sub-module 223, configured to input the feature region into the pooling layer of the image detection model, so as to perform category regression and detection frame regression on the feature region and obtain a first detection result of the target image.
Optionally, in this embodiment of the present disclosure, the suggested detection frame obtaining sub-module may further include:
an initial detection frame obtaining unit, configured to input the feature map into the region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and a suggested detection frame obtaining unit, configured to perform non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
In the embodiment of the present disclosure, the classification model includes N cascaded feature extraction models, where N is a positive integer. When the final prediction category of the target detection frame is acquired, partial image features of the target detection frame may be extracted by adding one feature extraction model each time, and the current prediction category of the target detection frame and the first confidence of the current prediction category may be acquired based on all image features of the target detection frame extracted so far, until the target feature extraction model is the Nth feature extraction model or the current confidence does not satisfy the second threshold condition. Thereby, the accuracy of the image detection result can be further improved.
Moreover, in the embodiment of the disclosure, a plurality of training sample images which are manually calibrated can be obtained, and data enhancement is performed on the training sample images to obtain a training sample image set; and training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely consistent. Thereby improving the accuracy of the classification model.
In addition, in the embodiment of the present disclosure, the target image may be scaled to a first image with a preset size, and a feature map of the first image is acquired through a preset feature extraction network, so as to seek a balance between a detection speed and a detection accuracy.
Further, in the embodiment of the present disclosure, the feature map may be input into a region generation network of the image detection model, so as to perform category regression and detection frame regression on the feature map and obtain a suggested detection frame of the target image; a feature region corresponding to the suggested detection frame is cropped from the feature map; and the feature region is input into a pooling layer of the image detection model to perform category regression and detection frame regression on the feature region, so as to obtain a first detection result of the target image. Moreover, the feature map may be input into the region generation network of the image detection model to perform category regression and detection frame regression on the feature map and obtain an initial detection frame of the target image, and non-maximum suppression may be performed on the initial detection frame to obtain the suggested detection frame. In this way, the accuracy of the output result of the image detection model can be further improved, and the accuracy of the image detection result can be improved while the workload of the classification model is reduced.
With regard to the apparatus in the above-described embodiment, the specific manner in which each module performs the operation has been described in detail in the embodiment related to the method, and will not be elaborated here.
Fig. 5 is a block diagram illustrating an apparatus 300 for image detection according to an exemplary embodiment. For example, the apparatus 300 may be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet device, a medical device, an exercise device, a personal digital assistant, and the like.
Referring to fig. 5, the apparatus 300 may include one or more of the following components: a processing component 302, a memory 304, a power component 306, a multimedia component 308, an audio component 310, an input/output (I/O) interface 312, a sensor component 314, and a communication component 316.
The processing component 302 generally controls overall operation of the device 300, such as operations associated with display, telephone calls, data communications, camera operations, and recording operations. The processing components 302 may include one or more processors 320 to execute instructions to perform all or a portion of the steps of the methods described above. Further, the processing component 302 can include one or more modules that facilitate interaction between the processing component 302 and other components. For example, the processing component 302 may include a multimedia module to facilitate interaction between the multimedia component 308 and the processing component 302.
The memory 304 is configured to store various types of data to support operations at the device 300. Examples of such data include instructions for any application or method operating on device 300, contact data, phonebook data, messages, pictures, videos, and so forth. The memory 304 may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
The power supply component 306 provides power to the various components of the device 300. The power components 306 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the apparatus 300.
The multimedia component 308 includes a screen that provides an output interface between the device 300 and a user. In some embodiments, the screen may include a Liquid Crystal Display (LCD) and a Touch Panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive an input signal from a user. The touch panel includes one or more touch sensors to sense touch, slide, and gestures on the touch panel. The touch sensor may not only sense the boundary of a touch or slide action, but also detect the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 308 includes a front facing camera and/or a rear facing camera. The front camera and/or the rear camera may receive external multimedia data when the device 300 is in an operating mode, such as a shooting mode or a video mode. Each front camera and rear camera may be a fixed optical lens system or have a focal length and optical zoom capability.
The audio component 310 is configured to output and/or input audio signals. For example, audio component 310 includes a Microphone (MIC) configured to receive external audio signals when apparatus 300 is in an operating mode, such as a call mode, a recording mode, and a voice recognition mode. The received audio signals may further be stored in the memory 304 or transmitted via the communication component 316. In some embodiments, audio component 310 also includes a speaker for outputting audio signals.
The I/O interface 312 provides an interface between the processing component 302 and peripheral interface modules, which may be keyboards, click wheels, buttons, etc. These buttons may include, but are not limited to: a home button, a volume button, a start button, and a lock button.
The sensor assembly 314 includes one or more sensors for providing various aspects of status assessment for the device 300. For example, sensor assembly 314 may detect an open/closed state of device 300, the relative positioning of components, such as a display and keypad of apparatus 300, the change in position of apparatus 300 or a component of apparatus 300, the presence or absence of user contact with apparatus 300, the orientation or acceleration/deceleration of apparatus 300, and the change in temperature of apparatus 300. Sensor assembly 314 may include a proximity sensor configured to detect the presence of a nearby object without any physical contact. The sensor assembly 314 may also include a light sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor assembly 314 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 316 is configured to facilitate wired or wireless communication between the apparatus 300 and other devices. The apparatus 300 may access a wireless network based on a communication standard, such as WiFi, an operator network (such as 2G, 3G, 4G, or 5G), or a combination thereof. In an exemplary embodiment, the communication component 316 receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component 316 further includes a Near Field Communication (NFC) module to facilitate short-range communications. For example, the NFC module may be implemented based on Radio Frequency Identification (RFID) technology, infrared data association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the apparatus 300 may be implemented by one or more Application Specific Integrated Circuits (ASICs), Digital Signal Processors (DSPs), Digital Signal Processing Devices (DSPDs), Programmable Logic Devices (PLDs), Field Programmable Gate Arrays (FPGAs), controllers, micro-controllers, microprocessors or other electronic components for performing the above-described methods.
In an exemplary embodiment, a storage medium comprising instructions, such as the memory 304 comprising instructions, executable by the processor 320 of the apparatus 300 to perform the method described above is also provided. Alternatively, the storage medium may be a non-transitory computer readable storage medium, which may be, for example, a ROM, a Random Access Memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, and the like.
Fig. 6 is a block diagram illustrating an apparatus 400 for image detection according to an exemplary embodiment. For example, the apparatus 400 may be provided as a server. Referring to fig. 6, the apparatus 400 includes a processing component 422, which further includes one or more processors, and memory resources, represented by a memory 432, for storing instructions, such as application programs, executable by the processing component 422. The application programs stored in the memory 432 may include one or more modules, each of which corresponds to a set of instructions. Further, the processing component 422 is configured to execute the instructions to perform the above-described method.
The apparatus 400 may also include a power component 426 configured to perform power management of the apparatus 400, a wired or wireless network interface 450 configured to connect the apparatus 400 to a network, and an input/output (I/O) interface 458. The apparatus 400 may operate based on an operating system stored in the memory 432, such as Windows Server™, Mac OS X™, Unix™, Linux™, FreeBSD™, and the like.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It will be understood that the present disclosure is not limited to the precise arrangements described above and shown in the drawings and that various modifications and changes may be made without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.
The present disclosure discloses B1. An image detection method, comprising:
acquiring a characteristic diagram of a target image;
according to the feature map, obtaining a first detection result of the target image through an image detection model, wherein the first detection result comprises a detection frame contained in the target image, a first prediction type of the detection frame and a confidence coefficient of the first prediction type;
responding to the confidence coefficient of the first prediction category meeting a first threshold condition, and acquiring a target detection frame corresponding to the first prediction category;
extracting image features of the target detection frame through a preset classification model, and acquiring a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
B2. The method according to B1, wherein the classification model includes N cascaded feature extraction models, where N is a positive integer, and the step of extracting the image feature of the target detection frame through a preset classification model and obtaining the final prediction category of the target detection frame based on the image feature includes:
a1, extracting the image features of the target detection frame by taking the first feature extraction model as a target feature extraction module according to the cascade order of all feature extraction models;
a2, acquiring a current prediction category of the target detection frame and a first confidence coefficient of the current prediction category based on all image features of the target detection frame extracted currently;
a3, in response to the first confidence coefficient meeting a second threshold condition, taking the next feature extraction model of the target feature extraction module as a target feature extraction model, extracting the image features of the target detection frame through the target feature extraction model, and then entering A2 until the target feature extraction module is the Nth feature extraction model;
a4, in response to the first confidence coefficient not meeting a second threshold condition, taking the current prediction category as a final prediction category of the target detection box.
B3. The method according to B2, further comprising, before the step of extracting image features of the target detection frame through a preset classification model and obtaining a final predicted category of the target detection frame based on the image features, the step of:
acquiring a plurality of training sample images which are manually calibrated, and performing data enhancement on the training sample images to obtain a training sample image set;
and training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely consistent.
B4. The method of B1, wherein the step of obtaining the feature map of the target image comprises:
and zooming the target image into a first image with a preset size, and acquiring a feature map of the first image through a preset feature extraction network.
B5. The method according to B1, wherein, in a case that the image detection model is a two-stage fast rcnn model, the step of obtaining a first detection result of the target image according to the feature map through an image detection model includes:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image;
cropping, from the feature map, a feature region corresponding to the suggested detection frame;
and inputting the characteristic region into a pooling layer of the image detection model to perform category regression and detection frame regression on the characteristic region to obtain a first detection result of the target image.
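The cropping step in B5 — cutting the feature region matching a suggested detection frame out of the feature map before the pooling layer — can be sketched as:

```python
import numpy as np

def crop_feature_region(feature_map, box, stride=16):
    """Sketch of the B5 cropping step.  `box` is (x1, y1, x2, y2) in image
    pixels; dividing by the backbone stride to map pixel coordinates to
    feature-map cells is an assumption about the coordinate convention,
    which the patent does not spell out."""
    x1, y1 = int(box[0] // stride), int(box[1] // stride)          # floor for the near corner
    x2, y2 = int(-(-box[2] // stride)), int(-(-box[3] // stride))  # ceil for the far corner
    return feature_map[y1:y2, x1:x2]
```

In a real Faster R-CNN this crop is followed by RoI pooling, which resamples the variably sized region to a fixed grid before the category and detection frame regressions.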
B6. The method according to B5, wherein the step of inputting the feature map into the area generation network of the image detection model to perform category regression and detection box regression on the feature map to obtain a suggested detection box of the target image includes:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and carrying out non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
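The non-maximum suppression step in B6 is standard and can be sketched directly; boxes are (x1, y1, x2, y2) arrays and the 0.5 IoU threshold is an illustrative default, not a value from the patent.

```python
import numpy as np

def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Standard NMS, as applied in B6 to reduce the initial detection
    frames from the region generation network to the suggested detection
    frames.  Returns the indices of the kept frames."""
    order = np.argsort(scores)[::-1]        # highest confidence first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))                 # best remaining frame survives
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[rest, 2] - boxes[rest, 0]) * (boxes[rest, 3] - boxes[rest, 1])
        iou = inter / (area_i + areas - inter)
        order = rest[iou <= iou_threshold]  # drop frames overlapping the survivor too much
    return keep
```

Two heavily overlapping initial frames thus collapse into one suggested frame, while a distant frame survives untouched.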
The present disclosure also discloses C7. An image detection apparatus, comprising:
a feature map acquisition module configured to acquire a feature map of a target image;
a first image detection module configured to obtain, according to the feature map, a first detection result of the target image through an image detection model, where the first detection result includes a detection frame contained in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category;
a target detection frame acquisition module configured to acquire, in response to the confidence of the first prediction category satisfying a first threshold condition, a target detection frame corresponding to the first prediction category;
a second image detection module configured to extract image features of the target detection frame through a preset classification model and obtain a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
C8. The apparatus according to C7, wherein the classification model includes N cascaded feature extraction models, where N is a positive integer, and the second image detection module includes:
a first image feature extraction sub-module configured to extract the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model, according to the cascade order of the feature extraction models;
a detection frame classification sub-module configured to obtain a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
the first image feature extraction sub-module is further configured to, in response to the first confidence meeting a second threshold condition, take the feature extraction model following the target feature extraction model as the new target feature extraction model, extract the image features of the target detection frame through the new target feature extraction model, and then return to the detection frame classification sub-module, until the target feature extraction model is the Nth feature extraction model;
a final prediction category confirmation sub-module configured to, in response to the first confidence not meeting the second threshold condition, take the current prediction category as the final prediction category of the target detection frame.
C9. The apparatus of C8, the apparatus further comprising:
a training sample image acquisition module configured to acquire a plurality of manually annotated training sample images and perform data enhancement on the training sample images to obtain a training sample image set;
and a classification model training module configured to train the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely identical.
C10. The apparatus according to C7, wherein the feature map acquisition module is further configured to scale the target image into a first image of a preset size, and acquire the feature map of the first image through a preset feature extraction network.
C11. The apparatus according to C7, wherein, in the case that the image detection model is a two-stage Faster R-CNN model, the first image detection module comprises:
a suggested detection frame acquisition sub-module configured to input the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image;
a feature region cropping sub-module configured to crop, from the feature map, a feature region corresponding to the suggested detection frame;
and a detection result acquisition sub-module configured to input the feature region into a pooling layer of the image detection model to perform category regression and detection frame regression on the feature region to obtain a first detection result of the target image.
C12. The apparatus of C11, wherein the suggested detection frame acquisition sub-module comprises:
an initial detection frame acquisition unit configured to input the feature map into the region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and a suggested detection frame acquisition unit configured to perform non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
The present disclosure also discloses D13. An image detection apparatus, comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image detection method of any one of B1-B6.
The present disclosure also discloses E14. A storage medium, wherein instructions in the storage medium, when executed by a processor of an image detection apparatus, enable the image detection apparatus to perform the image detection method according to any one of B1 to B6.

Claims (10)

1. An image detection method, comprising:
acquiring a characteristic diagram of a target image;
according to the feature map, obtaining a first detection result of the target image through an image detection model, wherein the first detection result comprises a detection frame contained in the target image, a first prediction type of the detection frame and a confidence coefficient of the first prediction type;
responding to the confidence coefficient of the first prediction category meeting a first threshold condition, and acquiring a target detection frame corresponding to the first prediction category;
extracting image features of the target detection frame through a preset classification model, and acquiring a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
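The gating step of claim 1 — passing only sufficiently confident detection frames from the detection model to the independent, cascaded classification model — can be sketched as below. The 0.3 cut-off and the `select_target_frames` helper are illustrative assumptions; the patent only requires that the confidence "satisfy a first threshold condition".

```python
def select_target_frames(first_detection_result, first_threshold=0.3):
    """Sketch of the claim-1 gating step: out of the first detection result
    (detection frame, first prediction category, confidence), only frames
    whose confidence meets the first threshold condition become target
    detection frames for the cascaded classification model."""
    return [(frame, category)
            for frame, category, confidence in first_detection_result
            if confidence >= first_threshold]
```

The design rationale is that the second-stage classifier is comparatively expensive, so it only refines predictions the detector already considers plausible.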
2. The method according to claim 1, wherein the classification model includes N cascaded feature extraction models, where N is a positive integer, and the step of extracting the image features of the target detection frame through a preset classification model and obtaining the final prediction category of the target detection frame based on the image features includes:
A1, extracting the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model, according to the cascade order of the feature extraction models;
A2, acquiring a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
A3, in response to the first confidence meeting a second threshold condition, taking the feature extraction model following the target feature extraction model as the new target feature extraction model, extracting the image features of the target detection frame through the new target feature extraction model, and then returning to A2, until the target feature extraction model is the Nth feature extraction model;
A4, in response to the first confidence not meeting the second threshold condition, taking the current prediction category as the final prediction category of the target detection frame.
3. The method according to claim 2, further comprising, before the step of extracting image features of the target detection frame through a preset classification model and obtaining a final prediction category of the target detection frame based on the image features:
acquiring a plurality of manually annotated training sample images, and performing data enhancement on the training sample images to obtain a training sample image set;
and training the classification model according to the training sample image set, wherein the training sample images corresponding to any two feature extraction models are not completely identical.
4. The method of claim 1, wherein the step of obtaining the feature map of the target image comprises:
and scaling the target image into a first image of a preset size, and acquiring the feature map of the first image through a preset feature extraction network.
5. The method according to claim 1, wherein, in the case that the image detection model is a two-stage Faster R-CNN model, the step of obtaining the first detection result of the target image through the image detection model according to the feature map comprises:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain a suggested detection frame of the target image;
cropping, from the feature map, a feature region corresponding to the suggested detection frame;
and inputting the feature region into a pooling layer of the image detection model to perform category regression and detection frame regression on the feature region to obtain the first detection result of the target image.
6. The method according to claim 5, wherein the step of inputting the feature map into the region generation network of the image detection model to perform category regression and detection box regression on the feature map to obtain the suggested detection box of the target image comprises:
inputting the feature map into a region generation network of the image detection model to perform category regression and detection frame regression on the feature map to obtain an initial detection frame of the target image;
and carrying out non-maximum suppression on the initial detection frame to obtain the suggested detection frame.
7. An image detection apparatus, characterized by comprising:
a feature map acquisition module configured to acquire a feature map of a target image;
a first image detection module configured to obtain, according to the feature map, a first detection result of the target image through an image detection model, where the first detection result includes a detection frame contained in the target image, a first prediction category of the detection frame, and a confidence of the first prediction category;
a target detection frame acquisition module configured to acquire, in response to the confidence of the first prediction category satisfying a first threshold condition, a target detection frame corresponding to the first prediction category;
a second image detection module configured to extract image features of the target detection frame through a preset classification model and obtain a final prediction category of the target detection frame based on the image features;
wherein the classification model is independent of the image detection model and is cascaded with the image detection model.
8. The apparatus of claim 7, wherein the classification model comprises N cascaded feature extraction models, N being a positive integer, and the second image detection module comprises:
a first image feature extraction sub-module configured to extract the image features of the target detection frame by taking the first feature extraction model as the target feature extraction model, according to the cascade order of the feature extraction models;
a detection frame classification sub-module configured to obtain a current prediction category of the target detection frame and a first confidence of the current prediction category based on all image features of the target detection frame extracted so far;
the first image feature extraction sub-module is further configured to, in response to the first confidence meeting a second threshold condition, take the feature extraction model following the target feature extraction model as the new target feature extraction model, extract the image features of the target detection frame through the new target feature extraction model, and then return to the detection frame classification sub-module, until the target feature extraction model is the Nth feature extraction model;
a final prediction category confirmation sub-module configured to, in response to the first confidence not meeting the second threshold condition, take the current prediction category as the final prediction category of the target detection frame.
9. An image detection apparatus, characterized by comprising:
a processor;
a memory for storing the processor-executable instructions;
wherein the processor is configured to execute the instructions to implement the image detection method of any one of claims 1 to 6.
10. A storage medium in which instructions, when executed by a processor of an image detection apparatus, enable the image detection apparatus to perform the image detection method according to any one of claims 1 to 6.
CN201910741273.4A 2019-08-12 2019-08-12 Image detection method, device and storage medium Active CN110619350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910741273.4A CN110619350B (en) 2019-08-12 2019-08-12 Image detection method, device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910741273.4A CN110619350B (en) 2019-08-12 2019-08-12 Image detection method, device and storage medium

Publications (2)

Publication Number Publication Date
CN110619350A true CN110619350A (en) 2019-12-27
CN110619350B CN110619350B (en) 2021-06-18

Family

ID=68921797

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910741273.4A Active CN110619350B (en) 2019-08-12 2019-08-12 Image detection method, device and storage medium

Country Status (1)

Country Link
CN (1) CN110619350B (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222558A (en) * 2019-12-31 2020-06-02 河南裕展精密科技有限公司 Image processing method and storage medium
CN111368636A (en) * 2020-02-07 2020-07-03 深圳奇迹智慧网络有限公司 Object classification method and device, computer equipment and storage medium
CN111680753A (en) * 2020-06-10 2020-09-18 创新奇智(上海)科技有限公司 Data labeling method and device, electronic equipment and storage medium
CN112561865A (en) * 2020-12-04 2021-03-26 深圳格瑞健康管理有限公司 Constant molar position detection model training method, system and storage medium
CN113743404A (en) * 2021-09-06 2021-12-03 中国计量大学 Intelligent garbage collection and transportation vehicle garbage image classification method
CN114419522A (en) * 2022-03-29 2022-04-29 以萨技术股份有限公司 Target object structured analysis method, device and equipment
WO2022105336A1 (en) * 2020-11-23 2022-05-27 北京达佳互联信息技术有限公司 Image classification method and electronic device
CN115775341A (en) * 2023-02-13 2023-03-10 广州海昇计算机科技有限公司 Method and system for detecting state of experimental equipment

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106355188A (en) * 2015-07-13 2017-01-25 阿里巴巴集团控股有限公司 Image detection method and device
CN106557778A (en) * 2016-06-17 2017-04-05 北京市商汤科技开发有限公司 Generic object detection method and device, data processing equipment and terminal device
US20170147905A1 (en) * 2015-11-25 2017-05-25 Baidu Usa Llc Systems and methods for end-to-end object detection
CN107368845A (en) * 2017-06-15 2017-11-21 华南理工大学 A kind of Faster R CNN object detection methods based on optimization candidate region
CN107844785A (en) * 2017-12-08 2018-03-27 浙江捷尚视觉科技股份有限公司 A kind of method for detecting human face based on size estimation
CN107992841A (en) * 2017-12-13 2018-05-04 北京小米移动软件有限公司 The method and device of identification objects in images, electronic equipment, readable storage medium storing program for executing
CN108108669A (en) * 2017-12-01 2018-06-01 中国科学院重庆绿色智能技术研究院 A kind of facial characteristics analytic method based on notable subregion
CN109214389A (en) * 2018-09-21 2019-01-15 上海小萌科技有限公司 A kind of target identification method, computer installation and readable storage medium storing program for executing
US20190130191A1 (en) * 2017-10-30 2019-05-02 Qualcomm Incorporated Bounding box smoothing for object tracking in a video analytics system
CN109886286A (en) * 2019-01-03 2019-06-14 武汉精测电子集团股份有限公司 Object detection method, target detection model and system based on cascade detectors


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WU Zhiyang et al., "Improved real-time face detection algorithm with multi-target regression", Computer Engineering and Applications *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111222558A (en) * 2019-12-31 2020-06-02 河南裕展精密科技有限公司 Image processing method and storage medium
CN111222558B (en) * 2019-12-31 2024-01-30 富联裕展科技(河南)有限公司 Image processing method and storage medium
CN111368636A (en) * 2020-02-07 2020-07-03 深圳奇迹智慧网络有限公司 Object classification method and device, computer equipment and storage medium
CN111368636B (en) * 2020-02-07 2024-02-09 深圳奇迹智慧网络有限公司 Object classification method, device, computer equipment and storage medium
CN111680753A (en) * 2020-06-10 2020-09-18 创新奇智(上海)科技有限公司 Data labeling method and device, electronic equipment and storage medium
WO2022105336A1 (en) * 2020-11-23 2022-05-27 北京达佳互联信息技术有限公司 Image classification method and electronic device
CN112561865A (en) * 2020-12-04 2021-03-26 深圳格瑞健康管理有限公司 Constant molar position detection model training method, system and storage medium
CN112561865B (en) * 2020-12-04 2024-03-12 深圳格瑞健康科技有限公司 Method, system and storage medium for training detection model of constant molar position
CN113743404A (en) * 2021-09-06 2021-12-03 中国计量大学 Intelligent garbage collection and transportation vehicle garbage image classification method
CN113743404B (en) * 2021-09-06 2023-06-30 中国计量大学 Intelligent garbage collection and transportation vehicle garbage image classification method
CN114419522A (en) * 2022-03-29 2022-04-29 以萨技术股份有限公司 Target object structured analysis method, device and equipment
CN115775341A (en) * 2023-02-13 2023-03-10 广州海昇计算机科技有限公司 Method and system for detecting state of experimental equipment

Also Published As

Publication number Publication date
CN110619350B (en) 2021-06-18

Similar Documents

Publication Publication Date Title
CN110619350B (en) Image detection method, device and storage medium
CN110443280B (en) Training method and device of image detection model and storage medium
CN109829501B (en) Image processing method and device, electronic equipment and storage medium
CN110782468B (en) Training method and device of image segmentation model and image segmentation method and device
CN106557768B (en) Method and device for recognizing characters in picture
CN106651955B (en) Method and device for positioning target object in picture
RU2577188C1 (en) Method, apparatus and device for image segmentation
CN108256555B (en) Image content identification method and device and terminal
CN109446994B (en) Gesture key point detection method and device, electronic equipment and storage medium
US10534972B2 (en) Image processing method, device and medium
CN107944447B (en) Image classification method and device
CN106228556B (en) image quality analysis method and device
CN107563994B (en) Image significance detection method and device
CN107784279B (en) Target tracking method and device
CN108062547B (en) Character detection method and device
CN109360197B (en) Image processing method and device, electronic equipment and storage medium
CN107464253B (en) Eyebrow positioning method and device
CN104077597B (en) Image classification method and device
CN105095881A (en) Method, apparatus and terminal for face identification
CN109819288B (en) Method and device for determining advertisement delivery video, electronic equipment and storage medium
CN109509195B (en) Foreground processing method and device, electronic equipment and storage medium
CN108921178B (en) Method and device for obtaining image blur degree classification and electronic equipment
CN109784164B (en) Foreground identification method and device, electronic equipment and storage medium
CN108717542B (en) Method and device for recognizing character area and computer readable storage medium
CN109034150B (en) Image processing method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant