CN111738249B - Image detection method, image detection device, electronic equipment and storage medium - Google Patents

Image detection method, image detection device, electronic equipment and storage medium

Info

Publication number
CN111738249B
CN111738249B CN202010867104.8A
Authority
CN
China
Prior art keywords
frame
image
answer
prediction
detected
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010867104.8A
Other languages
Chinese (zh)
Other versions
CN111738249A (en)
Inventor
张子浩
李兵
杨家博
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Yizhen Xuesi Education Technology Co Ltd
Original Assignee
Beijing Yizhen Xuesi Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Yizhen Xuesi Education Technology Co Ltd
Priority to CN202010867104.8A
Publication of CN111738249A
Application granted
Publication of CN111738249B


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/22 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition
    • G06V10/225 Image preprocessing by selection of a specific region containing or referencing a pattern; Locating or processing of specific regions to guide the detection or recognition based on a marking or identifier characterising the area
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/40 Extraction of image or video features

Abstract

The application provides an image detection method, an image detection device, electronic equipment and a storage medium. The specific implementation scheme is as follows: extracting the characteristics of an image to be detected to obtain a characteristic image to be detected; processing the characteristic image to be detected by utilizing the convolution layer to obtain the probability that pixel points in the image to be detected belong to a text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of a question stem frame matched with the pixel points in an answer frame; the text prediction box comprises a question stem box and an answer box; and obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel points belong to the text prediction frame, the prediction positions of the text prediction frame to which the pixel points belong and the prediction positions of the question stem frame matched with the pixel points in the answer frame. According to the embodiment of the application, the matching relation between the question stem frame and the answer frame can be detected from the image to be detected, so that the subsequent question judging process is facilitated, and the problem of missed judgment caused by the limitation of the fixed area of the answer frame can be avoided.

Description

Image detection method, image detection device, electronic equipment and storage medium
Technical Field
The present application relates to the field of information technology, and in particular, to an image detection method and apparatus, an electronic device, and a storage medium.
Background
At present, in education and teaching scenarios, most students' homework and examination papers are still graded manually. Grading homework and examination papers is labor-intensive and imposes a heavy workload on parents and teachers. In response to this situation, and with the continuous development of computing technologies and artificial intelligence technologies, artificial intelligence has gradually been applied in education and teaching scenarios. In some large-scale education scenarios, various methods and systems for automatically judging questions and automatically scoring papers have been popularized.
In related photographing question-judging or automatic question-judging systems, answer sheets filled in by students are mostly recognized by computers, and students must mark their answers or options in pencil. This invisibly increases the time cost of answering for students and easily causes filling errors and missed marks. In addition, some subjective questions require students to write their answers in a fixed area of an answer box, and a missed judgment occurs when the handwritten answer extends beyond the answer area.
Disclosure of Invention
The embodiment of the application provides an image detection method, an image detection device, electronic equipment and a storage medium, which are used for solving the problems in the related art, and the technical scheme is as follows:
in a first aspect, an embodiment of the present application provides an image detection method, including:
extracting the characteristics of the image to be detected to obtain a characteristic image to be detected;
processing the characteristic image to be detected by utilizing the convolution layer to obtain the probability that pixel points in the image to be detected belong to a text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of a question stem frame matched with the pixel points in an answer frame; the text prediction box comprises a question stem box and an answer box;
and obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel points belong to the text prediction frame, the prediction positions of the text prediction frame to which the pixel points belong and the prediction positions of the question stem frame matched with the pixel points in the answer frame.
In one embodiment, the performing feature extraction on an image to be detected to obtain a feature image to be detected includes:
inputting an image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
In one embodiment, the prediction position of the text prediction box to which the pixel point belongs includes: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
In one embodiment, the predicted positions of the stem frames matched with the pixel points in the answer frame include: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
In one embodiment, obtaining a matching relationship between a stem frame and an answer frame of an image to be detected according to a probability that a pixel point belongs to a text prediction frame, a prediction position of the text prediction frame to which the pixel point belongs, and a prediction position of the stem frame matched with the pixel point in the answer frame, includes:
obtaining a question stem frame set and an answer frame set of the image to be detected according to the probability that the pixel points belong to the text prediction frame and the prediction position of the text prediction frame to which the pixel points belong;
and obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the question stem frame set, the answer frame set and the predicted position of the question stem frame matched with the pixel points in the answer frame.
In one embodiment, obtaining a stem frame set and an answer frame set of an image to be detected according to the probability that a pixel point belongs to a text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs, includes:
determining pixel points belonging to the text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and obtaining a stem frame set and an answer frame set of the image to be detected by utilizing a non-maximum suppression algorithm according to the obtained prediction position of the text prediction frame to which the pixel point belongs.
In one embodiment, obtaining a matching relationship between a stem frame and an answer frame of an image to be detected according to a stem frame set, an answer frame set, and a predicted position of a stem frame matched with a pixel point in the answer frame includes:
according to the predicted position of the question stem frame matched with the pixel points in the answer frame, the predicted position of the question stem frame matched with the answer frame is obtained by utilizing a clustering algorithm;
calculating Euclidean distance between the predicted position of the stem frame matched with the answer frame and the central point of each stem frame in the stem frame set;
and taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
In one embodiment, the method further comprises training the convolutional layer in at least one of:
calculating the loss value of the probability of the pixel point belonging to the text prediction box by using the Dice coefficient difference function;
calculating a loss value of a prediction position of a text prediction box to which the pixel point belongs by using an intersection ratio loss function;
the loss value of the predicted position of the stem frame matching the pixel point in the answer frame is calculated by using the smooth L1 loss function.
In a second aspect, an embodiment of the present application provides an image detection apparatus, including:
the extraction unit is used for extracting the characteristics of the image to be detected to obtain a characteristic image to be detected;
the processing unit is used for processing the characteristic image to be detected by utilizing the convolution layer to obtain the probability that pixel points in the image to be detected belong to the text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of the question stem frame matched with the pixel points in the answer frame; the text prediction box comprises a question stem box and an answer box;
and the detection unit is used for obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs and the prediction position of the question stem frame matched with the pixel point in the answer frame.
In one embodiment, the extraction unit is configured to:
inputting an image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
In one embodiment, the prediction position of the text prediction box to which the pixel point belongs includes: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
In one embodiment, the predicted positions of the stem frames matched with the pixel points in the answer frame include: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
In one embodiment, the detection unit comprises:
the first detection subunit is used for obtaining a question stem frame set and an answer frame set of the image to be detected according to the probability that the pixel point belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs;
and the second detection subunit is used for obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the question stem frame set, the answer frame set and the predicted position of the question stem frame matched with the pixel points in the answer frame.
In one embodiment, the first detection subunit is configured to:
determining pixel points belonging to the text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and obtaining a stem frame set and an answer frame set of the image to be detected by utilizing a non-maximum suppression algorithm according to the obtained prediction position of the text prediction frame to which the pixel point belongs.
In one embodiment, the second detection subunit is configured to:
according to the predicted position of the question stem frame matched with the pixel points in the answer frame, the predicted position of the question stem frame matched with the answer frame is obtained by utilizing a clustering algorithm;
calculating Euclidean distance between the predicted position of the stem frame matched with the answer frame and the central point of each stem frame in the stem frame set;
and taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
In one embodiment, the apparatus further comprises a training unit for training the convolutional layer in at least one of the following ways:
calculating the loss value of the probability of the pixel point belonging to the text prediction box by using the Dice coefficient difference function;
calculating a loss value of a prediction position of a text prediction box to which the pixel point belongs by using an intersection ratio loss function;
the loss value of the predicted position of the stem frame matching the pixel point in the answer frame is calculated by using the smooth L1 loss function.
In a third aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor. Wherein the memory and the processor are in communication with each other via an internal connection path, the memory is configured to store instructions, the processor is configured to execute the instructions stored by the memory, and the processor is configured to perform the method of any of the above aspects when the processor executes the instructions stored by the memory.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects in the above technical solution at least include: the matching relation between the question stem frame and the answer frame can be detected from the image to be detected, so that the subsequent question judging process is facilitated, and the problem of missed judgment caused by the limitation of the fixed area of the answer frame can be avoided.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flow chart of an image detection method according to an embodiment of the present application;
FIG. 2 is a network architecture diagram of an image detection method according to another embodiment of the present application;
FIG. 3a is an illustration of an image to be detected according to a further embodiment of the present application;
FIG. 3b is a schematic diagram of a predicted value of a cls map corresponding to the image to be detected in FIG. 3 a;
FIG. 4 is a diagram illustrating predicted values of the rbox map of an image detection method according to another embodiment of the present application;
FIG. 5 is a diagram illustrating a prediction value of near map of an image detection method according to another embodiment of the present application;
FIG. 6 is a flow chart of an image detection method according to another embodiment of the present application;
FIG. 7 is a flow chart of an image detection method according to another embodiment of the present application;
FIG. 8 is a flow chart of an image detection method according to another embodiment of the present application;
FIG. 9 is a flow chart of a post-processing procedure of an image detection method according to another embodiment of the present application;
FIG. 10 is a schematic structural diagram of an image detection apparatus according to an embodiment of the present application;
FIG. 11 is a schematic structural diagram of a detecting unit of an image detecting apparatus according to an embodiment of the present application;
FIG. 12 is a schematic structural diagram of an image detection apparatus according to another embodiment of the present application;
FIG. 13 is a block diagram of an electronic device used to implement embodiments of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 is a flowchart of an image detection method according to an embodiment of the present application. As shown in fig. 1, the image detection method may include:
step S110, extracting the characteristics of an image to be detected to obtain a characteristic image to be detected;
step S120, processing the characteristic image to be detected by utilizing the convolution layer to obtain the probability that pixel points in the image to be detected belong to a text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of a question stem frame matched with the pixel points in an answer frame; the text prediction box comprises a question stem box and an answer box;
and step S130, obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs and the prediction position of the question stem frame matched with the pixel point in the answer frame.
In related photographing or automatic question-judging systems, answers to objective questions need to be written in a fixed area of an answer sheet, and answers to subjective questions need to be written in a fixed area of an answer frame. This mode easily leads to filling errors and missed marks. For example, a missed judgment may occur when a handwritten answer exceeds the answer area.
In view of this, the present application provides an image detection method, which can detect a matching relationship between a stem frame and an answer frame from an image to be detected. In step S110, feature extraction may be performed on the image to be detected by using the residual neural network model and the feature map pyramid network, so as to obtain a feature image to be detected. In step S120, the feature to be detected is further extracted by using different convolution layers, so as to obtain a plurality of enhanced feature images. The prediction result of each pixel point can be obtained from the enhanced feature image, and the method comprises the following steps: the probability that pixel points in the image to be detected belong to the text prediction box, the prediction positions of the text prediction box to which the pixel points belong, and the prediction positions of the matched question stem boxes corresponding to the pixel points. And in the subsequent steps, the image to be detected can be further processed by utilizing the predicted position of the question stem frame matched with the pixel point in the answer frame. In step S130, a stem frame set and an answer frame set of the image to be detected can be obtained according to the prediction result of each pixel, and further a matching relationship between the stem frame and the answer frame of the image to be detected can be obtained. In the subsequent question judging process, text Recognition can be performed on the text contents in the question stem box and the answer box by adopting an OCR (Optical Character Recognition) technology according to the matching relationship between the question stem box and the answer box.
By applying the image detection method of the embodiment of the application, in a photographing question-judging or automatic question-judging system, the matching relation between the question stem frame and the answer frame is detected from the image to be detected, and the answering content of the student in the answer frame is then judged. Detecting the matching relation between the question stem frame and the answer frame facilitates the subsequent question judging process. In addition, by applying the image detection method of the embodiment of the application, the question stem frame set, the answer frame set and the matching relation between the question stem frame and the answer frame can all be detected from the image to be detected, and students are not required to write answers in a fixed area of the answer frame when answering questions, so that the problem of missed judgment caused by the limitation of the fixed area of the answer frame can be avoided.
In an embodiment, in step S110 in fig. 1, performing feature extraction on an image to be detected to obtain a feature image to be detected, which may specifically include:
inputting an image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
In one example, the feature image to be detected may be obtained by using the network structure shown in fig. 2. First, the image to be detected is input into a residual neural network model (ResNet, Residual Network) to obtain a multilayer feature image. The residual neural network model can be used in fields such as target classification and can serve as part of the backbone neural network for computer vision tasks. Residual neural network models include resnet50, resnet101, and the like, where resnet50 denotes a residual neural network model with 50 hidden layers. In the example shown in fig. 2, the image to be detected is first input into Resnet50 for feature extraction, and a multilayer feature image is obtained. In fig. 2, the feature images in the multilayer feature image are denoted, from bottom to top, by C1, C2, C3, C4 and C5, and the sizes of the corresponding feature images are 128 × 128, 64 × 128, 32 × 256, 16 × 512 and 8 × 512, respectively. Taking feature image C2 as an example, the size of C2 is 64 × 128, where "64" indicates that the width and height of feature image C2 are both 64 pixels, and "128" indicates the depth, or number of channels, of feature image C2, that is, the feature dimension of feature image C2. By comparison, the lower-layer feature images carry less semantic information but locate targets accurately, whereas the upper-layer feature images carry rich semantic information but locate targets coarsely.
Referring to fig. 2, the multilayer feature images C1, C2, C3, C4 and C5 obtained after feature extraction by Resnet50 are processed by a Feature Pyramid Network (FPN). In the FPN, a top-down process is used to process the multilayer feature image: the more abstract, more semantic upper-layer feature image is subjected to convolution calculation and upsampling (unpooling), and the result is then spliced with the corresponding lower-layer feature image. "conv 3 × 3" in fig. 2 indicates the size of the convolution kernel used for the convolution calculation. The main purpose of the upsampling is to enlarge the image; upsampling the upper-layer feature image produces a feature image with the same size as the lower-layer feature image. The splicing operation fuses the upsampling result with the lower-layer feature image of the same size, for example by adding the corresponding elements of the feature matrices. Because the two feature images involved in the splicing operation have the same spatial size, the result of the splicing operation can make use of the lower layer's fine-grained localization information.
Referring to fig. 2, in the FPN, the upper-layer feature image C5 with the smallest size is processed first: C5 is subjected to convolution calculation and upsampling, and the result is then spliced with the corresponding lower-layer feature image C4. Similarly, the result at C4 is subjected to convolution calculation and upsampling, and then spliced with the corresponding lower-layer feature image C3. By analogy, convolution calculation and upsampling are finally performed at C2, and the result is spliced with the corresponding lower-layer feature image C1. The last splicing operation yields a fused feature image (MF) with dimensions 128 × 256. The above process is iterated until the final fused feature image is generated, i.e., the feature image to be detected used in subsequent processing. During this iteration, the features of the upper-layer feature images are enhanced, and the feature image used for prediction fuses features of different resolutions and different semantic strengths, so that detection at the corresponding resolution can be completed and each layer is ensured to have both an appropriate resolution and strong semantic features.
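The following is a minimal PyTorch sketch of the top-down fusion just described. The channel counts, the 1 × 1 lateral convolutions, nearest-neighbour upsampling and element-wise addition are illustrative assumptions consistent with the example above; the feature maps C1 to C5 are assumed to come from a ResNet-50 backbone such as torchvision's resnet50.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FPNFusion(nn.Module):
    """Top-down fusion of backbone feature maps C1..C5 into one fused map (MF)."""

    def __init__(self, channels=(64, 128, 256, 512, 512), out_ch=256):
        super().__init__()
        # 1x1 convolutions bring every level to a common channel count
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_ch, kernel_size=1) for c in channels])
        # 3x3 convolution applied after each fusion step ("conv3x3" in Fig. 2)
        self.smooth = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)

    def forward(self, feats):
        # feats = [C1, C2, C3, C4, C5]; C5 is the smallest, most semantic map
        x = self.lateral[4](feats[4])
        for i in (3, 2, 1, 0):
            x = F.interpolate(x, size=feats[i].shape[-2:], mode="nearest")  # upsample
            x = x + self.lateral[i](feats[i])   # element-wise fusion with the lower map
            x = self.smooth(x)
        return x                                 # fused feature image used for prediction
```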
Referring to fig. 2, after the fused feature image is obtained, it is further processed by different convolution layers in three branches to obtain three enhanced feature images, namely the cls map, the rbox map and the near map. For example, the sizes of the cls map, rbox map and near map are 128 × 3, 128 × 5 and 128 × 2, respectively. "2 × conv3 × 3" in fig. 2 indicates that two successive convolution layers with 3 × 3 kernels are used to further extract image features.
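Below is a hedged sketch of the three prediction branches. The 3, 5 and 2 output channels follow the text, while the ReLU activations, the exact number of layers and the final 1 × 1 projection are assumptions.

```python
import torch.nn as nn

def _branch(in_ch, out_ch):
    # "2 x conv3x3" followed by a 1x1 projection to the branch's channel count
    return nn.Sequential(
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, in_ch, 3, padding=1), nn.ReLU(inplace=True),
        nn.Conv2d(in_ch, out_ch, 1),
    )

class PredictionBranches(nn.Module):
    def __init__(self, in_ch=256):
        super().__init__()
        self.cls_branch = _branch(in_ch, 3)    # stem / answer / background scores
        self.rbox_branch = _branch(in_ch, 5)   # LL, RL, TL, BL distances + angle TT
        self.near_branch = _branch(in_ch, 2)   # (x, y) of the matched stem-box centre

    def forward(self, fused):
        # fused: the fused feature image produced by the FPN step
        return self.cls_branch(fused), self.rbox_branch(fused), self.near_branch(fused)
```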
The feature dimension of the cls map is 3. The feature dimension is also the number of channels, and the 3 channels correspond to the predicted probabilities that a pixel point in the image to be detected belongs to a text prediction frame or to the background region, namely the probability that the pixel point belongs to the question stem frame, the probability that it belongs to the answer frame, and the probability that it belongs to the background region. "cls" is an abbreviation of "classes" (classification result) and indicates that the convolution layer is used to predict the classification result, which is one of: the pixel point belongs to a question stem frame, the pixel point belongs to an answer frame, or the pixel point belongs to the background region.
FIG. 3a is an illustration of an image to be detected according to yet another embodiment of the present application. The label boxes of the stem box and the answer box are also included in fig. 3 a. Fig. 3b is a schematic diagram of the predicted value of the cls map corresponding to the image to be detected in fig. 3 a. The black area in fig. 3b is the background pixel area, the light rectangle frame is the question stem pixel area, and the dark rectangle frame is the answer pixel area. The cls map is a prediction classification map of the image to be detected to the text region. And 3 predicted values are generated for each pixel point in the image to be detected, wherein the 3 predicted values respectively represent the probability that the pixel point belongs to the question stem frame, the probability that the pixel point belongs to the answer frame and the probability that the pixel point belongs to the background area. The above 3 prediction values may be referred to as stem probability, answer probability and background probability, respectively.
In one embodiment, the method further comprises training the convolutional layer generating the cls map in the following manner: calculating the loss value of the probability of the pixel point belonging to the text prediction box by using the Dice coefficient difference function.
In one example, the labels used for training may be set as follows: for the pixel points in each question stem frame, the background probability value is 0, the question stem probability value is 1, and the answer probability value is 0. In order to solve the problem of sample imbalance, in the training stage the embodiment of the present application calculates the training loss by using dice loss (the Dice coefficient difference function) for each channel of the feature image. The Dice coefficient is a set similarity measurement function, generally used for calculating the similarity of two samples, with a value range of [0, 1]. The Dice coefficient is defined as follows:
s = 2|X ∩ Y| / (|X| + |Y|)
where s is the Dice coefficient, |X ∩ Y| is the number of elements in the intersection of X and Y, and |X| and |Y| respectively represent the numbers of elements of X and Y; the coefficient of the numerator is 2 because the denominator counts the common elements of X and Y twice.
The equation for the Dice coefficient difference function (Dice loss) is as follows:
d = 1 - s = 1 - 2|X ∩ Y| / (|X| + |Y|)
wherein d represents a loss value.
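A minimal sketch of a per-channel Dice loss matching the definitions above (s = 2|X ∩ Y| / (|X| + |Y|), d = 1 - s); the small epsilon added for numerical stability is an assumption.

```python
import torch

def dice_loss(pred, target, eps=1e-6):
    """pred, target: tensors of shape (N, H, W) with values in [0, 1]."""
    inter = (pred * target).sum(dim=(1, 2))                 # |X ∩ Y|
    union = pred.sum(dim=(1, 2)) + target.sum(dim=(1, 2))   # |X| + |Y|
    return (1.0 - 2.0 * inter / (union + eps)).mean()       # d = 1 - s

# The stem loss Lt, answer loss Ld and background loss Lb mentioned below can each
# be computed this way on the corresponding channel of the cls map.
```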
In the embodiment of the application, loss values corresponding to the stem probability, the answer probability and the background probability are respectively calculated by adopting dice loss. The above three loss values are respectively expressed as a stem loss Lt, an answer loss Ld, and a background loss Lb. In the process of forward prediction for processing an image to be detected by using a residual error neural network model, a feature map pyramid network and a convolution layer, according to 3 values of predicted stem probability, answer probability and background probability of each pixel point, softmax is adopted for normalization processing, and classification categories corresponding to the maximum probability values are obtained as classification results. The following formula 1 is a formula for calculating loss (loss value) of the classification result.
Equation 1:
Lcls = Lt + Ld + Lb
in one embodiment, the prediction position of the text prediction box to which the pixel point belongs includes: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
Fig. 4 is a schematic diagram of the predicted values of the rbox map in an image detection method according to another embodiment of the present application. "rbox" is an abbreviation of "rotate box" and indicates that the convolution layer is used to predict a tilted text prediction box with an angle. Taking horizontal typesetting as an example, the prediction angle is the included angle between the character direction of the text prediction box and the horizontal direction of the image to be detected. When the prediction angle is 0 or smaller than a certain preset threshold, the text prediction box can be considered to be presented essentially horizontally in the image to be detected. Most current test papers use horizontally typeset text, so in most cases the text prediction box is presented horizontally. Referring to fig. 2 and 4, in one example, the text prediction box may be a rectangle. The size of the rbox map is 128 × 5 and its feature dimension is 5: for each pixel point, 5 values are predicted and regressed, namely the distances from the pixel point in the image to be detected to the four sides of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected. Referring to fig. 4, the 5 regressed values of each pixel point are: the relative distance LL from the current pixel point to the left side of the text prediction box, the relative distance RL to the right side, the relative distance TL to the upper side, the relative distance BL to the lower side, and the prediction angle TT.
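The sketch below shows one assumed way to turn a single pixel's five rbox values into the corners of a rotated box; the rotation convention (rotation about the pixel, image y axis pointing down, angle in radians) is an assumption rather than the patent's exact decoding.

```python
import numpy as np

def decode_rbox(x, y, ll, rl, tl, bl, tt):
    """(x, y): pixel position; ll/rl/tl/bl: distances to the left/right/top/bottom
    sides of the box; tt: prediction angle in radians. Returns a (4, 2) corner array."""
    # Corners of the axis-aligned box relative to the pixel (image y grows downward)
    corners = np.array([[-ll, -tl], [rl, -tl], [rl, bl], [-ll, bl]], dtype=np.float32)
    rot = np.array([[np.cos(tt), -np.sin(tt)],
                    [np.sin(tt),  np.cos(tt)]], dtype=np.float32)
    return corners @ rot.T + np.array([x, y], dtype=np.float32)
```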
In one embodiment, the method further comprises training the convolutional layer that generates the rbox map in the following manner: and calculating the loss value of the prediction position of the text prediction box to which the pixel point belongs by using the intersection ratio loss function.
In the embodiment of the present application, the IoU loss (intersection-over-union loss) can be used to calculate the loss of the regressed LL, RL, TL and BL. The intersection-over-union ratio reflects how well the predicted text detection box overlaps the actual text detection box. For the loss calculation of LL, RL, TL and BL, the embodiment of the present application can use the following formula 3.
Equation 3:
Liou = -ln( (A ∩ A*) / (A ∪ A*) )
wherein A is the area of the prediction box and A* is the area of the real (ground-truth) box.
The embodiment of the present application may calculate the loss of the predicted angle by using the following formula 4.
Equation 4:
Langle = 1 - cos(TT - TT*)
wherein TT is the prediction angle and TT* is the real angle.
To sum up, the loss value of the predicted position of the text prediction box to which the pixel point belongs can be calculated by using the following formula 5.
Equation 5:
Lrbox = Liou + Langle
in another embodiment, L2 or smooth L1 loss (smoothed L1 loss function) may also be used to calculate a loss value for the predicted position of the text prediction box. Because the texts in the image to be detected are different in size and length under normal conditions, the effect of calculating the loss value of the text prediction box is better by adopting intersection.
In one embodiment, the predicted positions of the stem frames matched with the pixel points in the answer frame include: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
Fig. 5 is a schematic diagram illustrating a prediction value of near map according to an image detection method according to another embodiment of the present application. "near" means neighbor matching and indicates a coordinate value of the center point of the stem frame that is predicted to match a pixel in the answer frame by using the convolution layer. Referring to fig. 2 and 5, the size of the near map is 128 × 2, the characteristic dimension is 2, and each pixel predicts 2 values for regression, and is used for predicting the abscissa value and the ordinate value of the center point of the stem frame matched with the pixel in the image to be detected.
In the application scenario of photographing and judging questions or automatically judging questions, each answer in the test paper corresponds to one question stem. So in the matching process, the center point of the stem frame matched with each answer frame can be predicted. In near map, as shown in fig. 5, a corresponding stem frame may be matched for each pixel point in the answer frame. The matched question stem box label is shown as a rectangular box in fig. 5, and a black dot in the rectangular box is the central point of the question stem box. The black arrows in fig. 5 indicate the correspondence between the pixels in the answer frame and the stem frame.
In one embodiment, the method further comprises training the convolutional layer that generates the near map in the following manner: calculating the loss value Lnear of the predicted position of the stem frame matched with the pixel point in the answer frame by using the smooth L1 loss function.
The L1 norm loss function, also known as least absolute deviation, minimizes the sum of the absolute differences between the target values and the estimated values. The smooth L1 loss function is the L1 norm loss function after smoothing. The L1 norm loss function has the disadvantage of a breakpoint at the origin, where it is not smooth, which makes training unstable. The curve of the smooth L1 loss function is very close to that of the L1 norm loss function far from the origin, while its transition near the origin is very smooth. The smooth L1 loss function thus overcomes the drawbacks of the L1 norm loss function.
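A minimal sketch of computing Lnear with the smooth L1 loss; restricting the loss to pixels inside answer boxes via a mask is an assumption consistent with the matching described above.

```python
import torch
import torch.nn.functional as F

def near_loss(pred_centers, gt_centers, answer_mask):
    """pred_centers, gt_centers: (N, 2, H, W) stem-centre coordinate maps;
    answer_mask: (N, 1, H, W) with 1 on pixels inside answer boxes, 0 elsewhere."""
    diff = F.smooth_l1_loss(pred_centers, gt_centers, reduction="none")
    diff = diff * answer_mask                      # only answer-box pixels contribute
    return diff.sum() / answer_mask.sum().clamp(min=1.0)
```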
In summary, for the network structure shown in fig. 2, the following equation 6 can be used to calculate the total loss value of the network structure.
Equation 6:
loss = a · Lcls + b · Lrbox + c · Lnear
wherein, loss represents the total loss value of the network structure, and a, b and c are preset coefficients. In one example, a may be set to 0.3, b to 20, and c to 10.
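Formula 6 as a one-line sketch; which coefficient weights which branch loss is an assumption, since the text only states that a, b and c are preset.

```python
a, b, c = 0.3, 20.0, 10.0   # example coefficient values from the text

def total_loss(l_cls, l_rbox, l_near):
    # assumed pairing of the coefficients with the cls, rbox and near branch losses
    return a * l_cls + b * l_rbox + c * l_near
```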
Fig. 6 is a flowchart of an image detection method according to another embodiment of the present application. As shown in fig. 6, in an embodiment, in step S130 in fig. 1, obtaining a matching relationship between the stem frame and the answer frame of the image to be detected according to the probability that the pixel belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel belongs, and the prediction position of the stem frame matched with the pixel in the answer frame, specifically, the method may include:
step S210, obtaining a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel point belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs;
and step S220, obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the question stem frame set, the answer frame set and the predicted position of the question stem frame matched with the pixel points in the answer frame.
Fig. 7 is a flowchart of an image detection method according to another embodiment of the present application. As shown in fig. 7, in an embodiment, in step S210 in fig. 6, obtaining a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel belongs, which may specifically include:
step S310, determining pixel points belonging to a text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
step S320, aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and step S330, obtaining a stem frame set and an answer frame set of the image to be detected by using a non-maximum suppression algorithm according to the prediction position of the text prediction frame to which the obtained pixel point belongs.
Referring to fig. 2 and fig. 3 again, the probability that each pixel point in the image to be detected belongs to the text prediction box can be predicted in the cls map. In step S310, the pixel points belonging to the text prediction box in the image to be detected are determined according to the probability that the pixel points belong to the text prediction box and a preset probability threshold. For example, the background probability value of the predicted pixel point is 0.05, the stem probability value is 0.80, and the answer probability is 0.15. The preset probability threshold is 0.70. And determining that the pixel point belongs to the question stem frame because the question stem probability value is greater than the preset probability threshold.
Referring to fig. 2 and 4 again, the predicted position of the text prediction box to which each pixel point in the image to be detected belongs can be predicted in the rbox map, including the distance from the pixel point to each edge of the text prediction box to which the pixel point belongs and the predicted angle of the text prediction box relative to the image to be detected. In step S320, for the pixel points belonging to the stem frame and the answer frame determined in step S310, the prediction position of the text prediction frame to which the pixel points belong is obtained. In step S330, a candidate frame of the stem frame and a candidate frame of the answer frame in the image to be detected are obtained according to the prediction position of the text prediction frame to which the pixel point belongs, which is obtained in step S320. And then eliminating redundant candidate frames by using a Non-maximum suppression algorithm (NMS) to find the optimal detection position, and obtaining a stem frame set and an answer frame set of the image to be detected. The local maximum value can be searched by using a non-maximum value suppression algorithm, and non-maximum value elements are suppressed.
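A hedged sketch of steps S310 to S330: keep pixels whose stem or answer probability exceeds a threshold, decode their rbox predictions with the decode_rbox helper sketched earlier, then suppress redundant candidates. The channel order of the cls map is assumed, and nms_quadrilateral is a placeholder for any rotated-box non-maximum suppression routine.

```python
import numpy as np

def extract_box_sets(cls_probs, rbox_map, prob_thresh=0.70):
    """cls_probs: (3, H, W) softmax probabilities, assumed order (stem, answer, background);
    rbox_map: (5, H, W). Returns the stem box set S1 and answer box set S2 as lists of
    ((4, 2) corner array, score) pairs."""
    results = {}
    for name, channel in (("stem", 0), ("answer", 1)):
        prob = cls_probs[channel]
        ys, xs = np.where(prob > prob_thresh)           # pixels assigned to this class
        candidates = []
        for x, y in zip(xs, ys):
            ll, rl, tl, bl, tt = rbox_map[:, y, x]
            candidates.append((decode_rbox(x, y, ll, rl, tl, bl, tt), float(prob[y, x])))
        # placeholder for any rotated-box NMS routine that drops redundant candidates
        results[name] = nms_quadrilateral(candidates)
    return results["stem"], results["answer"]
```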
Fig. 8 is a flowchart of an image detection method according to another embodiment of the present application. As shown in fig. 8, in an embodiment, in step S220 in fig. 6, obtaining a matching relationship between the stem frame and the answer frame of the image to be detected according to the stem frame set, the answer frame set, and the predicted position of the stem frame matched with the pixel point in the answer frame includes:
step S410, obtaining the predicted position of the question stem frame matched with the answer frame by utilizing a clustering algorithm according to the predicted position of the question stem frame matched with the pixel point in the answer frame;
step S420, calculating Euclidean distance between the predicted position of the stem frame matched with the answer frame and the central point of each stem frame in the stem frame set;
and step S430, taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
Referring to fig. 2 and 5 again, the matching stem frame corresponding to each pixel point of the image to be detected can be predicted in the near map, and the predicted position of the stem frame matched with a pixel point in an answer frame can be obtained from it. For each answer box in the answer box set, in step S410 the predicted position of the stem box matching that answer box is obtained by using a clustering algorithm. For example, suppose a certain answer box contains 100 pixel points in total, of which 70 pixel points match the stem box whose content is "1+1=", 20 pixel points match the stem box whose content is "1+2=", and 10 pixel points match the stem box whose content is "1+3=". Then the stem box obtained by the clustering algorithm as matching this answer box is the stem box whose content is "1+1=".
In step S420, an euclidean distance between the predicted position of the stem frame matching the answer frame acquired in step S410 and the center point of each stem frame in the stem frame set is calculated. In step S430, the stem frame corresponding to the minimum euclidean distance calculated in step S420 is set as the stem frame matching the answer frame.
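The sketch below follows steps S410 to S430 for a single answer box: a simple majority vote over the rounded predicted centres stands in for the clustering step, and the stem box whose centre has the smallest Euclidean distance to that prediction is selected.

```python
import numpy as np
from collections import Counter

def match_answer_to_stem(answer_pixels, near_map, stem_centers):
    """answer_pixels: list of (x, y) positions inside one answer box;
    near_map: (2, H, W) predicted stem-centre coordinates;
    stem_centers: (K, 2) array of centre points of the detected stem boxes.
    Returns the index of the matching stem box."""
    # Majority vote over rounded centres as a stand-in for the clustering step
    votes = [tuple(np.round(near_map[:, y, x]).astype(int)) for x, y in answer_pixels]
    cx, cy = Counter(votes).most_common(1)[0][0]        # dominant predicted centre
    dists = np.linalg.norm(stem_centers - np.array([cx, cy]), axis=1)  # Euclidean distances
    return int(np.argmin(dists))
```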
Fig. 9 is a flowchart of a post-processing procedure of an image detection method according to another embodiment of the present application. The post-processing process includes obtaining a cls map, a rbox map and a near map in the network structure shown in fig. 2, and then further processing the enhanced feature images to finally obtain the matching relationship between the stem frame and the answer frame of the image to be detected. As shown in fig. 9, the steps of an exemplary post-processing procedure are as follows:
step 9.1: and performing softmax (normalization) processing on the cls map to obtain a classification result of each pixel point.
Step 9.2: Obtain the pixel points whose cls map classification result is the question stem frame or the answer frame; the positions of these points in the feature image can be represented in a pixel coordinate system. For the obtained pixel points belonging to the question stem frame and the answer frame, extract the corresponding predicted values on the rbox map to obtain candidate question stem frames and candidate answer frames.
Step 9.3: and processing the candidate question stem frames and the candidate answer frames by the NMS, and eliminating redundant candidate frames to obtain a question stem frame set S1 and an answer frame set S2. The number of the question stem frames is N, and the number of the answer frames is M.
Step 9.4: The variable i is assigned the value 0, and S3 denotes the matching set of question stem frames and answer frames.
Step 9.5: and obtaining the coordinates of the central point of the matched stem frame predicted by the pixel point in each answer frame in the set S2 from the near map, and obtaining the stem frame matched with the answer frame according to the stem frame matched with each pixel point in the answer frame. For example, a clustering algorithm may be used to obtain a stem box that matches the answer box.
Step 9.6: Judge whether i is smaller than M. If so, execute step 9.7 and step 9.8 and then return to step 9.6; that is, the answer frame set is traversed and the operation of step 9.7 is performed on each of its elements. If i is not smaller than M, the traversal is finished and the set S3 is output.
Step 9.7: and calculating Euclidean distances between the center point of the matched stem frame predicted by the ith answer frame and the center point of each stem frame in the stem frame set, and acquiring the stem frame corresponding to the minimum Euclidean distance as the stem frame matched with the ith answer frame. The ith answer box and its matching stem box are incorporated into the set S3.
Step 9.8: the variable i is incremented by 1.
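A compact sketch tying steps 9.1 to 9.8 together, reusing extract_box_sets and match_answer_to_stem from the sketches above; softmax, box_center and pixels_inside are simplified stand-ins (a real implementation would test membership in the rotated box rather than in its bounding rectangle).

```python
import numpy as np

def softmax(logits, axis=0):
    e = np.exp(logits - logits.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def box_center(corners):                  # corners: (4, 2) array
    return corners.mean(axis=0)

def pixels_inside(corners):
    # crude stand-in: iterate the axis-aligned bounding rectangle of the box
    x0, y0 = np.floor(corners.min(axis=0)).astype(int)
    x1, y1 = np.ceil(corners.max(axis=0)).astype(int)
    return [(x, y) for y in range(y0, y1 + 1) for x in range(x0, x1 + 1)]

def post_process(cls_logits, rbox_map, near_map):
    # Steps 9.1-9.3: classify pixels, decode candidate boxes, suppress duplicates
    stem_boxes, answer_boxes = extract_box_sets(softmax(cls_logits), rbox_map)
    stem_centers = np.array([box_center(corners) for corners, _ in stem_boxes])
    matches = []                                        # the matching set S3
    # Steps 9.4-9.8: for every answer box, find its matching stem box
    for corners, _ in answer_boxes:
        idx = match_answer_to_stem(pixels_inside(corners), near_map, stem_centers)
        matches.append((corners, stem_boxes[idx][0]))
    return matches
```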
Fig. 10 is a schematic structural diagram of an image detection apparatus according to an embodiment of the present application. As shown in fig. 10, the apparatus may include:
the extraction unit 100 is configured to perform feature extraction on an image to be detected to obtain a feature image to be detected;
the processing unit 200 is configured to process the feature image to be detected by using the convolution layer to obtain a probability that a pixel point in the image to be detected belongs to the text prediction box, a prediction position of the text prediction box to which the pixel point belongs, and a prediction position of the question stem box matched with the pixel point in the answer box; the text prediction box comprises a question stem box and an answer box;
the detection unit 300 is configured to obtain a matching relationship between the stem frame and the answer frame of the image to be detected according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs, and the prediction position of the stem frame matched with the pixel point in the answer frame.
In one embodiment, the extraction unit 100 is configured to:
inputting an image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
In one embodiment, the prediction position of the text prediction box to which the pixel point belongs includes: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
In one embodiment, the predicted positions of the stem frames matched with the pixel points in the answer frame include: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
Fig. 11 is a schematic structural diagram of a detection unit of an image detection apparatus according to an embodiment of the present application. As shown in fig. 11, in one embodiment, the detection unit 300 includes:
the first detection subunit 310 is configured to obtain a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel belongs;
the second detecting subunit 320 is configured to obtain a matching relationship between the stem frame and the answer frame of the image to be detected according to the stem frame set, the answer frame set, and the predicted position of the stem frame matched with the pixel point in the answer frame.
In one embodiment, the first detection subunit 310 is configured to:
determining pixel points belonging to the text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and obtaining a stem frame set and an answer frame set of the image to be detected by utilizing a non-maximum suppression algorithm according to the obtained prediction position of the text prediction frame to which the pixel point belongs.
In one embodiment, the second detection subunit 320 is configured to:
according to the predicted position of the question stem frame matched with the pixel points in the answer frame, the predicted position of the question stem frame matched with the answer frame is obtained by utilizing a clustering algorithm;
calculating Euclidean distance between the predicted position of the stem frame matched with the answer frame and the central point of each stem frame in the stem frame set;
and taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
Fig. 12 is a schematic structural diagram of an image detection apparatus according to another embodiment of the present application. As shown in fig. 12, in one embodiment, the apparatus further comprises a training unit 400 for training the convolutional layer in at least one of the following ways:
calculating the loss value of the probability of the pixel point belonging to the text prediction box by using the Dice coefficient difference function;
calculating a loss value of a prediction position of a text prediction box to which the pixel point belongs by using an intersection ratio loss function;
and calculating the loss value of the predicted position of the stem frame matched with the pixel point in the answer frame by using the smooth L1 loss function.
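A hedged sketch of these three loss terms (tensor shapes, the side-distance ordering and the exact IoU formulation are assumptions; only the choice of Dice, intersection-over-union and smooth L1 losses comes from this application):

```python
import torch
import torch.nn.functional as F

def dice_loss(pred_score: torch.Tensor, gt_score: torch.Tensor, eps: float = 1e-6):
    """Dice-coefficient difference loss for the per-pixel text-score map."""
    inter = (pred_score * gt_score).sum()
    union = pred_score.sum() + gt_score.sum() + eps
    return 1.0 - 2.0 * inter / union

def iou_loss(pred_dists: torch.Tensor, gt_dists: torch.Tensor, eps: float = 1e-6):
    """Intersection-over-union loss on the (top, bottom, left, right) side distances."""
    pt, pb, pl, pr = pred_dists.unbind(dim=1)
    gt_t, gt_b, gt_l, gt_r = gt_dists.unbind(dim=1)
    pred_area = (pt + pb) * (pl + pr)
    gt_area = (gt_t + gt_b) * (gt_l + gt_r)
    inter = (torch.min(pt, gt_t) + torch.min(pb, gt_b)) * \
            (torch.min(pl, gt_l) + torch.min(pr, gt_r))
    union = pred_area + gt_area - inter
    return -torch.log((inter + eps) / (union + eps)).mean()

def stem_center_loss(pred_center: torch.Tensor, gt_center: torch.Tensor):
    """Smooth-L1 loss on the predicted centre of the matched stem frame."""
    return F.smooth_l1_loss(pred_center, gt_center)
```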
For the functions of each module in each apparatus of the embodiments of the present application, reference may be made to the corresponding description in the above method, which is not repeated here.
FIG. 13 is a block diagram of an electronic device used to implement embodiments of the present application. As shown in fig. 13, the electronic device includes: a memory 910 and a processor 920, the memory 910 storing a computer program executable on the processor 920. The processor 920 implements the image detection method in the above-described embodiments when executing the computer program. There may be one or more memories 910 and one or more processors 920.
The electronic device further includes:
and a communication interface 930, configured to communicate with an external device for interactive data transmission.
If the memory 910, the processor 920 and the communication interface 930 are implemented independently, the memory 910, the processor 920 and the communication interface 930 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 13, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 910, the processor 920 and the communication interface 930 are integrated on a chip, the memory 910, the processor 920 and the communication interface 930 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor configured to call and execute instructions stored in a memory, so that a communication device in which the chip is installed executes the method provided in the embodiments of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), another general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be a processor supporting the advanced RISC machines (ARM) architecture.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. Volatile memory can include random access memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available, for example, static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synclink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the implementation may take the form, in whole or in part, of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the present application are generated in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or another programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented in software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the methods of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination thereof.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (16)

1. An image detection method, comprising:
extracting the characteristics of the image to be detected to obtain a characteristic image to be detected;
processing the characteristic image to be detected by using a convolution layer to obtain the probability that pixel points in the image to be detected belong to a text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of a question stem frame matched with the pixel points in an answer frame; wherein the text prediction box comprises the stem box and the answer box;
obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs and the prediction position of the question stem frame matched with the pixel point in the answer frame;
the method for obtaining the matching relation between the question stem frame of the image to be detected and the answer frame comprises the following steps of obtaining the matching relation between the question stem frame of the image to be detected and the answer frame according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs and the prediction position of the question stem frame matched with the pixel point in the answer frame, wherein the matching relation comprises the following steps: obtaining a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel point belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs; and obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the question stem frame set, the answer frame set and the predicted position of the question stem frame matched with the pixel points in the answer frame.
2. The method according to claim 1, wherein the step of extracting the features of the image to be detected to obtain the feature image to be detected comprises the following steps:
inputting the image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
3. The method of claim 1, wherein the predicting the position of the text prediction box to which the pixel point belongs comprises: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
4. The method of claim 1, wherein the predicted location of the stem frame matching the pixel points in the answer frame comprises: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
5. The method of claim 1, wherein obtaining a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel point belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs comprises:
determining the pixel points belonging to the text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and obtaining a stem frame set and an answer frame set of the image to be detected by utilizing a non-maximum suppression algorithm according to the obtained prediction position of the text prediction frame to which the pixel point belongs.
6. The method of claim 1, wherein obtaining the matching relationship between the stem frame and the answer frame of the image to be detected according to the stem frame set, the answer frame set, and the predicted position of the stem frame matched with the pixel point in the answer frame comprises:
according to the predicted position of the question stem frame matched with the pixel points in the answer frame, the predicted position of the question stem frame matched with the answer frame is obtained by utilizing a clustering algorithm;
calculating Euclidean distance between the predicted position of the question stem frame matched with the answer frame and the central point of each question stem frame in the question stem frame set;
and taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
7. The method of any one of claims 1 to 4, further comprising training the convolutional layer in at least one of:
calculating the loss value of the probability of the pixel point belonging to the text prediction box by using a Dice coefficient difference function;
calculating a loss value of the prediction position of the text prediction box to which the pixel point belongs by using an intersection ratio loss function;
and calculating the loss value of the predicted position of the question stem frame matched with the pixel points in the answer frame by using a smooth L1 loss function.
8. An image detection apparatus, characterized by comprising:
the extraction unit is used for extracting the characteristics of the image to be detected to obtain a characteristic image to be detected;
the processing unit is used for processing the characteristic image to be detected by utilizing the convolution layer to obtain the probability that pixel points in the image to be detected belong to the text prediction frame, the prediction position of the text prediction frame to which the pixel points belong and the prediction position of the question stem frame matched with the pixel points in the answer frame; wherein the text prediction box comprises the stem box and the answer box;
the detection unit is used for obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the probability that the pixel point belongs to the text prediction frame, the prediction position of the text prediction frame to which the pixel point belongs and the prediction position of the question stem frame matched with the pixel point in the answer frame;
wherein the detection unit includes: the first detection subunit is used for obtaining a stem frame set and an answer frame set of the image to be detected according to the probability that the pixel point belongs to the text prediction frame and the prediction position of the text prediction frame to which the pixel point belongs; and the second detection subunit is used for obtaining the matching relation between the question stem frame and the answer frame of the image to be detected according to the question stem frame set, the answer frame set and the predicted position of the question stem frame matched with the pixel points in the answer frame.
9. The apparatus of claim 8, wherein the extraction unit is configured to:
inputting the image to be detected into a residual error neural network model to obtain a multilayer characteristic image;
and performing convolution calculation and up-sampling on the upper layer characteristic image in the multilayer characteristic image in sequence by using the characteristic image pyramid network, and performing splicing operation on the upper layer characteristic image and the corresponding lower layer characteristic image after the convolution calculation and the up-sampling to obtain the characteristic image to be detected.
10. The apparatus of claim 8, wherein the predicted position of the text prediction box to which the pixel point belongs comprises: the distance from the pixel point to each side of the text prediction box and the prediction angle of the text prediction box relative to the image to be detected.
11. The apparatus of claim 8, wherein the predicted position of the stem frame matching the pixel point in the answer frame comprises: and the coordinate value of the central point of the question stem frame matched with the pixel point in the answer frame.
12. The apparatus of claim 8, wherein the first detection subunit is configured to:
determining the pixel points belonging to the text prediction box in the image to be detected according to the probability that the pixel points belong to the text prediction box;
aiming at the pixel points belonging to the text prediction box, obtaining the prediction positions of the text prediction box to which the pixel points belong;
and obtaining a stem frame set and an answer frame set of the image to be detected by utilizing a non-maximum suppression algorithm according to the obtained prediction position of the text prediction frame to which the pixel point belongs.
13. The apparatus of claim 8, wherein the second detection subunit is configured to:
according to the predicted position of the question stem frame matched with the pixel points in the answer frame, the predicted position of the question stem frame matched with the answer frame is obtained by utilizing a clustering algorithm;
calculating Euclidean distance between the predicted position of the question stem frame matched with the answer frame and the central point of each question stem frame in the question stem frame set;
and taking the question stem frame corresponding to the minimum Euclidean distance as the question stem frame matched with the answer frame.
14. The apparatus according to any one of claims 8 to 11, further comprising a training unit for training the convolutional layer in at least one of:
calculating the loss value of the probability of the pixel point belonging to the text prediction box by using a Dice coefficient difference function;
calculating a loss value of the prediction position of the text prediction box to which the pixel point belongs by using an intersection ratio loss function;
and calculating the loss value of the predicted position of the question stem frame matched with the pixel points in the answer frame by using a smooth L1 loss function.
15. An electronic device comprising a processor and a memory, the memory having stored therein instructions that are loaded and executed by the processor to implement the method of any of claims 1 to 7.
16. A computer-readable storage medium, in which a computer program is stored which, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202010867104.8A 2020-08-26 2020-08-26 Image detection method, image detection device, electronic equipment and storage medium Active CN111738249B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010867104.8A CN111738249B (en) 2020-08-26 2020-08-26 Image detection method, image detection device, electronic equipment and storage medium

Publications (2)

Publication Number Publication Date
CN111738249A CN111738249A (en) 2020-10-02
CN111738249B (en) 2020-12-08

Family

ID=72658880

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010867104.8A Active CN111738249B (en) 2020-08-26 2020-08-26 Image detection method, image detection device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN111738249B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112308053B (en) * 2020-12-29 2021-04-09 北京易真学思教育科技有限公司 Detection model training and question judging method and device, electronic equipment and storage medium
CN112989768B (en) * 2021-04-26 2021-08-27 北京世纪好未来教育科技有限公司 Method and device for correcting connection questions, electronic equipment and storage medium
CN113255629B (en) * 2021-07-15 2022-08-02 北京世纪好未来教育科技有限公司 Document processing method and device, electronic equipment and computer readable storage medium
CN113705736A (en) * 2021-10-27 2021-11-26 北京世纪好未来教育科技有限公司 Answer determining method, question judging method and device and electronic equipment
CN115620333B (en) * 2022-12-05 2023-03-10 蓝舰信息科技南京有限公司 Test paper automatic error correction method based on artificial intelligence
CN116993963B (en) * 2023-09-21 2024-01-05 腾讯科技(深圳)有限公司 Image processing method, device, equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103310082A (en) * 2012-03-07 2013-09-18 爱意福瑞(北京)科技有限公司 Paper inspection method and device
US10223642B1 (en) * 2016-01-06 2019-03-05 John Felder System for matching individuals and facilitating interactions between users in a safety conscious environment
CN108304835B (en) * 2018-01-30 2019-12-06 百度在线网络技术(北京)有限公司 character detection method and device
CN109684980B (en) * 2018-09-19 2022-12-13 腾讯科技(深圳)有限公司 Automatic scoring method and device



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant