CN116824608A - Answer sheet layout analysis method based on target detection technology - Google Patents


Info

Publication number
CN116824608A
CN116824608A (application number CN202310667530.0A)
Authority
CN
China
Prior art keywords
image
text
result
target detection
answer sheet
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310667530.0A
Other languages
Chinese (zh)
Inventor
付鹏斌
张旭
杨惠荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN202310667530.0A priority Critical patent/CN116824608A/en
Publication of CN116824608A publication Critical patent/CN116824608A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40Document-oriented image-based pattern recognition
    • G06V30/41Analysis of document content
    • G06V30/412Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00Arrangements for image or video recognition or understanding
    • G06V10/70Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/82Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/146Aligning or centring of the image pick-up or image-field
    • G06V30/147Determination of region of interest
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/14Image acquisition
    • G06V30/148Segmentation of character regions
    • G06V30/15Cutting or merging image elements, e.g. region growing, watershed or clustering-based techniques
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06VIMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10Character recognition
    • G06V30/19Recognition using electronic means
    • G06V30/191Design or setup of recognition systems or techniques; Extraction of features in feature space; Clustering techniques; Blind source separation
    • G06V30/19173Classification techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention discloses an answer sheet layout analysis method based on target detection, which performs layout analysis on an answer sheet image. First, the original image is preprocessed to obtain an answer sheet image that is free of skew, properly exposed and free of border interference. Next, foreground content such as handwriting, graphics and tables is detected with YOLOv5s-DC, a layout target detection model trained with a decoupled head and an improved loss function, and the layout structure is analysed with a projection-based background segmentation method. After the foreground objects are assigned to their regions using the structural information obtained from this analysis, the MSER and SeamCarving algorithms complete missing foreground content and segment overlapping edges, the row and column lines of tables in the layout are detected, and the table structure is analysed. Finally, the text in the layout is recognized with a CRNN network.

Description

Answer sheet layout analysis method based on target detection technology
Technical Field
The invention relates to the fields of image segmentation, text recognition and machine learning, in particular to an answer sheet layout analysis method based on a target detection technology.
Background
Layout analysis is a key technology for extracting text, formulas, graphics, tables and other elements from document images for document recognition and document understanding. It is the first step of automatic scoring and other OCR tasks, and the quality of the layout analysis algorithm is critical to subsequent text recognition and natural language processing. Although many layout analysis methods for different scenes have been proposed in the document recognition field in recent years, these algorithms still have limitations when facing answer sheets. Unlike other printed document layouts such as papers and newspapers, and unlike handwritten chapter- and paragraph-level texts, an answer sheet contains a large amount of printed text and unconstrained handwritten text at the same time, which makes it difficult to process with conventional methods.
The handwritten text is the key content of the answer sheet image: it contains the students' answers to the subjective questions and is essential for automatic scoring, while the printed text in the background also carries important information such as the layout structure and the scores.
Traditional layout analysis techniques often use projection segmentation, edge detection, connected domains and similar methods to split or merge the layout top-down or bottom-up. On answer sheets these methods suffer from missed and false detections, and their output is plain text only, losing the structural information of the layout. Current machine-learning-based layout analysis algorithms are mostly suitable only for printed layouts or handwritten chapter-level text, and there is no dedicated method or dataset for answer sheet layout analysis.
Disclosure of Invention
To address these problems, the invention extracts the layout structure of the answer sheet by combining a target detection algorithm with traditional layout analysis methods: an improved YOLOv5s-DC target detection network is trained on an annotated answer sheet target detection dataset, and layout analysis of the answer sheet is realized through recursive segmentation, text line clustering, CRNN text recognition, table recognition and related algorithm steps.
The main steps for realizing the invention are as follows:
First, the original image is preprocessed: the image gamma value is adjusted so that the background pixels obtained after binarization are clear. Then, tilt correction based on Hough line detection is applied to the binarized image, rotating it by the angle of the longest detected line so that the background content of the answer sheet is horizontal. Next, a target detection algorithm detects and extracts the non-background elements in the layout, and the detected content areas are blanked to reduce the interference of handwritten text and other elements with the structural analysis. A recursive projection segmentation method then splits the remaining content in the order layout → column → row → text block. From each segmented text block region, text areas are extracted with the MSER algorithm, and a KNN-based text line clustering algorithm produces text lines usable for recognition. The target detection results are assigned by position to the segmented text regions, and the original image of each region is used to complete the detection results and cut their edges. In addition, for detected table content, the region each boundary belongs to is first determined and the false-detection areas produced during target detection are cut away before the table is analysed.
The answer sheet layout analysis method based on the target detection technology comprises the following steps:
Step one: perform tilt correction and gamma correction on the input answer sheet image, and crop blank or black border areas with a flood fill algorithm and a rectangle detection algorithm.
Step two: train the YOLOv5s-DC network on the constructed answer sheet layout analysis target detection dataset. The network improves the loss function with a decoupled detection head, EIOU and QFL, which supervise its classification and regression tasks; the trained network then performs target detection on the answer sheet obtained in the previous step. The detected target areas are blanked to reduce interference with subsequent steps.
Step three: recursively segment the remaining layout. Before segmentation the pixels are inverted to ease processing of the pixel distribution. When splitting columns, a horizontal erosion and a vertical dilation are first applied to the layout to highlight the features of the segmentation points. The image is then projected vertically; half of the maximum of the projection result is taken as the segmentation threshold, all maxima in the projection curve are extracted, those below the threshold are removed and those greater than or equal to it are retained. Adjacent potential segmentation points are merged into groups with a sliding window, and the mean of the coordinates within each group is taken as the segmentation position.
Step four: extract the text in each segmented region. Each region is detected with the maximally stable extremal region algorithm MSER; after the text areas are obtained, the centre position of each connected region and the distances between them are calculated. Strokes and characters of the text areas are merged with a KNN algorithm with K = 3 to obtain text line content regions usable for recognition.
Step five: compare the target detection result with the original image of each segmented region, extract missed text with MSER and merge it into the target detection result, and split overlapping parts of detection boxes with the SeamCarving algorithm to ensure extraction accuracy.
Step six: judge the detected table content with Hough line detection, trim the edges of the table detection box according to the position of its segmentation region, detect the circumscribed rectangle of the table with a rectangle detection algorithm, and correct the tilt of the table. The frame structure in the table is detected with dilation operations in the horizontal and vertical directions, and the overlapping points of the two results are taken as candidate table structure points. For each candidate structure point p(i, j), it is judged whether complete connecting lines exist to the structure points p(i+1, j) to its right and p(i, j+1) below it; if so, a text cell(i, j) exists there, otherwise the point is removed. After all candidate positions are judged, the final table structure is obtained, and the circumscribed rectangular frame at the centre of each table cell is extracted for text extraction.
Step seven: recognize the text content with a convolutional recurrent neural network. For special texts such as the fill-in marks of choice questions, character templates are added to a UTF-8 font library, after which a Text Recognition Generator tool can generate character image data for training the network.
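As an illustration of the gamma correction in step one, the sketch below applies a lookup-table gamma curve to an 8-bit grayscale image. The function name and the sample values are illustrative, not taken from the patent; an OpenCV pipeline would typically build the same lookup table and apply it with cv2.LUT.

```python
import numpy as np

def gamma_correct(img, gamma):
    """Apply gamma correction to an 8-bit grayscale image.

    A lookup table maps each intensity v to 255 * (v / 255) ** gamma,
    so gamma < 1 brightens mid-tones and gamma > 1 darkens them,
    which sharpens the background/foreground contrast before binarization.
    """
    lut = (255.0 * (np.arange(256) / 255.0) ** gamma).astype(np.uint8)
    return lut[img]

img = np.array([[0, 64, 128, 255]], dtype=np.uint8)
bright = gamma_correct(img, 0.5)  # mid-tones pushed up
dark = gamma_correct(img, 2.0)    # mid-tones pushed down
```

The extreme intensities 0 and 255 are fixed points of the curve, so only the mid-tones move.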
Drawings
Fig. 1 is a schematic view of the tilt correction effect. a) Before tilt correction; b) after tilt correction.
FIG. 2 is a schematic diagram of dataset construction.
FIG. 3 is a schematic diagram of a YOLOv5s-DC network.
Fig. 4 is a diagram of a decoupled detector head.
Fig. 5 is a target detection result diagram.
FIG. 6 is a schematic view of a split-column method. a) An original background image; b) A background image after lateral expansion; c) An original background projection result; d) Post-expansion background projection results.
Fig. 7 is a layout division result diagram.
Fig. 8 is a text line extraction result. a) The header portion; b) the gap-filling question portion.
Fig. 9 is a schematic diagram of the completion algorithm. a) Target detection result; b) completed result.
Fig. 10 is a combined schematic.
FIG. 11 is a comparison of the effects of the method. a) A target detection result; b) MSER results; c) Mixing the results.
Fig. 12 is an edge cutting effect diagram. a) Original image and segmentation result; b) Gradient image and segmentation result.
FIG. 13 is a diagram showing the choice question special characters.
Fig. 14 is a text recognition dataset.
Fig. 15 is a diagram showing the result of identifying the choice questions.
Fig. 16 is a diagram of other text recognition results. a) Question image; b) recognition result.
Fig. 17 is a table analysis result diagram. a) Header region; b) gap-filling question region.
Fig. 18 is a diagram showing the result of the overall layout analysis.
Detailed Description
The invention is further described below with reference to the drawings and detailed description.
The flow of the method related by the invention comprises the following steps:
(1) Image preprocessing. Straight lines in the image are detected with the Hough line detection algorithm. Taking the longest line as reference, the rotation matrix H required for correction is calculated from its angle relative to the horizontal, and the image is corrected with H. Fig. 1 compares the image before and after tilt correction.
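The rotation matrix H described above can be sketched as follows. `line_angle_deg` and `rotation_matrix` are hypothetical helper names; the matrix layout mirrors what cv2.getRotationMatrix2D(center, angle, 1.0) returns, with the angle taken from the endpoints of the longest Hough line.

```python
import numpy as np

def line_angle_deg(x1, y1, x2, y2):
    """Angle of a detected line relative to the horizontal, in degrees."""
    return np.degrees(np.arctan2(y2 - y1, x2 - x1))

def rotation_matrix(angle_deg, center):
    """2x3 affine matrix H rotating by angle_deg about center,
    in the same layout as cv2.getRotationMatrix2D(center, angle_deg, 1.0)."""
    t = np.deg2rad(angle_deg)
    c, s = np.cos(t), np.sin(t)
    cx, cy = center
    return np.array([[c, s, (1 - c) * cx - s * cy],
                     [-s, c, s * cx + (1 - c) * cy]])

# Rotating the longest line by its own angle makes it horizontal.
H = rotation_matrix(line_angle_deg(0, 0, 3, 4), (0, 0))
```

Applying H to the line's far endpoint (3, 4) maps it onto the horizontal axis at (5, 0), i.e. the line becomes level.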
(2) Construct the target detection data and improve and train the YOLOv5s model. The answer sheet images are annotated with the labelme image annotation tool: the upper-left and lower-right corners of each target are located with rectangular annotations and labelled by category. FIG. 2 shows the annotation software interface. Fig. 3 shows the basic structure of the YOLOv5s model and fig. 4 the specific structure of the head part in fig. 3. The invention builds the network model with the Pytorch deep learning framework and trains it for 200 rounds with an initial learning rate of 0.001, an input size of 1280x1280, a batch size of 12 and the SGD optimizer; fig. 5 is a test result obtained by inputting an answer sheet image into the target detection method. The loss function used in training mainly consists of three parts: the regression loss L_reg, the confidence loss L_conf and the classification loss L_cls:
L = L_reg + L_conf + L_cls
L_cls = -|y - σ|^β · ((1 - y)·log(1 - σ) + y·log(σ))
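A minimal numerical sketch of the quality focal loss term L_cls above; the default β, the function name and the clipping epsilon are illustrative choices, not values stated in the patent.

```python
import numpy as np

def quality_focal_loss(sigma, y, beta=2.0, eps=1e-9):
    """Quality focal loss for a soft target y in [0, 1]:
    QFL = -|y - sigma|**beta * ((1 - y)*log(1 - sigma) + y*log(sigma)).
    The factor |y - sigma|**beta down-weights well-predicted examples."""
    sigma = np.clip(sigma, eps, 1.0 - eps)
    return -np.abs(y - sigma) ** beta * ((1.0 - y) * np.log(1.0 - sigma)
                                         + y * np.log(sigma))
```

The loss vanishes when the prediction matches the soft target exactly and grows with the prediction error, as the modulating factor intends.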
(3) Segmentation of the layout background. Common answer sheet layouts can be divided into two types according to their columns, single-column and multi-column answer sheets, as shown in the figure. Single-column answer sheets are generally A4-sized and commonly used for English, physics, chemistry, biology, geography, politics and history; multi-column answer sheets are generally A3-sized, usually divided into 3 columns, and commonly used for Chinese and mathematics. To analyse the layout structure, the sheet should be split into columns and rows. Seen as a whole, the columns of a multi-column answer sheet have the same area and a regular shape, so a conventional gray-level histogram is used to measure the gray distribution and split the columns. The specific algorithm flow is as follows:
step1: in order to highlight the characteristic of the column positions, interference of projection of non-frame lines is reduced, and expansion processing is carried out on black parts in the image along the horizontal direction. Fig. 6 shows the expansion before and after and the projection results, and it is obvious that the lines in the vertical direction are thickened, while the lines in the horizontal direction are unchanged.
Step2: and calculating the gray sum of each column of the image along the vertical direction to obtain a gray histogram. Fig. 6 (c) and 6 (d) are gray scale projection comparisons before and after expansion. It can be found that the segmentation points in the projection result are more obvious after morphological processing, and the segmentation points are larger in difference and proportion from other points in terms of numerical value.
Step3: and taking half of the maximum value in the projection result as a segmentation threshold t, and taking the point larger than t as a to-be-segmented point.
Step4: for each pending partition point, the partition points with horizontal distances less than 100 pixels are merged into a group.
Step5: taking the center of the undetermined dividing point in each group as a final dividing point, and dividing the image along the vertical direction.
END
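Step1–Step5 above reduce, after the morphological processing, to a projection-threshold-merge routine; a sketch follows. The 100-pixel window comes from Step4 and the half-maximum threshold from Step3; the function name is mine.

```python
import numpy as np

def split_positions(binary, axis=0, win=100):
    """Find split positions from a projection profile.

    binary: 2-D array with foreground pixels == 1 (already inverted).
    Sums along `axis` to build the profile, keeps positions at or above
    half the profile maximum (Step3), groups candidates closer than
    `win` pixels (Step4), and returns each group's centre (Step5).
    """
    profile = binary.sum(axis=axis)
    t = profile.max() / 2.0
    candidates = np.flatnonzero(profile >= t)
    groups, current = [], [candidates[0]]
    for x in candidates[1:]:
        if x - current[-1] < win:
            current.append(x)
        else:
            groups.append(current)
            current = [x]
    groups.append(current)
    return [int(np.mean(g)) for g in groups]
```

On a synthetic page with two full-height frame lines and a small noise blob, only the frame-line columns survive the half-maximum threshold.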
The segmentation algorithm is applied recursively to perform column, row and block segmentation of the layout in turn. Fig. 7 shows the layout segmentation result.
After all regions have been segmented, the distribution dispersion of the elements in each region is computed with the following region discrimination algorithm to decide whether the region is a choice-question filling region.
Compared with other areas, the choice-question answering area has obvious characteristics: after noise points are removed, its internal texts are evenly distributed in a regular chessboard pattern.
Define the distance from each connected domain to its nearest neighbour as D. Two such distances D_i, D_j taken inside a choice-question area are much closer in size to each other than distances D_m, D_n taken in a non-choice area (e.g. a gap-filling question). Therefore, the choice-question area can be distinguished simply by judging the positional distribution and the number of the connected domains in each area.
The specific algorithm comprises the following steps:
BEGIN
step1: in order to integrate the numbers in each question number into one connected domain, the distance between adjacent connected domains is more uniform, and the input image is subjected to expansion processing.
Step2: extracting all connected domains C in the region.
Step3: calculating each connected domain c i Center point coordinate p of (2) i
Step4: calculating each connected domain c i To other c j Finding the nearest neighbor of each connected domain and putting its distance into the list L.
Step5: the variance V of L is calculated. If V is larger than the threshold t and the number of connected domains is larger than n, the current area is judged to be the choice question area.
END
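Step3–Step5 of the discrimination algorithm amount to building the nearest-neighbour distance list L and its variance V; a sketch under the assumption that the connected-domain centres are already available (the function name is illustrative, and thresholding against t and n is left to the caller).

```python
import numpy as np

def nn_distance_stats(centers):
    """Nearest-neighbour distance list L and its variance V for the
    centre points of a region's connected domains (Step3-Step5)."""
    pts = np.asarray(centers, dtype=float)
    # Pairwise distance matrix; the diagonal is masked so a point
    # is never its own nearest neighbour.
    d = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nearest = d.min(axis=1)
    return nearest, nearest.var()
```

A perfectly regular grid of centres gives zero variance, while an irregular layout gives a strictly positive one, which is the distinguishing statistic.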
(4) Background text extraction
After the background image is segmented, the text of each part must be further extracted and recognized to obtain its content. For this purpose, the MSER algorithm is used here to extract the printed text in each region. Since a single maximally stable region may not be a whole character, or may be only part of one, the extracted regions must be merged to ensure that characters of different sizes are recognized correctly. If multi-connected-domain texts such as the Chinese characters '二' (two) and '三' (three) in the figure are not merged, recognition errors occur and the identification of the question numbers is affected.
The KNN algorithm is chosen here to merge the MSER regions. The value of K largely determines the text line merging effect: too small a K causes under-merging, with missed strokes or broken text lines, while too large a K increases the computation and lowers efficiency. For Chinese answer sheets, experiments show that the merging effect is best with K = 4. Merging turns single connected domains into text lines, which effectively improves the recognition rate and forms complete sentence structures convenient for subsequent semantic understanding. Fig. 8 shows the result of merging the text connected domains: the green rectangles are the circumscribed rectangles of the characters and the red lines the nearest-neighbour connections.
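The KNN-based merging can be approximated by linking each character centre to its K nearest neighbours and taking connected components of the resulting graph. `max_gap` is an extra assumption I add so distant lines are not linked; it is not a parameter stated in the text.

```python
import numpy as np

def cluster_text_lines(centers, k=4, max_gap=30):
    """Group character centres into text lines: link each centre to its
    k nearest neighbours within max_gap pixels, then return the
    connected components of the link graph (union-find)."""
    pts = np.asarray(centers, dtype=float)
    n = len(pts)
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    d = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    for i in range(n):
        for j in np.argsort(d[i])[:k]:
            if d[i, j] <= max_gap:
                parent[find(i)] = find(int(j))
    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())
```

Two rows of characters 100 pixels apart stay in separate clusters while characters within a row merge, mimicking the stroke-and-character merging described above.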
(5) Missing complement, edge cut
Although YOLOv5s-DC achieves high accuracy on the answer sheet layout target detection task, some strokes or text may still be missed, partly because rectangular boxes have limited representation capability and cannot cover some lengthy strokes or irregularly placed text. To this end, an algorithm is proposed here that completes the detection result with maximally stable extremal regions (Maximally Stable Extremal Regions). Erasing the target regions already detected by YOLOv5s-DC effectively reduces the computation required by the MSER algorithm; merging the completion result with the detection result then yields a complete extraction result. The completion effect is shown in fig. 9.
MSER is an algorithm for text detection in natural scenes. It binarizes the image repeatedly with a dynamic threshold and takes connected domains whose area changes little during this process as maximally stable extremal regions. The area change rate v_i of a connected domain can be calculated as v_i = |Q_i - Q_(i-Δ)| / Q_(i-Δ), where Q_i is the area of connected domain i and Q_(i-Δ) is the area of this connected domain after the threshold changes by Δ.
After extraction with MSER, the results of the two algorithms are merged. During merging, the invention centres on each text content rectangle obtained by target detection and merges small pieces of content within a certain range on its two sides. To avoid creating extra overlap, only the blank areas between coordinate-overlapping portions of two rectangles are connected, forming circumscribed polygons of the text lines, as shown in fig. 10. To verify the effectiveness of this hybrid method, experimental comparisons were performed; the effects of the three methods are shown in FIG. 11. Table 1 lists detailed metrics of the text detection results of the three methods on the same data: without noticeably increasing detection time, the proposed hybrid method fully exploits the efficiency and classification capability of the target detection network while using the high recall of the traditional method to supplement the content missed by target detection.
TABLE 1
For the problem of overlapping of detection frames, the method uses an edge segmentation algorithm to reposition the upper edge and the lower edge of the text in each detection frame, and the specific algorithm is as follows.
BEGIN
Step1: and (3) calculating gradient grad at each point in the I by using a Sobel operator, and directly using the sum of absolute values of gradient values in two directions as a gradient result in order to reduce the calculated amount of square and square again, wherein the absolute value sum is shown as a formula.
Step2: let M be the minimum path consumption matrix, B be the traceback path storage matrix, initialize M and B to be all 0 matrices of the same size as img. The initial iteration coordinates are (i, j), i is the row coordinates of j columns of minimum grads, and j is 0 as the initial value, which means calculation from left to right.
Step3: for each i, j iterates from 0 to w-1, w is the width of img, calculates the minimum energy sum according to the state transition of dynamic programming, and stores the current minimum energy sum corresponding to the last step coordinate into B, so as to represent the shortest path from (i, 0) to the current coordinate.
Step4: iterating each i repeatedly, and storing the consumption and coordinates of each path into M and B.
Step5: and taking the smallest point in the path end point on the right side of the M matrix as the path end point, backtracking according to the coordinates in the B, and adding the coordinates (i, j) obtained each time into the path.
END
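The Step1–Step5 dynamic programme above corresponds to a minimum-energy horizontal seam search; a compact sketch follows, under the assumption that the energy map (grad) has already been computed. The function name is mine.

```python
import numpy as np

def min_horizontal_seam(energy):
    """Minimum-energy left-to-right path through an energy map.

    M[i, j] is the cheapest cost of reaching pixel (i, j) from column 0,
    moving one column right and at most one row up or down per step;
    B[i, j] stores the previous row for backtracking (Step2-Step5)."""
    h, w = energy.shape
    M = energy.astype(float).copy()
    B = np.zeros((h, w), dtype=int)
    for j in range(1, w):
        for i in range(h):
            lo, hi = max(i - 1, 0), min(i + 1, h - 1)
            prev = lo + int(np.argmin(M[lo:hi + 1, j - 1]))
            B[i, j] = prev
            M[i, j] += M[prev, j - 1]
    i = int(np.argmin(M[:, -1]))  # cheapest end point in rightmost column
    path = [i]
    for j in range(w - 1, 0, -1):
        i = int(B[i, j])
        path.append(i)
    return path[::-1]  # row index for each column
```

On an energy map whose middle row is uniformly cheapest, the seam runs straight along that row, which is exactly the low-gradient gap between two overlapping text lines.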
Fig. 12 shows the result of text line edge positioning.
(6) Table analysis
First, the table area found by target detection is cut out to obtain the table image I. I is binarized into the binary image B and inverted. The list C of connected domains in the image is detected with a contour detection function; the list is traversed, the minimum circumscribed rectangle of each domain is calculated and put into the list records. After the traversal, the rectangle with the largest area in records is taken as the table localisation result, and the input image is cut along its x, y, w and h, which represent the abscissa, ordinate, width and height of the rectangle respectively. The cut binary image is opened several times along the horizontal and the vertical direction with one-dimensional structuring elements to extend the lines it contains, yielding two feature maps DC and DR that represent the horizontal and the vertical lines of the table. A bitwise AND of the two feature maps gives the feature map AND, in which the positions equal to 1 are the intersection points of the line segments in the table. To reduce the interference of repeated points, a sliding-window method de-duplicates the intersection positions along the horizontal and the vertical direction, giving the coordinate lists X and Y. Thereafter, for any structure point p(i, j), it is judged whether complete connecting lines exist to the structure points p(i+1, j) to its right and p(i, j+1) below it; if so, a text cell(i, j) exists there, otherwise the point is removed.
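The sliding-window de-duplication of the intersection coordinates into the lists X and Y can be sketched as follows; the 5-pixel window and the function names are illustrative assumptions, not values from the patent.

```python
import numpy as np

def dedupe_coords(coords, win=5):
    """Merge coordinates closer than `win` pixels into one value
    (the mean of each group), i.e. the sliding-window de-duplication."""
    coords = sorted(coords)
    merged, group = [], [coords[0]]
    for c in coords[1:]:
        if c - group[-1] <= win:
            group.append(c)
        else:
            merged.append(int(np.mean(group)))
            group = [c]
    merged.append(int(np.mean(group)))
    return merged

def structure_points(and_map, win=5):
    """Candidate structure coordinates from the bitwise-AND feature map:
    the x and y positions where AND == 1, de-duplicated per axis."""
    ys, xs = np.nonzero(and_map)
    return dedupe_coords(xs, win), dedupe_coords(ys, win)
```

A near-duplicate intersection one pixel away collapses into a single grid coordinate, leaving one X and one Y entry per table line.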
Finally, for each cell(i, j), the minimum circumscribed rectangle of its text is taken as the cutting edge and the text is extracted for subsequent recognition; the table analysis result is shown in fig. 17.
(7) Text content recognition. To recognize the text content in the layout and facilitate subsequent natural language processing, the invention uses the CRNN network after expanding the character library. Fig. 13 shows the expanded special characters; they are added to the UTF-8 font library, after which the printed character generation tool outputs character recognition training data such as that in fig. 14. The output of the network is processed with the CTC algorithm to obtain the final recognition result. For the input text feature sequence X = {x_t | t = 1, 2, ..., T} and the target sequence Y = {y_u | u = 1, 2, ..., U}, CTC introduces a blank placeholder '#' as the output when no content is emitted, and consecutive identical outputs of the network are separated by a blank. The mapping B from the network output to the final result is therefore defined as: 1. merge consecutive identical symbols; 2. remove the blank placeholders. For example: B(z#oo#o) = zoo.
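The mapping B (merge consecutive duplicates, then drop blanks) is small enough to sketch directly; '#' stands for the blank placeholder, and the function name is mine.

```python
def ctc_collapse(seq, blank='#'):
    """CTC mapping B: merge consecutive identical symbols, then remove
    the blank placeholders."""
    out, prev = [], None
    for s in seq:
        if s != prev:      # a repeated symbol collapses to one
            if s != blank:  # blanks never reach the output
                out.append(s)
            prev = s
    return ''.join(out)
```

The last case below shows why the blank is needed: without one between them, a genuine double letter such as "oo" would collapse to a single "o".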
For the loss and confidence calculations, CTC uses dynamic programming to solve the conditional probability of the output. Let the output Y corresponding to the input X be "ZOO". At time t1, the network may output only a blank or Z and still possibly end up with ZOO; likewise at time t2 it may keep the blank output or continue to output Z. For the input X, the probability that the output is Y is
p(Y|X) = Σ_(A ∈ B^(-1)(Y)) Π_(t=1..T) p_t(a_t | X)
where a_t is the result output on time slice t and p_t(a_t | X) is the probability of outputting a_t at time t. Assuming the outputs of the time slices are independently distributed, the probability of one combination path is the product of its per-slice outputs, and adding the probabilities over all combination paths gives the probability of outputting Y.
In model training, only the negative log-likelihood over the input set D needs to be minimised: L = -Σ_((X,Y) ∈ D) log p(Y|X).
In model prediction, the path combination with the maximum output probability is required, i.e. solving Y* = argmax_Y p(Y|X).
fig. 16 and 17 show recognition results of special characters and other characters in the layout, respectively.

Claims (7)

1. The answer sheet layout analysis method based on target detection is characterized by comprising the following steps:
step one, performing tilt correction and gamma correction on an input answer sheet image, and cropping blank or black border areas with a flood fill algorithm and a rectangle detection algorithm to obtain the answer sheet;
step two, training the YOLOv5s-DC network on a constructed answer sheet layout analysis target detection dataset, the network improving its loss function with a decoupled detection head, the efficient IOU loss EIOU and the quality focal loss QFL, which supervise its classification and regression tasks, and performing target detection on the answer sheet obtained in the previous step with the trained network; the detected target areas are blanked;
step three, recursively segmenting the remaining layout, the pixels being inverted before segmentation; when splitting columns, first performing a horizontal erosion and a vertical dilation on the layout to highlight the features of the segmentation points; then projecting the image vertically, taking half of the maximum of the projection result as the segmentation threshold, extracting all maxima in the projection curve, removing those smaller than the threshold and retaining those greater than or equal to it; merging adjacent potential segmentation points into groups with a sliding window, and taking the mean of the coordinates within each group as the segmentation position;
step four, extracting the text in each segmented region: detecting each region with the maximally stable extremal regions algorithm (MSER), and after the text regions are obtained, calculating the center position of each connected region and their pairwise distances; merging the strokes and characters of the text regions with a KNN algorithm with K = 3 to obtain the text-line content regions for recognition;
step five, comparing the target detection result with the original image of each segmented region, extracting missing text with MSER and merging it into the target detection result, and splitting the overlapped parts of detection boxes with a seam-carving algorithm to ensure extraction accuracy;
step six, judging the detected table content with Hough line detection, trimming the edges of the table detection box according to the positions of the divided regions, detecting the circumscribed rectangle of the table with a rectangle detection algorithm, and correcting the inclination of the table; detecting the frame structure of the table with dilation operations in the horizontal and vertical directions respectively, and taking the overlapping points of the two results as the candidate table structure point set P; for each candidate structure point p_(i,j) in P, where i and j denote the row and column numbers, examining the structure points p_(i+1,j) and p_(i,j+1) to its right and below; if complete connecting lines exist, a text cell cell_(i,j) is proved to exist, otherwise the candidate point is discarded; judging all candidate positions to obtain the final table structure, and extracting the circumscribed rectangular box at the center position of each table cell to extract the text;
and step seven, recognizing the text content with a convolutional recurrent neural network (CRNN); for filled-in choice questions, after adding character templates to a UTF-8 font library, character image data for training is generated with the Text Recognition Generator tool, and the network is trained on it.
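A minimal sketch of the border trimming in step one, assuming dark scan borders. The patent uses flood filling plus rectangle detection; this simplification instead drops outer rows and columns that are almost entirely dark (`dark_thresh` and `frac` are hypothetical values, not from the patent):

```python
import numpy as np

def trim_borders(gray, dark_thresh=32, frac=0.95):
    """Simplified border removal for a grayscale scan: keep only the
    rows/columns whose fraction of dark pixels is below `frac`."""
    dark = gray < dark_thresh
    rows = dark.mean(axis=1) < frac   # rows worth keeping
    cols = dark.mean(axis=0) < frac   # columns worth keeping
    r = np.where(rows)[0]
    c = np.where(cols)[0]
    return gray[r[0]:r[-1] + 1, c[0]:c[-1] + 1]
```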
2. The answer sheet layout analysis method based on target detection according to claim 1, wherein the improved target detection method in the second step is as follows:
a. labeling the answer sheet image by using a labelme image labeling tool; positioning the upper left corner and the lower right corner of the target by using a rectangular marking mode, and marking according to categories;
b. replacing the original detection head with a decoupled detection head comprising one dimension-reducing convolution conv_1, two efficient channel attention (ESE) modules and four convolutions, with 1x1 convolution kernels in the classification head and 3x3 kernels in the regression head; during training, EIoU loss and QFL supervise the regression and classification tasks respectively; the specific calculation formulas are as follows, where C is the diagonal length of the minimum circumscribed rectangle of the detection box and the ground-truth box, C_w and C_h are its width and height, ρ denotes the distance, and β is a hyperparameter balancing hard and easy samples, set to 1.5 in the experiments:

L_EIoU = 1 - IoU + ρ²(b, b_gt)/C² + ρ²(w, w_gt)/C_w² + ρ²(h, h_gt)/C_h²

L_cls = -|y - σ|^β ((1 - y)·log(1 - σ) + y·log(σ))
c. for the detection results output by the network, taking the maximum of the classification result as the classification confidence cls of the current target point and the regression confidence as the target-existence confidence reg, with the confidence threshold set to 0.3; cls multiplied by reg is taken as the final target-point confidence conf, and all points above the threshold are kept;
d. applying the non-maximum suppression algorithm (NMS) with an NMS threshold of 0.5: sorting the results of step c by confidence from high to low, repeatedly taking the highest-confidence result in the current list and computing its IoU against the remaining results, removing any result whose IoU exceeds the NMS threshold; after the traversal ends, storing the surviving results and setting the pixels of the corresponding areas in the layout image to 255.
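A sketch of steps c and d (confidence fusion cls × reg followed by greedy NMS); box tuples are (x1, y1, x2, y2) and all names are illustrative, not from the patent:

```python
def iou(a, b):
    # intersection-over-union of two (x1, y1, x2, y2) boxes
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    return inter / (area(a) + area(b) - inter + 1e-9)

def filter_and_nms(boxes, cls_conf, reg_conf, conf_th=0.3, nms_th=0.5):
    """conf = cls * reg as in step c; greedy NMS as in step d.
    Returns the indices of the kept boxes."""
    conf = [c * r for c, r in zip(cls_conf, reg_conf)]
    order = sorted((i for i, c in enumerate(conf) if c > conf_th),
                   key=lambda i: conf[i], reverse=True)
    keep = []
    for i in order:
        # keep a box only if it does not overlap a kept box too much
        if all(iou(boxes[i], boxes[j]) <= nms_th for j in keep):
            keep.append(i)
    return keep
```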
3. The answer sheet layout analysis method based on target detection according to claim 1, wherein the layout background segmentation method in the third step is as follows:
a. to highlight the column-position features, dilating the black parts of the image along the horizontal direction: binarizing the original image I and inverting it to obtain the binary image B, then eroding the image along the horizontal direction with a one-dimensional structuring element op_1 to obtain image E;
b. calculating the gray sum of each column of the image along the vertical direction to obtain a gray histogram P;
c. taking half of the maximum of the projection result P as the segmentation threshold t, and taking the points that are maxima of P and exceed the threshold t as pending segmentation points p_i;
d. for the pending segmentation points p_i, merging points whose horizontal distance is less than 100 pixels into one group;
e. taking the center of the pending segmentation points in each group as the final segmentation point f_i, and cutting the picture along the vertical direction;
f. continuing to recursively segment the remaining content of the image to obtain the final layout structure distribution list records, where each record_i stores the x, y, w, h of a region, corresponding to its abscissa, ordinate, width and height in the original image;
the method further comprises a step of judging choice-question regions, the specific method being as follows:
a. performing one dilation on each target area rect_i with a 3x3 operator;
b. extracting all connected domains C in the region;
c. calculating the center point coordinate p_i of each connected domain c_i;
d. calculating the distance from each connected domain c_i to every other c_j, finding the nearest neighbour of each connected domain and putting its distance into a list L;
e. calculating the variance V of L; if V is larger than the threshold t and the number of connected domains is larger than n, the current area is judged to be a choice-question area.
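The projection-based column segmentation of steps b–e above can be sketched as follows. As a simplification, the candidate set is every column at or above half the maximum projection rather than strict local maxima, and the 100-pixel window of step d groups neighbouring candidates:

```python
import numpy as np

def column_split_points(binary, window=100):
    """binary: 2-D array with text pixels = 1 (inverted image).
    Vertical projection; candidates are columns whose projection is at
    least half the global maximum; candidates closer than `window`
    pixels are grouped and the group mean is the split position."""
    proj = binary.sum(axis=0)
    th = proj.max() / 2.0
    cand = [x for x in range(len(proj)) if proj[x] >= th]
    if not cand:
        return []
    groups, cur = [], [cand[0]]
    for x in cand[1:]:
        if x - cur[-1] < window:
            cur.append(x)
        else:
            groups.append(cur)
            cur = [x]
    groups.append(cur)
    return [int(sum(g) / len(g)) for g in groups]
```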
4. The answer sheet layout analysis method based on target detection according to claim 1, wherein the background text extraction in the fourth step is performed by using MSER and KNN:
a. extracting all stroke regions c within rect_i using MSER;
b. for each stroke region c, calculating its area and the values w/h and h/w;
c. if the stroke area is smaller than 5 pixels or max(w/h, h/w) > 10, the target is too small or is an interference line rather than a stroke; any c meeting this condition is removed;
d. calculating the distance from each remaining stroke region c_i to every c_j, amplifying the vertical distance by a factor of 3 to form the matrix D, which reduces interference from crossing text lines;
e. for each c_i, finding the nearest 4 regions c_j and setting (i, j) and (j, i) to 1 in the connectivity matrix Con;
f. traversing Con with a depth-first traversal algorithm to obtain all groups;
g. using the minimum circumscribed rectangle of each group as the extraction boundary to obtain the text content.
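A sketch of steps d–f under stated assumptions: stroke regions are reduced to their center points, the vertical distance component is amplified (here by `scale_y = 3`), each region links to its k = 4 nearest neighbours within a hypothetical cut-off `link_dist`, and groups are collected by depth-first search:

```python
import numpy as np

def group_strokes(centers, k=4, scale_y=3.0, link_dist=50.0):
    """centers: list of (x, y) stroke center points.
    Returns connected groups of stroke indices."""
    n = len(centers)
    # amplify vertical distances to discourage cross-line links
    pts = np.asarray(centers, float) * np.array([1.0, scale_y])
    D = np.linalg.norm(pts[:, None] - pts[None, :], axis=2)
    con = np.zeros((n, n), bool)
    for i in range(n):
        for j in np.argsort(D[i])[1:k + 1]:   # skip self at index 0
            if D[i, j] <= link_dist:
                con[i, j] = con[j, i] = True
    # depth-first traversal of the connectivity matrix
    seen, groups = set(), []
    for s in range(n):
        if s in seen:
            continue
        stack, comp = [s], []
        while stack:
            v = stack.pop()
            if v in seen:
                continue
            seen.add(v)
            comp.append(v)
            stack.extend(np.where(con[v])[0])
        groups.append(sorted(comp))
    return groups
```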
5. The answer sheet layout analysis method based on target detection according to claim 1, wherein the missing-text completion and edge-cutting in the fifth step are specifically as follows:
a. distributing the target detection results to each rect_i according to coordinates and overlap area;
b. erasing the target areas detected by YOLOv5s-DC;
c. detecting the strokes and characters remaining in rect_i with the MSER algorithm;
d. comparing each MSER detection result c_i with the target detection results y_i; if c_i and y_i share an adjacent edge or an overlapping part, merging them into one target; when merging, joining only the regions where the coordinates of the two rectangular boxes overlap, rather than using their circumscribed rectangle;
after completion, cutting the edge of each text line, and the specific method is as follows:
a. calculating a gradient matrix grad at each point in the input image I by using a Sobel operator, and using the sum of absolute values of gradient values in two directions as a gradient result;
b. letting M be the minimum path-cost matrix and B the backtracking-path storage matrix, initializing M and B as all-zero matrices of the same size as I; the initial iteration coordinates are (i, j), where the initial i is the row coordinate of the minimum gradient in column j, and the initial value of j is 0, meaning the calculation proceeds from left to right;
c. for each i, iterating j from 0 to w-1, where w is the width of image I; computing the minimum energy sum according to the dynamic-programming state transition, and storing the current minimum energy sum and the coordinates of the previous step into M and B, representing the shortest path from (i, 0) to the current coordinates;
d. repeating the iteration for each i, storing the cost and coordinates of every path into M and B;
e. taking the point with the minimum cost among the path end points in the rightmost column of M as the path end point, backtracking according to the coordinates in B, and adding each obtained coordinate (i, j) to the path set path;
f. outputting path as the edge segmentation result, where path_i represents the edge row coordinate of the i-th column.
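The dynamic-programming seam of steps b–f can be sketched as follows; the Sobel energy of step a is assumed to be precomputed, and the recurrence allows row offsets of -1, 0, +1 between neighbouring columns:

```python
import numpy as np

def min_horizontal_seam(energy):
    """Left-to-right minimum-energy seam: M[i, j] holds the minimum
    cumulative energy of any path reaching (i, j); B stores the row
    offset (-1, 0, +1) taken from the previous column."""
    h, w = energy.shape
    M = np.zeros((h, w))
    B = np.zeros((h, w), int)
    M[:, 0] = energy[:, 0]
    for j in range(1, w):
        for i in range(h):
            prev = [(M[i + d, j - 1], d) for d in (-1, 0, 1)
                    if 0 <= i + d < h]
            best, d = min(prev)
            M[i, j] = energy[i, j] + best
            B[i, j] = d
    # backtrack from the cheapest end point in the last column
    i = int(np.argmin(M[:, -1]))
    path = [i]
    for j in range(w - 1, 0, -1):
        i += B[i, j]
        path.append(i)
    return path[::-1]   # path[j] = seam row in column j
```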
6. The answer sheet layout analysis method based on target detection according to claim 1, wherein the form analysis method in the step six is specifically as follows:
a. cutting a form area detected by target detection to obtain a form image I;
b. performing binarization processing on the I to obtain a binary image B, and performing inverse color processing;
c. detecting the list C of connected domains in the image with a contour-detection function; traversing the list, calculating the minimum circumscribed rectangle of each connected domain and putting it into the list records;
d. taking the rectangle with the largest area in records as the table-positioning result, and cutting the input image along its x, y, w and h, which respectively represent the abscissa, ordinate, width and height;
e. performing multiple opening operations on the cut image B along the horizontal and vertical directions with a one-dimensional dilation operator to extend the lines present in B, obtaining two feature maps DC and DR for the rows and columns, which represent the horizontal and vertical lines of the table respectively;
f. performing a bitwise AND on the two feature maps to obtain the feature map AND, in which the positions with value 1 are the intersection points of the line segments of the table;
g. de-duplicating the positions of all intersection points along the horizontal and vertical directions with a sliding-window method to obtain the coordinate lists X and Y;
h. for every structure point p_(i,j) formed from an x_i taken from X and a y_j taken from Y, examining the structure points p_(i+1,j) and p_(i,j+1) to its right and below; if complete connecting lines exist, a text cell cell_(i,j) is proved to exist, otherwise the point is discarded; the upper-left corner of cell_(i,j) is p_(i,j) and its lower-right corner is p_(i+1,j+1);
i. finally, for each cell_(i,j), taking the minimum circumscribed rectangle of the text at its center position as the cutting edge to obtain the text position.
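Steps e and f can be illustrated with plain NumPy in place of OpenCV morphology: a one-dimensional erosion keeps only pixels lying on sufficiently long horizontal or vertical runs, and the AND of the two maps marks candidate grid intersections. `min_len` is an illustrative parameter, and the patent's opening/dilation variant differs in detail:

```python
import numpy as np

def _erode_1d(img, length, axis):
    # keep a pixel only if `length` consecutive pixels along `axis` are set
    # (np.roll wraps around, a harmless approximation for zero borders)
    out = np.ones_like(img, bool)
    for s in range(length):
        out &= np.roll(img, s - length // 2, axis=axis)
    return out

def table_intersections(binary, min_len=5):
    """binary: 2-D array, line pixels = 1. Returns (row, col) points
    lying on both a horizontal and a vertical line."""
    horiz = _erode_1d(binary.astype(bool), min_len, axis=1)
    vert = _erode_1d(binary.astype(bool), min_len, axis=0)
    both = horiz & vert   # bitwise AND of the two feature maps
    return [tuple(p) for p in np.argwhere(both)]
```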
7. The answer sheet layout analysis method based on target detection according to claim 1, wherein the character recognition method in the step seven is specifically as follows:
a. editing the UTF-8 character library: drawing vector diagrams of the special characters, designating their character codes and exporting an expanded character-set file;
b. generating a special-character training set with the Text Recognition Generator tool using the expanded character set;
c. implementing and training the convolutional recurrent neural network (CRNN) with the PyTorch framework;
d. using the trained model to recognize the extracted text and special characters;
e. converting the sequence output by the network with connectionist temporal classification (CTC); the result is the final recognition result;
f. to enable visualization of the results under the character-set system, replacing the special characters output during recognition with regular characters.
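Step e's CTC conversion can be sketched with greedy decoding (per-frame arg-max, merge consecutive repeats, drop blanks); the charset and blank index below are illustrative:

```python
from itertools import groupby

def ctc_greedy_decode(logits, charset, blank=0):
    """Greedy CTC decoding: take the arg-max label per time step,
    merge consecutive repeats, drop blanks, then map indices to
    characters (charset index 0 is reserved for the blank here)."""
    best = [max(range(len(step)), key=step.__getitem__) for step in logits]
    merged = [k for k, _ in groupby(best)]
    return ''.join(charset[i] for i in merged if i != blank)
```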
CN202310667530.0A 2023-06-07 2023-06-07 Answer sheet layout analysis method based on target detection technology Pending CN116824608A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310667530.0A CN116824608A (en) 2023-06-07 2023-06-07 Answer sheet layout analysis method based on target detection technology


Publications (1)

Publication Number Publication Date
CN116824608A true CN116824608A (en) 2023-09-29

Family

ID=88112010


Country Status (1)

Country Link
CN (1) CN116824608A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117237585A (en) * 2023-11-10 2023-12-15 Shandong University of Science and Technology Optical mark positioning and identifying method, system, equipment and storage medium for answer sheet
CN117237585B (en) * 2023-11-10 2024-01-30 Shandong University of Science and Technology Optical mark positioning and identifying method, system, equipment and storage medium for answer sheet
CN117473980A (en) * 2023-11-10 2024-01-30 Institute of Medical Information, Chinese Academy of Medical Sciences Structured analysis method of portable document format file and related products

Similar Documents

Publication Publication Date Title
CN111814722B (en) Method and device for identifying table in image, electronic equipment and storage medium
Alaei et al. A new scheme for unconstrained handwritten text-line segmentation
Gatos et al. ICDAR2009 handwriting segmentation contest
CN113158808B (en) Method, medium and equipment for Chinese ancient book character recognition, paragraph grouping and layout reconstruction
Gatos et al. Segmentation of historical handwritten documents into text zones and text lines
JP5492205B2 (en) Segment print pages into articles
CN106875546A (en) A kind of recognition methods of VAT invoice
CN116824608A (en) Answer sheet layout analysis method based on target detection technology
CN105261109A (en) Identification method of prefix letter of banknote
CN113537227B (en) Structured text recognition method and system
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN115620322B (en) Method for identifying table structure of whole-line table based on key point detection
Roy et al. Text line extraction in graphical documents using background and foreground information
Rothacker et al. Word hypotheses for segmentation-free word spotting in historic document images
CN112446259A (en) Image processing method, device, terminal and computer readable storage medium
Ali et al. An efficient character segmentation algorithm for recognition of Arabic handwritten script
CN112241730A (en) Form extraction method and system based on machine learning
Kiessling A modular region and text line layout analysis system
CN115578741A (en) Mask R-cnn algorithm and type segmentation based scanned file layout analysis method
CN111832497B (en) Text detection post-processing method based on geometric features
JP2926066B2 (en) Table recognition device
CN115063817A (en) Form identification method and system based on morphological detection and storage medium
JP3798179B2 (en) Pattern extraction device and character segmentation device
Sarkar et al. Text line extraction from handwritten document pages based on line contour estimation
EP3966730A2 (en) Computer implemented method for segmenting a binarized document

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination