CN113076900A - Test paper head student information automatic detection method based on deep learning - Google Patents


Info

Publication number
CN113076900A
Authority
CN
China
Prior art keywords
network
text
data
student information
test paper
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110388294.XA
Other languages
Chinese (zh)
Other versions
CN113076900B (en)
Inventor
陈向乐 (Chen Xiangle)
黄双萍 (Huang Shuangping)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202110388294.XA
Publication of CN113076900A
Application granted
Publication of CN113076900B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F 18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition
    • G06V 30/14 Image acquisition
    • G06V 30/148 Segmentation of character regions
    • G06V 30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V 30/10 Character recognition

Abstract

The invention discloses a method for automatically detecting student information at the head of a test paper based on deep learning, which comprises the following steps: S1, acquiring data: the front surfaces of a plurality of student test papers are scanned with a scanner to obtain full test-paper images; S2, labeling data: the paper-head images are manually labeled to obtain detection frames of student information, and a training set and a test set are divided; S3, expanding the data volume through synthesized data; S4, constructing a text detector: the text detector is built from convolutional neural networks and comprises a feature extraction network, a candidate text region generation network, a region feature sampling module and a text positioning network, with a different loss function designed for each component network; S5, training the text detector; S6, testing: the test data are input into the trained text detector for detection. The method can detect both the printed to-be-filled items at the head of the test paper and the handwritten student information, and has the characteristic of high accuracy.

Description

Test paper head student information automatic detection method based on deep learning
Technical Field
The invention belongs to the technical field of computer vision, and particularly relates to a test paper head student information automatic detection method based on deep learning.
Background
Computer vision is an important research direction in the field of artificial intelligence, and has important application in the aspects of automatic driving, smart cities, man-machine interaction and the like. Among them, text detection is an important branch of the computer vision field, and has been rapidly developed in recent years.
Text detection has relevant applications in the field of education. In teaching practice, teachers need to grade student test papers, and the follow-up work usually includes entering the student information and scores of each paper into an electronic system, which facilitates statistics on examination performance and improvement of teaching schemes. In actual work, however, if a teacher is responsible for many classes and subjects, the excessive entry work undoubtedly costs the teacher considerable extra effort. It is therefore very meaningful to find an automatic and accurate method for entering student information.
In recent years, the research progress of deep neural networks has promoted the rapid development of target detection directions, and more detection algorithms are proposed.
Disclosure of Invention
The invention mainly aims to overcome the defects of the prior art and provides a method for automatically detecting student information at the head of a test paper based on deep learning, which can detect both the printed to-be-filled items and the handwritten student information at the head of a test paper, and has the characteristic of high accuracy.
In order to achieve the purpose, the invention adopts the following technical scheme:
The method for automatically detecting student information at the head of a test paper based on deep learning comprises the following steps:
S1, acquiring data, namely scanning the front surfaces of a plurality of student test papers by using a scanner to obtain a plurality of full test-paper images, and cropping the head position of each test-paper image to obtain a plurality of test-paper head images;
S2, labeling data, manually labeling the paper-head images to obtain detection frames of student information, and dividing a training set and a test set;
S3, synthesizing data, and expanding the data volume through synthesized data;
S4, constructing a text detector, wherein the text detector is constructed with convolutional neural networks, comprises a feature extraction network, a candidate text region generation network, a region feature sampling module and a text positioning network, and a different loss function is designed for each component network;
S5, training the text detector, adopting a pre-training model, setting training-related parameters, and inputting the labeled data into the text detector for training;
and S6, inputting the test data into the trained text detector for detection to obtain the detection results and probabilities of the student information.
Further, the step S2 specifically includes:
marking software is adopted to manually mark a horizontal rectangular frame of student information, including the marking of positions and categories;
recording the coordinates of the upper left corner of the horizontal rectangular frame and the width and height data in a file;
the images are randomly divided into a training set and a test set; an example annotation record is sketched below.
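For illustration only, one plausible record in such an annotation file is sketched below in Python; the field names and layout are assumptions for the example, not a format fixed by the method.

    # Hypothetical annotation record for one paper-head image (field names assumed).
    # Coordinates follow the scheme above: upper-left corner plus width and height.
    annotation = {
        "image": "paper_head_0001.png",
        "boxes": [
            {"x": 120, "y": 35, "w": 90,  "h": 40, "category": "printed_item"},
            {"x": 215, "y": 35, "w": 160, "h": 40, "category": "handwritten_info"},
        ],
    }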
Further, the step S3 specifically includes:
S31, carrying out statistical analysis of the manually labeled real data, including the aspect ratio of the test-paper head images, the aspect ratio and size of the labeling frames, and the distance between labeling frames;
S32, setting the width and height of the generated image and the text spacing according to the statistical results, automatically generating a test-paper head image that contains the items to be filled but no student information, and meanwhile storing the category and coordinates of each item to be filled;
S33, crawling student-information corpora from the Internet, including student names, classes and schools, filtering out entries longer than 10 characters, and storing the entries in different json files according to the item the information belongs to, so that each json file forms a corpus of student information for a different item;
S34, downloading a Chinese handwriting data set as the image library for subsequently pasting handwritten single-character images;
S35, for each item to be filled at the head of the test paper, randomly selecting a piece of information from the corresponding item corpus; for each character of that information, the image library contains a group of single-character images handwritten by different people, one of which is randomly selected and pasted, in sequence, to the right side of the item to be filled in the test-paper head image;
S36, performing affine transformation on the test-paper head image and adding salt-and-pepper noise, rotation and Gaussian blur;
and S37, synthesizing a plurality of images based on steps S31 to S36 and combining them with the manually labeled real data to form a training set; a code sketch of this synthesis pipeline is given below.
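A minimal sketch of steps S32 to S36 follows, assuming a corpus dictionary built in step S33 (item name to list of strings) and a char_bank dictionary built in step S34 (character to list of single-character PIL images). The ASCII item labels stand in for the real printed Chinese labels, and the sizes, spacing and noise rates are illustrative assumptions, not the patent's values.

    # Minimal synthesis sketch for S32-S36 (assumed inputs: corpus, char_bank).
    import random
    import numpy as np
    from PIL import Image, ImageDraw, ImageFilter

    def synthesize_head(corpus, char_bank, width=2000, height=200, gap=8):
        img = Image.new("RGB", (width, height), "white")
        draw = ImageDraw.Draw(img)
        boxes, x = [], 40
        for item in ("name", "class", "school"):           # items to be filled
            draw.text((x, 85), item + ":", fill="black")   # printed item label
            boxes.append((item, "printed_item", [x, 85, 70, 30]))  # rough box
            x += 90
            x0 = x
            for ch in random.choice(corpus[item]):         # one piece of info
                glyph = random.choice(char_bank[ch]).resize((48, 48))
                img.paste(glyph, (x, 76))                  # paste to the right
                x += 48 + gap
            boxes.append((item, "handwritten_info", [x0, 76, x - x0 - gap, 48]))
            x += 120
        # S36 perturbations: rotation, Gaussian blur, salt-and-pepper noise
        img = img.rotate(random.uniform(-2, 2), fillcolor="white")
        img = img.filter(ImageFilter.GaussianBlur(random.uniform(0.0, 1.0)))
        arr = np.array(img)
        arr[np.random.rand(height, width) < 0.001] = 255   # salt
        arr[np.random.rand(height, width) < 0.001] = 0     # pepper
        return Image.fromarray(arr), boxes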
Further, the feature extraction network specifically includes:
the feature extraction network adopts ResNet50 and a bidirectional feature pyramid network BiFPN in a residual neural network, and the ResNet50 improves the feature extraction capability and relieves the network degradation problem through a shortcut connection mode;
and the BiFPN performs bottom-up and top-down fusion on the extracted features of different layers simultaneously to finally obtain a multi-channel feature map F1.
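One way to realise this backbone is sketched below in PyTorch: the ResNet50 stages supply multi-level features, 1 × 1 lateral convolutions unify the channels, and a top-down pass followed by a bottom-up pass fuses the levels in both directions. The 256-channel width and the simplified unweighted fusion are assumptions; a full BiFPN adds learned fusion weights and repeated fusion blocks.

    # Backbone sketch: ResNet50 features fused by one BiFPN-style layer.
    import torch.nn as nn
    import torch.nn.functional as F
    import torchvision

    class Backbone(nn.Module):
        def __init__(self, ch=256):
            super().__init__()
            # older torchvision uses pretrained=True instead of weights=...
            r = torchvision.models.resnet50(weights="IMAGENET1K_V1")
            self.stem = nn.Sequential(r.conv1, r.bn1, r.relu, r.maxpool)
            self.stages = nn.ModuleList([r.layer1, r.layer2, r.layer3, r.layer4])
            self.lateral = nn.ModuleList(
                nn.Conv2d(c, ch, 1) for c in (256, 512, 1024, 2048))
            self.smooth = nn.ModuleList(
                nn.Conv2d(ch, ch, 3, padding=1) for _ in range(4))

        def forward(self, x):
            feats, x = [], self.stem(x)
            for stage in self.stages:                  # residual stages C2..C5
                x = stage(x)
                feats.append(x)
            p = [l(f) for l, f in zip(self.lateral, feats)]
            for i in range(len(p) - 2, -1, -1):        # top-down fusion
                p[i] = p[i] + F.interpolate(p[i + 1], size=p[i].shape[-2:])
            for i in range(1, len(p)):                 # bottom-up fusion
                p[i] = p[i] + F.interpolate(p[i - 1], size=p[i].shape[-2:])
            return [s(f) for s, f in zip(self.smooth, p)]  # feature maps F1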
Further, the network for generating the candidate text region specifically includes:
inputting the multi-channel feature map F1 into the candidate text region generation network to obtain a candidate text region R;
the candidate text region generation network comprises a two-classification network and a detection frame regression network;
in the binary classification network, F1 is input into convolutional layer 256C, with kernel size 3 × 3 and stride 1, and a 256-channel feature map F2 is output; F2 is then input into convolutional layer 2kC, with kernel size 1 × 1, stride 1 and 2k output channels;
in the detection-frame regression network, F1 is input into convolutional layer 256C for feature extraction to obtain feature map F2, which is then input into a convolutional layer with 4k output channels to obtain 4k coordinate regression results;
each pixel of feature map F1 predefines k anchors of different sizes and aspect ratios, and k candidate regions mapped back to the original image are obtained by anchor regression, each candidate region carrying 2 classification confidences that correspond to the 2k outputs of the binary classification network.
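A minimal PyTorch sketch of this head follows; k = 15 (the five anchor sizes times three aspect ratios given later in step S5) and the padding that keeps F2 the same spatial size as F1 are assumptions.

    import torch
    import torch.nn as nn

    class ProposalHead(nn.Module):
        def __init__(self, in_ch=256, k=15):
            super().__init__()
            self.conv = nn.Conv2d(in_ch, 256, 3, stride=1, padding=1)  # 256C
            self.cls = nn.Conv2d(256, 2 * k, 1, stride=1)  # 2k confidences
            self.reg = nn.Conv2d(256, 4 * k, 1, stride=1)  # 4k box offsets

        def forward(self, f1):
            f2 = torch.relu(self.conv(f1))         # 256-channel feature map F2
            return self.cls(f2), self.reg(f2)      # per-anchor scores and deltas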
Further, the region feature sampling module specifically includes:
given a feature map F1 of the whole map and a candidate text region R, the corresponding region of F1 is divided into m × m portions, and a feature vector is sampled for each portion to obtain a local region feature map F3 of m × m size.
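This sampling scheme corresponds to RoIAlign; the torchvision call below illustrates it, assuming m = 7 and a feature map at 1/8 of the input resolution (both values are assumptions).

    import torch
    from torchvision.ops import roi_align

    f1 = torch.randn(1, 256, 50, 250)                 # feature map of the whole image
    rois = torch.tensor([[0., 40., 10., 400., 58.]])  # (batch_idx, x1, y1, x2, y2)
    f3 = roi_align(f1, rois, output_size=(7, 7), spatial_scale=1 / 8)
    print(f3.shape)                                   # [1, 256, 7, 7]: m x m map F3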
Further, the text positioning network specifically includes:
inputting the local region feature map F3 into a text positioning network to obtain the probability of each region belonging to the text;
the text positioning network comprises two branches: a segmentation branch, and a detection-frame regression and classification branch; the latter comprises a detection-frame regression sub-branch and a detection-frame classification sub-branch;
in the segmentation branch, F3 is input into a fully convolutional network to obtain a text segmentation map Mask of the input image, distinguishing text pixels from background pixels at the pixel level;
in the detection-frame regression sub-branch, F3 is input into a fully connected layer, and the candidate text region R is regressed to obtain the detection frame of the text;
in the detection-frame classification sub-branch, F3 is input into a fully connected layer, the region inside the detection frame is classified, and the probability that the region belongs to text is output.
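A minimal PyTorch sketch of these three outputs on F3 follows; the fully connected width of 1024, the depth of the mask branch, and m = 7 are assumptions.

    import torch.nn as nn

    class LocalizationHead(nn.Module):
        def __init__(self, ch=256, m=7):
            super().__init__()
            self.mask = nn.Sequential(                    # segmentation branch (FCN)
                nn.Conv2d(ch, ch, 3, padding=1), nn.ReLU(),
                nn.ConvTranspose2d(ch, ch, 2, stride=2), nn.ReLU(),
                nn.Conv2d(ch, 1, 1))                      # text/background per pixel
            self.fc = nn.Sequential(nn.Flatten(),
                                    nn.Linear(ch * m * m, 1024), nn.ReLU())
            self.box = nn.Linear(1024, 4)                 # box regression sub-branch
            self.cls = nn.Linear(1024, 2)                 # box classification sub-branch

        def forward(self, f3):
            h = self.fc(f3)
            return self.mask(f3), self.box(h), self.cls(h)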
Further, the loss function is specifically:
for the segmentation branch of the text positioning network, the Dice loss is adopted, specifically:

L_mask = 1 - 2|X ∩ Y| / (|X| + |Y|)

wherein X is the predicted segmentation map and Y is the ground-truth segmentation map;

for the detection branches of the text positioning network and the candidate text region generation network, an IoU loss is adopted, specifically:

L_box = 1 - IoU, with IoU = |D ∩ G| / |D ∪ G|

wherein D is the detection frame and G is the ground-truth frame;

for the classification branches of the text positioning network and the candidate text region generation network, a binary cross-entropy loss function is adopted, specifically:

L_cls = -(ŷ log p + (1 - ŷ) log(1 - p))

wherein p denotes the predicted probability and ŷ denotes the ground-truth category;

the final loss function is defined as:

L = L_mask + L_box + L_cls
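Written directly from the three formulas above, a PyTorch sketch of the losses might look as follows; the epsilon terms, added to avoid division by zero, are an assumption.

    import torch
    import torch.nn.functional as F

    def dice_loss(x, y, eps=1e-6):               # x: predicted map, y: ground truth
        return 1 - 2 * (x * y).sum() / (x.sum() + y.sum() + eps)

    def iou_loss(d, g, eps=1e-6):                # boxes as (x1, y1, x2, y2) rows
        ix1, iy1 = torch.max(d[:, 0], g[:, 0]), torch.max(d[:, 1], g[:, 1])
        ix2, iy2 = torch.min(d[:, 2], g[:, 2]), torch.min(d[:, 3], g[:, 3])
        inter = (ix2 - ix1).clamp(min=0) * (iy2 - iy1).clamp(min=0)
        area = (d[:, 2] - d[:, 0]) * (d[:, 3] - d[:, 1]) \
             + (g[:, 2] - g[:, 0]) * (g[:, 3] - g[:, 1])
        return (1 - inter / (area - inter + eps)).mean()

    def total_loss(mask_pred, mask_gt, boxes_pred, boxes_gt, p, y):
        return (dice_loss(mask_pred, mask_gt)          # L_mask
                + iou_loss(boxes_pred, boxes_gt)       # L_box
                + F.binary_cross_entropy(p, y))        # L_cls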
Further, the step S5 is specifically:
taking a model trained by an Imagenet classification task as a pre-training model of the feature extraction network, and initializing parameters;
setting training-related parameters: model parameters are updated by stochastic gradient descent, with an initial learning rate lr, weight decay weight_decay, number of pictures per training batch batch_size, number of iterations iters, learning rate update strategy step, update coefficient lambda, and update step stepsize;
in the candidate text region generation network, the anchors are set to sizes of 32², 64², 128², 256², and 512², with aspect ratios of 1:1, 1:2, and 2:1;
and training the text detector: pictures and labels in the training set are read in batches, the pictures are input into the text detector to obtain prediction results, the loss between the predictions and the labels is calculated and reduced by gradient descent, the network parameters of the feature extraction network, the candidate text region generation network and the text positioning network are updated, and this process is iterated to find the optimal parameters.
Further, the step S6 specifically includes:
inputting the pictures in the test set into the trained text detector for forward inference;
after the detection results are obtained, they are automatically compared with the ground-truth labels by a program to obtain the detection precision and recall rate, and the harmonic mean of the two is calculated as the overall evaluation index;
and randomly selecting a plurality of images to examine the detection effect: in each image, the student information and the item it belongs to are automatically framed, together with the judgment probability.
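A sketch of this evaluation follows; treating a prediction as correct when its IoU with some ground-truth box reaches 0.5 is an assumed matching rule, since the text does not fix the threshold.

    def evaluate(num_matched, num_pred, num_gt):
        """num_matched: predictions whose IoU with a ground-truth box is >= 0.5."""
        precision = num_matched / num_pred if num_pred else 0.0
        recall = num_matched / num_gt if num_gt else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)            # harmonic mean
        return precision, recall, f1

    print(evaluate(95, 100, 98))  # approx. (0.95, 0.9694, 0.9596)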
Compared with the prior art, the invention has the following advantages and beneficial effects:
1. According to the invention, a detection algorithm with a deep network structure that learns automatically is adopted, so effective representations can be learned from the data and the detection accuracy is improved; thanks to the end-to-end design, the accuracy is higher than that of traditional manual entry, and entry errors are avoided at the same time; the method has high detection accuracy and strong robustness, and can effectively detect the student information of the printed to-be-filled items and the handwriting at the head of the test paper.
Drawings
FIG. 1 is a flow chart of the detection method of the present invention;
FIG. 2 is a data acquisition and processing flow diagram of the present invention;
FIG. 3 is a flow chart of the data synthesis of the present invention;
FIG. 4 is a diagram of the deep convolutional neural network of the present invention;
FIG. 5 is an example of the test results of the present invention.
Detailed Description
The present invention will be described in further detail with reference to examples and drawings, but the present invention is not limited thereto.
Examples
As shown in FIG. 1, the method for automatically detecting student information at the head of a test paper based on deep learning of the invention comprises the following steps:
S1, acquiring data: as shown in fig. 2, the front surfaces of a plurality of student test papers are scanned with a scanner to obtain full test-paper images; during scanning, the pages are kept free of curling and folding and the test paper is centred. The full images are then cropped to obtain test-paper heads that contain all the personal information of the students, such as name, class and seat number, while excluding areas of non-student information such as the test-paper title and teacher scores.
S2, labeling data, manually labeling the image of the paper head to obtain a detection frame of student information, and dividing a training set and a test set, as shown in FIG. 2, specifically comprising the following steps:
S21, manually calibrating horizontal rectangular frames of student information with special labeling software, including the calibration of positions and categories; the labeled categories comprise two classes, namely the printed items to be filled and the specific information handwritten by students;
S22, recording the coordinates of each horizontal rectangular frame and the category it belongs to in a json file; the frame coordinates are the upper-left corner coordinates and the width and height of the rectangular frame, and each coordinate value and its category are separated by commas;
S23, randomly dividing the images into a training set (about 2500 sheets) and a test set (about 500 sheets).
S3, synthesizing data. Student information is handwritten by students in black pen, the writing is often irregular, its appearance is close to that of the printed characters of the paper head, and test-paper heads come in many styles, so detection is difficult and tens of thousands of training samples are needed to improve model performance. Synthesizing data expands the data volume and also reduces the cost of manual labeling; as shown in fig. 3, the synthesis comprises the following steps:
S31, carrying out statistical analysis of the manually labeled real data, including the aspect ratio of the test-paper head images, the aspect ratio and size of the labeling frames, the distance between labeling frames, and the like;
S32, setting parameters such as the width and height of the generated image and the text spacing according to the statistical results, and automatically generating test-paper head images that contain the items to be filled but no student information. Meanwhile, the category and coordinates of each item to be filled are stored, so that the pasted student information and its position coordinates can be conveniently determined when handwritten single-character images are pasted later;
S33, crawling student-information corpora from the Internet, including student names, classes, schools and the like, filtering out entries longer than 10 characters, and storing the entries in different json files according to the item the information belongs to, so that each json file forms a corpus of student information for a different item;
S34, downloading the Chinese handwriting data set issued by the Institute of Automation of the Chinese Academy of Sciences as the image library for subsequently pasting handwritten single-character images;
S35, for each item to be filled in the test-paper header, randomly selecting a piece of information from the corresponding item corpus. For each character of the information, the image library has a group of single-character images handwritten by different people, so one single-character image is randomly selected from the corresponding group and pasted, in sequence, to the right side of the item to be filled in the test-paper head image;
S36, performing affine transformation on the test-paper head image and adding salt-and-pepper noise, rotation, Gaussian blur and other operations;
S37, based on steps S31 to S36, 20000 images are synthesized and combined with 2500 manually labeled real images to form the training set.
S4, constructing a text detector, wherein the text detector is a two-stage text detector comprising a feature extraction network, a candidate text region generation network, a region feature sampling module and a text positioning network;
in this embodiment, the feature extraction network adopts ResNet50 and a bidirectional feature pyramid network bipfn in a residual neural network;
the ResNet50 improves the feature extraction capability and relieves the problem of network degradation through a shortcut connection mode, the BiFPN performs bottom-up and top-down fusion on the extracted features of different layers simultaneously, and a multi-channel feature map F1 is finally obtained;
the feature map F1 is then fed into the candidate text region generation network, and a candidate text region R is obtained:
in the present embodiment, as shown in fig. 4, the candidate text region generating network includes a two-classification network and a detection box regression network;
in the binary classification network, F1 is input into convolutional layer 256C, with kernel size 3 × 3 and stride 1, and a 256-channel feature map F2 is output; F2 is then input into convolutional layer 2kC, with kernel size 1 × 1, stride 1 and 2k output channels;
in the detection-frame regression network, F1 is input into convolutional layer 256C for feature extraction to obtain feature map F2, which is then input into a convolutional layer with 4k output channels to obtain 4k coordinate regression results;
each pixel of feature map F1 predefines k anchors of different sizes and aspect ratios, and k candidate regions mapped back to the original image are obtained by anchor regression, each candidate region carrying 2 classification confidences that correspond to the 2k outputs of the binary classification network.
Given the feature map F1 of the whole image and a candidate text region R, the corresponding region of F1 is divided into m × m portions, and a feature vector is sampled for each portion to obtain a local region feature map F3 of size m × m;
inputting the local region feature map F3 into a text positioning network to obtain the probability of each region belonging to the text;
in this embodiment, as shown in fig. 4, the text positioning network comprises two branches: a segmentation branch, and a detection-frame regression and classification branch; the latter comprises a detection-frame regression sub-branch and a detection-frame classification sub-branch;
in the segmentation branch, F3 is input into a fully convolutional network to obtain a text segmentation map Mask of the input image, distinguishing text pixels from background pixels at the pixel level;
in the detection-frame regression sub-branch, F3 is input into a fully connected layer, and the candidate text region R is regressed to obtain the detection frame of the text;
in the detection-frame classification sub-branch, F3 is input into a fully connected layer, the region inside the detection frame is classified, and the probability that the region belongs to text is output.
In this embodiment, for the segmentation branch of the text positioning network, the Dice loss is adopted:

L_mask = 1 - 2|X ∩ Y| / (|X| + |Y|)

wherein X is the predicted segmentation map and Y is the ground-truth segmentation map;

for the detection branches of the text positioning network and the candidate text region generation network, an IoU loss is adopted:

L_box = 1 - IoU, with IoU = |D ∩ G| / |D ∪ G|

wherein D is the detection frame and G is the ground-truth frame;

for the classification branches of the text positioning network and the candidate text region generation network, a binary cross-entropy loss function is used:

L_cls = -(ŷ log p + (1 - ŷ) log(1 - p))

wherein p denotes the predicted probability and ŷ denotes the ground-truth category;

the final loss function is defined as:

L = L_mask + L_box + L_cls
S5, inputting the labeled data into the text detector for training to obtain a model, specifically:
S51, in this embodiment, training-related parameters are set: model parameters are updated by stochastic gradient descent, with initial learning rate lr = 0.01, weight_decay = 0.0005, batch_size (pictures per training batch) = 8, iterations iters = 50000, learning rate update strategy step, update coefficient lambda = 0.1, and update steps 30000 and 40000. In the candidate text region generation network, the anchors are set to sizes of 32², 64², 128², 256², and 512², with aspect ratios of 1:1, 1:2, and 2:1 (this configuration is sketched in code after step S53);
S52, using a model trained on the ImageNet classification task as the pre-training model of the backbone network to initialize its parameters;
S53, training the convolutional neural network, with the feature extraction network, the candidate text region generation network and the text positioning network trained end to end, specifically:
reading pictures and labels in the training set in batches, inputting the pictures into the text detector to obtain prediction results, calculating the loss between the predictions and the labels, reducing the loss by gradient descent, updating the network parameters of the feature extraction network, the candidate text region generation network and the text positioning network, and iteratively training the text detector to find the optimal parameters.
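The schedule of step S51 maps naturally onto PyTorch's SGD optimizer with a MultiStepLR scheduler, sketched below; the momentum value and the detector and train_loader objects are assumptions.

    import torch
    from itertools import cycle

    optimizer = torch.optim.SGD(detector.parameters(), lr=0.01,
                                momentum=0.9, weight_decay=0.0005)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[30000, 40000], gamma=0.1)   # lambda = 0.1

    loader = cycle(train_loader)                           # batch_size = 8
    for it in range(50000):                                # iters = 50000
        images, targets = next(loader)
        loss = detector(images, targets)                   # L_mask + L_box + L_cls
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()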
S6, testing the network, specifically including:
S61, inputting the pictures in the test set into the trained model for forward inference;
S62, after the detection results are obtained, automatically comparing them with the ground-truth labels by a program to obtain the detection precision and recall rate, and calculating the harmonic mean of the two as the overall evaluation index;
and S63, randomly selecting 30 images to examine the detection effect: in each image, the student information and the item it belongs to are automatically framed, together with the judgment probability.
As shown in fig. 5, the detection result of a 4680 × 403 test-paper head image is presented, in which the student information and the items it belongs to are framed, with the judgment probability at the upper-left corner of each frame.
It should also be noted that in this specification, terms such as "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention. Thus, the present invention is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A method for automatically detecting student information at the head of a test paper based on deep learning, characterized by comprising the following steps:
S1, acquiring data, namely scanning the front surfaces of a plurality of student test papers by using a scanner to obtain a plurality of full test-paper images, and cropping the head position of each test-paper image to obtain a plurality of test-paper head images;
S2, labeling data, manually labeling the paper-head images to obtain detection frames of student information, and dividing a training set and a test set;
S3, synthesizing data, and expanding the data volume through synthesized data;
S4, constructing a text detector, wherein the text detector is constructed with convolutional neural networks, comprises a feature extraction network, a candidate text region generation network, a region feature sampling module and a text positioning network, and a different loss function is designed for each component network;
S5, training the text detector, adopting a pre-training model, setting training-related parameters, and inputting the labeled data into the text detector for training;
and S6, inputting the test data into the trained text detector for detection to obtain the detection results and probabilities of the student information.
2. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 1, wherein the step S2 specifically includes:
marking software is adopted to manually mark a horizontal rectangular frame of student information, including the marking of positions and categories;
recording the coordinates of the upper left corner of the horizontal rectangular frame and the width and height data in a file;
the images are randomly divided into a training set and a test set.
3. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 1, wherein the step S3 specifically includes:
S31, carrying out statistical analysis of the manually labeled real data, including the aspect ratio of the test-paper head images, the aspect ratio and size of the labeling frames, and the distance between labeling frames;
S32, setting the width and height of the generated image and the text spacing according to the statistical results, automatically generating a test-paper head image that contains the items to be filled but no student information, and meanwhile storing the category and coordinates of each item to be filled;
S33, crawling student-information corpora from the Internet, including student names, classes and schools, filtering out entries longer than 10 characters, and storing the entries in different json files according to the item the information belongs to, so that each json file forms a corpus of student information for a different item;
S34, downloading a Chinese handwriting data set as the image library for subsequently pasting handwritten single-character images;
S35, for each item to be filled at the head of the test paper, randomly selecting a piece of information from the corresponding item corpus; for each character of that information, the image library contains a group of single-character images handwritten by different people, one of which is randomly selected and pasted, in sequence, to the right side of the item to be filled in the test-paper head image;
S36, performing affine transformation on the test-paper head image and adding salt-and-pepper noise, rotation and Gaussian blur;
and S37, synthesizing a plurality of images based on steps S31 to S36 and combining them with the manually labeled real data to form a training set.
4. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 1, wherein the feature extraction network specifically comprises:
the feature extraction network adopts ResNet50 and a bidirectional feature pyramid network BiFPN in a residual neural network, and the ResNet50 improves the feature extraction capability and relieves the network degradation problem through a shortcut connection mode;
and the BiFPN performs bottom-up and top-down fusion on the extracted features of different layers simultaneously to finally obtain a multi-channel feature map F1.
5. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 4, wherein the candidate text region generation network specifically comprises:
inputting the multi-channel feature map F1 into the candidate text region generation network to obtain a candidate text region R;
the candidate text region generation network comprises a two-classification network and a detection frame regression network;
in the binary classification network, F1 is input into convolutional layer 256C, with kernel size 3 × 3 and stride 1, and a 256-channel feature map F2 is output; F2 is then input into convolutional layer 2kC, with kernel size 1 × 1, stride 1 and 2k output channels;
in the detection-frame regression network, F1 is input into convolutional layer 256C for feature extraction to obtain feature map F2, which is then input into a convolutional layer with 4k output channels to obtain 4k coordinate regression results;
each pixel of feature map F1 predefines k anchors of different sizes and aspect ratios, and k candidate regions mapped back to the original image are obtained by anchor regression, each candidate region carrying 2 classification confidences that correspond to the 2k outputs of the binary classification network.
6. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 5, wherein the region feature sampling module specifically comprises:
given a feature map F1 of the whole map and a candidate text region R, the corresponding region of F1 is divided into m × m portions, and a feature vector is sampled for each portion to obtain a local region feature map F3 of m × m size.
7. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 6, wherein the text positioning network specifically comprises:
inputting the local region feature map F3 into a text positioning network to obtain the probability of each region belonging to the text;
the text positioning network comprises two branches: a segmentation branch, and a detection-frame regression and classification branch; the latter comprises a detection-frame regression sub-branch and a detection-frame classification sub-branch;
in the segmentation branch, F3 is input into a fully convolutional network to obtain a text segmentation map Mask of the input image, distinguishing text pixels from background pixels at the pixel level;
in the detection-frame regression sub-branch, F3 is input into a fully connected layer, and the candidate text region R is regressed to obtain the detection frame of the text;
in the detection-frame classification sub-branch, F3 is input into a fully connected layer, the region inside the detection frame is classified, and the probability that the region belongs to text is output.
8. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 7, wherein the loss function is specifically:
for the segmentation branch of the text positioning network, the Dice loss is adopted, specifically:

L_mask = 1 - 2|X ∩ Y| / (|X| + |Y|)

wherein X is the predicted segmentation map and Y is the ground-truth segmentation map;

for the detection branches of the text positioning network and the candidate text region generation network, an IoU loss is adopted, specifically:

L_box = 1 - IoU, with IoU = |D ∩ G| / |D ∪ G|

wherein D is the detection frame and G is the ground-truth frame;

for the classification branches of the text positioning network and the candidate text region generation network, a binary cross-entropy loss function is adopted, specifically:

L_cls = -(ŷ log p + (1 - ŷ) log(1 - p))

wherein p denotes the predicted probability and ŷ denotes the ground-truth category;

the final loss function is defined as:

L = L_mask + L_box + L_cls
9. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 1, wherein the step S5 is specifically:
taking a model trained by an Imagenet classification task as a pre-training model of the feature extraction network, and initializing parameters;
setting training-related parameters: model parameters are updated by stochastic gradient descent, with an initial learning rate lr, weight decay weight_decay, number of pictures per training batch batch_size, number of iterations iters, learning rate update strategy step, update coefficient lambda, and update step stepsize;
in the candidate text region generation network, the anchors are set to sizes of 32², 64², 128², 256², and 512², with aspect ratios of 1:1, 1:2, and 2:1;
and training the text detector: pictures and labels in the training set are read in batches, the pictures are input into the text detector to obtain prediction results, the loss between the predictions and the labels is calculated and reduced by gradient descent, the network parameters of the feature extraction network, the candidate text region generation network and the text positioning network are updated, and the text detector is iteratively trained to find the optimal parameters.
10. The method for automatically detecting student information at the head of a test paper based on deep learning according to claim 1, wherein the step S6 specifically includes:
inputting the pictures in the test set into the trained text detector for forward inference;
after the detection results are obtained, they are automatically compared with the ground-truth labels by a program to obtain the detection precision and recall rate, and the harmonic mean of the two is calculated as the overall evaluation index;
and randomly selecting a plurality of images to examine the detection effect: in each image, the student information and the item it belongs to are automatically framed, together with the judgment probability.
CN202110388294.XA (filed 2021-04-12, priority date 2021-04-12): Test paper head student information automatic detection method based on deep learning. Granted as CN113076900B. Status: Active.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110388294.XA CN113076900B (en) 2021-04-12 2021-04-12 Test paper head student information automatic detection method based on deep learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110388294.XA CN113076900B (en) 2021-04-12 2021-04-12 Test paper head student information automatic detection method based on deep learning

Publications (2)

Publication Number Publication Date
CN113076900A 2021-07-06
CN113076900B 2022-06-14

Family

ID=76617428

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110388294.XA Active CN113076900B (en) 2021-04-12 2021-04-12 Test paper head student information automatic detection method based on deep learning

Country Status (1)

Country Link
CN (1) CN113076900B (en)

Patent Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103853852A (en) * 2014-03-31 2014-06-11 广州视源电子科技股份有限公司 Electronic test paper importing method
JP2018067219A (en) * 2016-10-21 2018-04-26 株式会社森山商会 Score input device, program thereof, and computer readable recording medium recording program thereof
US20200090539A1 (en) * 2018-08-13 2020-03-19 Hangzhou Dana Technology Inc. Method and system for intelligent identification and correction of questions
CN110751232A (en) * 2019-11-04 2020-02-04 哈尔滨理工大学 Chinese complex scene text detection and identification method
CN111539309A (en) * 2020-04-21 2020-08-14 广州云从鼎望科技有限公司 Data processing method, system, platform, equipment and medium based on OCR
CN111553423A (en) * 2020-04-29 2020-08-18 河北地质大学 Handwriting recognition method based on deep convolutional neural network image processing technology
CN111753828A (en) * 2020-05-19 2020-10-09 重庆邮电大学 Natural scene horizontal character detection method based on deep convolutional neural network
CN111898411A (en) * 2020-06-16 2020-11-06 华南理工大学 Text image labeling system, method, computer device and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Xiangle Chen et al.: "Radical aggregation network for few-shot offline handwritten Chinese character recognition", Pattern Recognition Letters *

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113343990A (en) * 2021-07-28 2021-09-03 浩鲸云计算科技股份有限公司 Key text detection and classification training method for certificate pictures
CN113343990B (en) * 2021-07-28 2021-12-03 浩鲸云计算科技股份有限公司 Key text detection and classification training method for certificate pictures
CN113780087A (en) * 2021-08-11 2021-12-10 同济大学 Postal parcel text detection method and equipment based on deep learning
CN113780087B (en) * 2021-08-11 2024-04-26 同济大学 Postal package text detection method and equipment based on deep learning
CN114708127A (en) * 2022-04-15 2022-07-05 广东南粤科教研究院 Student point system comprehensive assessment method and system
CN115565190A (en) * 2022-11-17 2023-01-03 江西风向标智能科技有限公司 Test paper layout analysis method, system, computer and readable storage medium
CN116128954A (en) * 2022-12-30 2023-05-16 上海强仝智能科技有限公司 Commodity layout identification method, device and storage medium based on generation network
CN116128954B (en) * 2022-12-30 2023-12-05 上海强仝智能科技有限公司 Commodity layout identification method, device and storage medium based on generation network

Also Published As

Publication number Publication date
CN113076900B (en) 2022-06-14

Similar Documents

Publication Publication Date Title
CN113076900B (en) Test paper head student information automatic detection method based on deep learning
CN111325203B (en) American license plate recognition method and system based on image correction
CN107403130A (en) A kind of character identifying method and character recognition device
CN108537269B (en) Weak interactive object detection deep learning method and system thereof
CN103488711B (en) A kind of method and system of quick Fabrication vector font library
CN110287960A (en) The detection recognition method of curve text in natural scene image
CN109726628A (en) A kind of recognition methods and system of form image
CN111062885A (en) Mark detection model training and mark detection method based on multi-stage transfer learning
CN110163208B (en) Scene character detection method and system based on deep learning
CN105893968A (en) Text-independent end-to-end handwriting recognition method based on deep learning
CN112528862B (en) Remote sensing image target detection method based on improved cross entropy loss function
CN111626292B (en) Text recognition method of building indication mark based on deep learning technology
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN113673338A (en) Natural scene text image character pixel weak supervision automatic labeling method, system and medium
CN110414616A (en) A kind of remote sensing images dictionary learning classification method using spatial relationship
CN109598185A (en) Image recognition interpretation method, device, equipment and readable storage medium storing program for executing
CN111563563B (en) Method for enhancing combined data of handwriting recognition
CN112232371A (en) American license plate recognition method based on YOLOv3 and text recognition
CN111666937A (en) Method and system for recognizing text in image
CN111507353B (en) Chinese field detection method and system based on character recognition
CN113378812A (en) Digital dial plate identification method based on Mask R-CNN and CRNN
CN114119949A (en) Method and system for generating enhanced text synthetic image
CN108052936B (en) Automatic inclination correction method and system for Braille image
CN110443235B (en) Intelligent paper test paper total score identification method and system
JPH08508128A (en) Image classification method and apparatus using distribution map

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant