WO2020259060A1

WO2020259060A1 - Test paper information extraction method and system, and computer-readable storage medium

Info

Publication number: WO2020259060A1
Application number: PCT/CN2020/087211
Authority: WO
Inventors: 曾志辉; 欧阳一村; 许文龙; 贺涛; 邢军华
Original assignee: 深圳中兴网信科技有限公司
Priority date: 2019-06-26
Filing date: 2020-04-27
Publication date: 2020-12-30
Also published as: CN110414529A

Abstract

Disclosed are a test paper information extraction method and system, and a storage medium. The test paper information extraction method comprises: preprocessing a test paper image to obtain a binary image (S102); determining a layout area of the binary image (S104); acquiring text lines of the test paper image according to the layout area (S106); extracting a text image according to the text lines (S108); inputting the text image into a character recognition model to obtain text information of the test paper image (S110); correspondingly combining the text information and the text lines to obtain a target test paper image (S112); and extracting test paper information of the target test paper image according to a classification tag (S114).

Description

Test paper information extraction method, system and computer readable storage medium

This application claims priority to Chinese patent application No. 201910559124.6, filed with the Chinese Patent Office on June 26, 2019, the entire content of which is incorporated herein by reference.

Technical field

This application relates to the field of electronic teaching technology, for example, to a method, system and computer-readable storage medium for extracting test paper information.

Background technique

With the development of computer and Internet technology, more and more people use automated equipment to mark students' examination papers. In related technologies, automatic scoring methods can usually analyze only test papers that follow a fixed template: the test paper is matched against the templates stored in the system, and the matched template is used for analysis. In practice, however, the layout and type of many real test papers do not match any fixed template. It is therefore necessary to provide a solution that can accurately identify and automatically analyze any test paper (regular test paper, general answer sheet, special answer sheet, etc.) to meet the growing demand for electronic marking.

Summary of the invention

This application at least solves the above-mentioned technical problems existing in related technologies.

This application proposes a method for extracting test paper information, including: preprocessing a test paper image to obtain a binary image; determining the layout area of the binary image; obtaining text lines of the test paper image according to the layout area; extracting text images according to the text lines; inputting the text images into a text recognition model to obtain the text information of the test paper image; merging the text information with the corresponding text lines to obtain a target test paper image; and extracting the test paper information of the target test paper image according to classification labels.

This application proposes a test paper information extraction system, which includes a memory, a processor, and a computer program stored in the memory and capable of running on the processor. The processor implements the test paper information extraction method of any of the above technical solutions when the processor executes the computer program.

This application proposes a computer-readable storage medium storing a computer program; when the computer program is executed by a processor, the test paper information extraction method of any of the above technical solutions is implemented.

Description of the drawings

FIG. 1 shows a schematic flowchart of a method for extracting test paper information according to an embodiment of the present application;

FIG. 2 shows a schematic flowchart of a method for extracting test paper information according to another embodiment of the present application;

FIG. 3 shows a schematic flowchart of a method for extracting test paper information according to another embodiment of the present application;

FIG. 4 shows a schematic flowchart of a method for extracting test paper information according to another embodiment of the present application;

FIG. 5 shows a test paper image of an embodiment of the present application;

FIG. 6 shows an analysis result image of the test paper layout area of an embodiment of the present application;

FIG. 7 shows a layout area of FIG. 5;

FIG. 8 shows another layout area of FIG. 5;

FIG. 9 shows the text line detection result image of FIG. 7;

FIG. 10 shows a text information extraction result image of an embodiment of the present application;

FIG. 11 shows a schematic diagram of constructing a text recognition model according to an embodiment of the present application;

FIG. 12 shows a schematic block diagram of a test paper information extraction system according to an embodiment of the present application.

Detailed description

The application will be described below with reference to the drawings and specific implementations.

In the following description, many details are set forth so that this application can be fully understood. However, this application can also be implemented in ways other than those described here. Therefore, the scope of protection of this application is not limited to the specific embodiments disclosed below.

An embodiment of the present application proposes a method for extracting test paper information. FIG. 1 shows a schematic flowchart of the method according to an embodiment of the present application. The method includes:

S102: Preprocess the test paper image to obtain a binary image;

S104: Determine the layout area of the binary image;

S106: Obtain the text line of the test paper image according to the layout area;

S108: Extract a text image according to the text line;

S110: Input the text image into the text recognition model to obtain the text information of the test paper image;

S112: Merge the text information with the corresponding text line to obtain the target test paper image;

S114: Extract test paper information of the target test paper image according to the classification label.

The test paper information extraction method provided in this application combines image processing algorithms, natural language processing algorithms, and deep learning neural network models. The test paper image is preprocessed to obtain a binary image, and the binary image is analyzed to determine its layout areas, that is, to obtain the typesetting information of the test paper. The text lines of each layout area are detected and traversed; for each text line, the largest circumscribed rectangular area is cropped to obtain the corresponding text image, and the text image is input into the text recognition (Optical Character Recognition, OCR) model to recognize the text information of the test paper image. The text information is then merged with the corresponding text lines to obtain a target test paper image carrying the recognized text information. Finally, the test paper information in the target test paper image, for example candidate information and test question information, is extracted according to the different classification labels, and all the test paper information is output.

With this method, the typesetting information of a test paper can be identified automatically. Even when test papers differ in layout and type, the test paper image can be accurately identified and automatically analyzed to obtain the test paper information. This enables efficient and accurate automatic scoring and widens the scope of application of the system. The identified test paper information and typesetting information can also be uploaded to a database to build a knowledge system, which helps educators compose test papers automatically, thereby reducing their workload and meeting a variety of user needs.

In an embodiment, the preprocessing is binarization. The binary image can also be smoothed and tilt-corrected according to actual needs. The tilt correction includes: projecting the binary image so that the edge positions of the binary image generate corresponding marks on the projected image; determining the position of the tilted image according to the marks; and rotating the tilted image according to the angle between its edge and the standard horizontal or vertical direction to correct the image. A text line is a rectangular frame containing text information, identified in the binary image using the contour-finding function (findContours function) of the computer vision library OpenCV.
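By way of illustration only, the binarization step can be sketched as follows. The sketch assumes the scanned page is available as a grayscale matrix with values from 0 to 255 and that a single global threshold is adequate; the application does not specify the thresholding method, and the function name `binarize` and the threshold value 128 are hypothetical.

```python
def binarize(gray, threshold=128):
    """Return a binary image: 1 for dark (ink) pixels, 0 for light background.

    Assumption: a fixed global threshold; the application does not say
    which binarization method is used.
    """
    return [[1 if px < threshold else 0 for px in row] for row in gray]


# A tiny 3x4 grayscale patch: dark strokes on a light background.
gray = [
    [250, 240, 30, 245],
    [255, 20, 25, 250],
    [248, 252, 251, 249],
]
binary = binarize(gray)
```

The later sketches in this description reuse the same convention (1 for ink, 0 for background), which is likewise an assumption.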

FIG. 2 shows a schematic flowchart of a method for extracting test paper information according to another embodiment of the present application. The method includes:

S202: Preprocess the test paper image to obtain a binary image;

S204: Determine a sub-image of the binary image according to the first preset size;

S206: Detect lines of sub-images;

S208: Select, as the binding line, a line of the sub-image whose length falls within the preset length range and whose two end areas are blank;

In this embodiment, the areas at the two ends of a line in the sub-image are the area between the first end of the line and the first edge of the sub-image, and the area between the second end of the line and the second edge of the sub-image.

S210: Determine the text area of the binary image according to the binding line (gutter);

S212: Determine the central area of the text area according to the second preset size;

S214: Determine whether a separator is detected in the central area; if a separator is detected, go to S216; otherwise, go to S218;

S216: Determine the layout area according to the separator, and go to S224;

S218: Determine the divided areas of the text area according to the third preset size;

S220: Determine whether a separator is detected in the divided areas; if a separator is detected, go to S216; otherwise, go to S222;

S222: Use the text area as a layout area;

S224: Obtain the text line of the test paper image according to the layout area;

S226: Extract a text image according to the text line;

S228: Input the text image into the text recognition model to obtain the text information of the test paper image;

S230: Correspondingly merge the text information and the text line to obtain the target test paper image;

S232: Extract test paper information of the target test paper image according to the classification label.

In this embodiment, a sub-image is cut from one side of the binary image according to the first preset size, all lines in the sub-image are detected by a straight-line detection algorithm, and the detected lines are traversed. If the length of a line falls within the preset length range and the areas at both ends of the line are blank, that line is taken as the binding line (also referred to as the gutter). If no line meets these conditions, a sub-image is cut from the other side of the binary image according to the first preset size and binding line detection is performed again. If a binding line exists in the sub-image, the text area of the binary image is determined according to it. The first preset size and the preset length range can be set reasonably according to the layout parameters of actual test papers. With this technical solution, the binding line of the test paper image can be accurately identified, which facilitates further analysis and recognition of the layout areas of the test paper.
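The selection rule for the binding line can be sketched as below. This is a minimal illustration under several assumptions that are not stated in the application: lines have already been detected and are given as vertical segments `(x, y_top, y_bottom)` with `y_bottom` exclusive, the sub-image is a matrix with 1 for ink and 0 for background, and "blank at both ends" is checked in a narrow band of columns around the line. The names `is_blank`, `find_binding_line`, and `margin` are hypothetical.

```python
def is_blank(binary, x0, x1, y0, y1):
    """True if no ink pixel (value 1) lies in columns x0..x1-1, rows y0..y1-1."""
    return all(px == 0 for row in binary[y0:y1] for px in row[x0:x1])


def find_binding_line(binary, lines, min_len, max_len, margin=1):
    """Return the first line (x, y_top, y_bottom) that qualifies, else None.

    A line qualifies when its length is within [min_len, max_len] and the
    areas between its two ends and the sub-image edges contain no ink.
    """
    height, width = len(binary), len(binary[0])
    for x, y_top, y_bottom in lines:
        if not (min_len <= y_bottom - y_top <= max_len):
            continue
        x0, x1 = max(0, x - margin), min(width, x + margin + 1)
        if is_blank(binary, x0, x1, 0, y_top) and is_blank(binary, x0, x1, y_bottom, height):
            return (x, y_top, y_bottom)
    return None


# 10x5 sub-image: a long stroke in column 2 and a short stroke in column 4.
binary = [[0] * 5 for _ in range(10)]
for y in range(1, 9):
    binary[y][2] = 1
for y in range(3, 6):
    binary[y][4] = 1

binding_line = find_binding_line(binary, [(4, 3, 6), (2, 1, 9)], min_len=6, max_len=10)
```

If no line qualifies, the function returns `None`, which corresponds to repeating the search on the sub-image from the other side of the page.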

If no binding line exists in the sub-image, the binary image itself is used as the text area. With the center of the text area as the axis, the central area of the text area is determined according to the second preset size, and separator detection is performed in the central area. If a separator is detected, the text area is divided by the separator to obtain the layout areas. To avoid misjudgments caused by different test paper layouts, separator detection can be performed again on each divided part of the text area; if a separator is found there, that part is further divided by it to obtain more accurate layout areas. If no separator is detected in the central area, the text area is divided according to the third preset size to obtain at least two divided areas, and separator detection is performed in each of them. If a separator is detected in a divided area, the text area is divided by the separator to obtain the layout areas; if not, the text area as a whole is taken as the layout area. With the above embodiments, the layout areas can be accurately identified even when test papers differ in layout and type, and the probability of misjudgment is reduced. The test paper image can thus be accurately identified and automatically analyzed to obtain the test paper information, which enables efficient and accurate automatic scoring and widens the scope of application of the system.

In one embodiment, if there is a left gutter, the image on the right side of the gutter is taken as the text area; if there is a right gutter, the image on the left side of the gutter is taken as the text area.

In an embodiment of the present application, optionally, detecting the separator includes: performing projection processing on the central area or the divided area to obtain a blank area of the binary image; and, when the width of the blank area is greater than a width threshold, taking the blank area as the separator.

In this embodiment, projection processing is performed on the central area or the divided area: the number of 0-valued pixels is counted along the vertical direction to obtain a projection result array, and the blank area of the binary image is determined from this array. If the width of the blank area is greater than the width threshold, the blank area is taken as the separator, and the text area is divided by the separator to obtain the layout areas. This makes it convenient to recognize the text information of the test paper image by layout area and enables efficient and accurate automatic scoring. The width threshold can be set reasonably according to the parameters of conventional test paper layouts.
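The projection-based detection can be sketched as follows. The sketch assumes that 0 denotes background and 1 denotes ink, so that a column is blank when its count of 0-valued pixels equals the image height; the application does not state the pixel convention, and the function names are hypothetical.

```python
def column_zero_counts(binary):
    """Vertical projection: number of 0-valued (background) pixels per column."""
    height = len(binary)
    return [sum(1 for y in range(height) if binary[y][x] == 0)
            for x in range(len(binary[0]))]


def find_blank_separators(binary, width_threshold):
    """Return (start, end) column ranges, end exclusive, that are entirely
    blank and wider than width_threshold."""
    height = len(binary)
    counts = column_zero_counts(binary)
    separators, start = [], None
    # The trailing -1 is a sentinel that closes a blank run at the right edge.
    for x, count in enumerate(counts + [-1]):
        blank = count == height
        if blank and start is None:
            start = x
        elif not blank and start is not None:
            if x - start > width_threshold:
                separators.append((start, x))
            start = None
    return separators


# Two text columns separated by a wide gap (columns 4 to 7) and one
# narrow gap (column 2) that should not count as a separator.
page = [
    [1, 1, 0, 1, 0, 0, 0, 0, 1, 1],
    [1, 0, 0, 1, 0, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 0, 0, 0, 1, 1],
]
separators = find_blank_separators(page, width_threshold=2)
```

Choosing the threshold larger than the usual gap between characters keeps ordinary spacing from being mistaken for a column separator.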

In an embodiment of the present application, optionally, detecting the separator includes: performing blurring and/or denoising on the central area or the divided area and detecting the lines of the binary image; filtering the lines of the binary image according to a preset angle range and a preset length threshold to obtain a target line; and taking the target line as the separator when the length of the target line is greater than a first preset length, or when the length of the target line is greater than a second preset length and the sum of the length of the target line and the widths of the title area and the blank area at its two ends is greater than the first preset length.

In this embodiment, the widths of the title area and the blank area at the two ends of the target line are, respectively, the width of the title area between the first end of the target line and the first edge of the binary image, and the width of the blank area between the second end of the target line and the second edge of the binary image.

In this embodiment, blurring and/or denoising is performed on the central area or the divided area, the lines in the binary image are detected, and all detected lines are filtered according to the preset angle range and the length threshold to obtain the target line. If the length of the target line is greater than the first preset length, or if the length of the target line is greater than the second preset length and the sum of the length of the target line and the widths of the title area and the blank area at its two ends is greater than the first preset length, the target line is taken as the separator. The text area can then be divided by the separator to obtain the layout areas, which makes it convenient to recognize the text information of the test paper image by layout area and enables efficient and accurate automatic scoring.

In this embodiment, the first preset length and the second preset length can be rationally set according to the parameters of the conventional test paper layout.
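The filtering and decision rule for line separators can be sketched as follows. The sketch assumes the lines have already been detected and are given as endpoint tuples `(x1, y1, x2, y2)`. The default angle range of 70 to 110 degrees and minimum length of 50 are borrowed from the worked example later in this description; the function names, and the values 800 and 667 used for the first and second preset lengths, are hypothetical.

```python
import math


def line_angle_and_length(x1, y1, x2, y2):
    """Inclination in degrees, normalised to [0, 180), and Euclidean length."""
    angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
    return angle, math.hypot(x2 - x1, y2 - y1)


def filter_lines(lines, angle_range=(70.0, 110.0), min_length=50.0):
    """Keep lines whose inclination is inside angle_range and which are
    at least min_length long."""
    kept = []
    for line in lines:
        angle, length = line_angle_and_length(*line)
        if angle_range[0] <= angle <= angle_range[1] and length >= min_length:
            kept.append(line)
    return kept


def is_line_separator(line_len, title_width, blank_width, first_len, second_len):
    """Apply the two alternative conditions of this embodiment."""
    if line_len > first_len:
        return True
    return (line_len > second_len
            and line_len + title_width + blank_width > first_len)


lines = [
    (10, 0, 10, 100),   # vertical, long: kept
    (0, 5, 200, 5),     # horizontal: rejected by the angle range
    (3, 0, 3, 20),      # vertical but too short: rejected
    (10, 100, 10, 0),   # vertical, drawn bottom to top: kept
]
targets = filter_lines(lines)

# Hypothetical preset lengths for an area 1000 pixels tall.
long_line = is_line_separator(850, 0, 0, first_len=800, second_len=667)
with_title = is_line_separator(700, 60, 80, first_len=800, second_len=667)
too_short = is_line_separator(600, 100, 200, first_len=800, second_len=667)
```

The second condition lets a line that stops short of the page edges still count as a separator when the remainder of its extent is taken up by a title and blank space.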

In one embodiment, the Hough transform function (hough lines function) of the computer vision library (opencv) is used to identify the lines of the binary image.

FIG. 3 shows a schematic flowchart of a method for extracting test paper information according to another embodiment of the present application. The method includes:

S302: Preprocess the test paper image to obtain a binary image;

S304: Determine the layout area of the binary image;

S306: Identify the rectangular boxes in the layout area;

S308: Filter the widths of all rectangular boxes according to the preset width range to obtain multiple target widths;

S310: Count the number of rectangular frames corresponding to each target width among the multiple target widths;

S312: Select the target width corresponding to the largest number of rectangular boxes as the text line width;

S314: Determine a text box according to the width of the text line;

S316: Determine whether the current text box and the previous text box meet the preset condition; if they do, go to S318; if they do not, go to S320;

S318: Merge the current text box and the previous text box to obtain one text line;

S320: Take the current text box and the previous text box each as a separate text line;

S322: Extract a text image according to the text line;

S324: Input the text image into the text recognition model to obtain the text information of the test paper image;

S326: Merge the text information with the corresponding text line to obtain the target test paper image;

S328: Extract test paper information of the target test paper image according to the classification label.

In one embodiment, the preset condition is that the vertical distance between the center point of the current text box and the center point of the previous text box is less than a first distance threshold, and the horizontal distance between the two center points is less than a second distance threshold.

In this embodiment, the outer edge contours in the layout area are identified, and the largest circumscribed rectangle of each contour is taken as a rectangular box. The widths of all detected rectangular boxes are obtained and filtered according to the preset width range to obtain multiple target widths. The number of rectangular boxes corresponding to each target width is counted, and the target width with the largest number of rectangular boxes is selected as the text line width. Text boxes are determined according to the text line width, and all text boxes are traversed. If the vertical distance between the center point of the current text box and that of the previous text box is less than the first distance threshold, and the horizontal distance between the two center points is less than the second distance threshold, the center points of the two text boxes lie almost on a straight line; the current text box and the previous text box are then merged into one text line. The text information of the test paper image can thus be extracted from the text lines to achieve accurate automatic scoring. Educators can also build a knowledge system from the recognized text information, which helps them compose test papers automatically, reducing their workload and meeting a variety of user needs. The first distance threshold and the second distance threshold are the allowable distance errors between text boxes and can be set reasonably according to typesetting experience.
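The width statistics and the merging rule can be sketched as follows. The sketch assumes boxes are given as `(x, y, w, h)` tuples in reading order, and that a merged text line is the bounding rectangle of its boxes and is itself compared with the next box; the application does not spell out either point. All function names and the sample thresholds are hypothetical.

```python
from collections import Counter


def text_line_width(widths, min_width, max_width):
    """Most frequent width among those inside [min_width, max_width]."""
    targets = [w for w in widths if min_width <= w <= max_width]
    if not targets:
        return None
    return Counter(targets).most_common(1)[0][0]


def center(box):
    x, y, w, h = box
    return (x + w / 2, y + h / 2)


def union(a, b):
    """Bounding rectangle of two boxes."""
    x0, y0 = min(a[0], b[0]), min(a[1], b[1])
    x1 = max(a[0] + a[2], b[0] + b[2])
    y1 = max(a[1] + a[3], b[1] + b[3])
    return (x0, y0, x1 - x0, y1 - y0)


def merge_text_boxes(boxes, v_thresh, h_thresh):
    """Merge each box into the previous line when both centre distances
    are below their thresholds; otherwise start a new text line."""
    lines = []
    for box in boxes:
        if lines:
            (px, py), (cx, cy) = center(lines[-1]), center(box)
            if abs(cy - py) < v_thresh and abs(cx - px) < h_thresh:
                lines[-1] = union(lines[-1], box)
                continue
        lines.append(box)
    return lines


line_width = text_line_width([10, 11, 10, 3, 10, 11, 40], min_width=8, max_width=20)

boxes = [(0, 0, 40, 10), (45, 1, 40, 10), (0, 30, 60, 10)]
text_lines = merge_text_boxes(boxes, v_thresh=5, h_thresh=60)
```

Here the first two boxes sit on the same baseline and are merged, while the third starts a new text line because its centre lies well below the others.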

In an embodiment of the present application, optionally, before the text line width is determined according to the widths of the rectangular boxes, the method further includes: merging the current rectangular box and the previous rectangular box when the vertical distance between the center point of the current rectangular box and the center point of the previous rectangular box is less than a third distance threshold and the horizontal distance between the two center points is less than a fourth distance threshold.

In this embodiment, if the vertical distance between the center point of the current rectangular box and the center point of the previous rectangular box is less than the third distance threshold, and the horizontal distance between the two center points is less than the fourth distance threshold, the center points of the two rectangular boxes lie almost on a straight line and are close to each other. The current rectangular box and the previous rectangular box are then merged, which reduces the number of valid rectangular boxes, reduces the amount of computation in the text line recognition process, and improves the efficiency of extracting text information. The third distance threshold and the fourth distance threshold are the allowable distance errors between rectangular boxes and can be set reasonably according to typesetting experience.

In an embodiment of the present application, optionally, before the text image is input into the text recognition model, the method further includes: obtaining text data and character data; encoding the text data and the character data to obtain a recognition dictionary; determining a text image set according to the text data; and constructing the text recognition model based on the recognition dictionary and the text image set.

In this embodiment, the text data is obtained and duplicate characters in it are removed. Each character in the text data and the character data is encoded, starting from 1, to obtain the recognition dictionary. An image of each character in the text data is generated according to the text data to obtain the text image set. The text recognition model is then constructed from the recognition dictionary and the text image set, which makes it convenient to use natural language processing technology to extract multiple types of text information from the test paper with higher accuracy and speed, improving practicability.
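Building the recognition dictionary can be sketched as follows. The sketch assumes that the text data and character data are plain strings and that codes are assigned in order of first appearance; the function names and sample strings are hypothetical. Index 0 is left unused, as the encoding starts from 1; reserving it for a padding or blank symbol would be a further assumption.

```python
def build_recognition_dictionary(text_data, character_data):
    """Map each distinct character to an integer code, starting from 1."""
    dictionary = {}
    for ch in text_data + character_data:
        if ch not in dictionary:          # skip repeated characters
            dictionary[ch] = len(dictionary) + 1
    return dictionary


def encode(text, dictionary):
    """Translate a string into its sequence of codes."""
    return [dictionary[ch] for ch in text]


text_data = "试卷试题"          # sample corpus with a repeated character
character_data = "01AB."        # digits, letters, punctuation
dictionary = build_recognition_dictionary(text_data, character_data)
codes = encode("试题1", dictionary)
```

Each label sequence fed to the recognition model during training would be produced in this way from the text that was drawn on the corresponding image.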

In one embodiment, the intersection of a local text corpus and the "Chinese Character Coded Character Set for Information Exchange" (GB2312) is used as the text data. The character data includes but is not limited to Arabic numerals, English letters, punctuation marks, and special characters. The text-drawing function (drawtext function) of PIL (the Python Imaging Library) is used to draw the text content on a fixed-size image to obtain the image of a character. A DenseNet+CTC network (densely connected convolutional network + Connectionist Temporal Classification) is used to build the OCR model; the following convolutional neural networks can also be used to build the model:

LeNet (convolutional neural network model) + CTC;

AlexNet (Alex Deep Convolutional Neural Network Model) + CTC;

ZF (ZF network structure model) + CTC;

VGG (VGG network structure model) + CTC;

GoogleNet (Google network structure model) + CTC;

ResNet (deep residual network model) + CTC.

In one embodiment, the number of characters in the recognition dictionary is limited when the dictionary is built, for example to about 4000, which effectively reduces the size of the text recognition model and the amount of system computation.

In an embodiment of the present application, optionally, the classification labels include a title, a big question, and a small question, and extracting the test paper information of the target test paper image according to the classification labels includes: determining the title text line, the big-question text line, and the small-question text line according to the classification key characters corresponding to the title, the big question, and the small question, respectively; and extracting the test paper information according to the title text line, the big-question text line, and the small-question text line.

In this embodiment, the classification labels include a title, a big question, and a small question (sub-question), each with its own classification key characters. Taking the text line as the unit, the classification key characters are used to identify the starting position of the title, the starting position of each big question, and the starting position of each small question in the target test paper image, and thereby to determine the title text lines, big-question text lines, and small-question text lines corresponding to the classification labels. The test paper information is classified in this way: different text information is extracted from the title, big-question, and small-question text lines to obtain the corresponding test paper information, which is stored in the database in order. Using natural language processing technology to extract multiple types of text information from test papers improves the accuracy of test paper information extraction, effectively reduces the workload of educators, and meets the growing demand for electronic scoring, automatic test paper composition, and automatic question storage.
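Classification by key characters can be sketched as follows. The application does not list the key characters, so the patterns below are assumptions chosen for illustration: a big question is taken to start with a Chinese numeral followed by "、", a small question with an Arabic number followed by ".", "．" or "、", and a title line to contain one of a few keywords. The names `BIG_QUESTION`, `SMALL_QUESTION`, `TITLE_KEYWORDS`, and `classify_line` are hypothetical.

```python
import re

# Assumed key characters; the application does not specify them.
BIG_QUESTION = re.compile(r"^[一二三四五六七八九十]+、")
SMALL_QUESTION = re.compile(r"^(\d+)[.．、]")
TITLE_KEYWORDS = ("试卷", "考试", "试题")


def classify_line(text):
    """Return the classification label of one recognized text line."""
    text = text.strip()
    if BIG_QUESTION.match(text):
        return "big_question"
    if SMALL_QUESTION.match(text):
        return "small_question"
    if any(keyword in text for keyword in TITLE_KEYWORDS):
        return "title"
    return "other"


recognized_lines = [
    "2019年七年级数学期末考试",
    "一、选择题（每题3分）",
    "1. 下列各数中最小的是",
    "A. -1  B. 0",
    "二、填空题",
]
labels = [classify_line(line) for line in recognized_lines]
```

Lines labelled "other" would be attached to the most recent small question, for example as part of its stem or options, when the test paper information is assembled.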

In an embodiment, the test paper information usually consists of a title, big-question types, and small-question information. The title describes the nature of the test and the candidate information, for example that the paper is a test for a designated subject at a designated grade and stage. The big-question type describes the category of the test questions, such as multiple-choice questions, calculation questions, application questions, fill-in-the-blank questions, answer questions, essay questions, non-choice questions, experimental questions, optional questions, optional exam questions, and other question types. The small-question information can be divided into the question number, the question stem, and the score.

In an embodiment of the present application, optionally, before the test paper information is extracted according to the title text line, the big-question text line, and the small-question text line, the method further includes: performing coordinate information processing on the target test paper image; and deleting a small-question text line if its abscissa exceeds the preset coordinate range or does not satisfy the sequence number increasing rule.

In this embodiment, coordinate information processing is performed on the target test paper image to obtain the coordinates of all text lines. If the abscissa of a small-question text line exceeds the preset coordinate range, or does not satisfy the sequence number increasing rule, the small-question text line is deleted. On the one hand, this locates the position of the text information; on the other hand, the text lines are calibrated by their coordinates so that misjudged text lines are removed, improving the accuracy of test paper information extraction.
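One possible reading of this calibration step is sketched below. It assumes each candidate small-question line carries its parsed question number and the abscissa of its left edge, and it interprets the sequence number increasing rule as requiring each kept question number to be larger than the last kept one. That interpretation, the function name `filter_small_questions`, and the sample coordinate range are assumptions, not details given in the application.

```python
def filter_small_questions(candidates, x_min, x_max):
    """Drop candidates whose abscissa is outside [x_min, x_max] or whose
    question number does not increase.

    candidates: list of (number, x, text) tuples in reading order.
    """
    kept, last_number = [], 0
    for number, x, text in candidates:
        if not (x_min <= x <= x_max):
            continue                      # indented too far: likely part of a stem
        if number <= last_number:
            continue                      # numbering goes backwards: misjudged line
        kept.append((number, x, text))
        last_number = number
    return kept


candidates = [
    (1, 52, "1. ..."),
    (2, 50, "2. ..."),
    (3, 400, "3. number inside a question stem"),
    (2, 51, "2. repeated number"),
    (3, 49, "3. ..."),
]
kept = filter_small_questions(candidates, x_min=40, x_max=60)
```

A stricter variant could require consecutive numbers, at the cost of discarding everything after a single missed question.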

As shown in Figure 4, the test paper information extraction method of another embodiment of the present application includes:

S402: Perform layout analysis on the input test paper image to obtain a rectangular area of the binding line and all rectangular areas of the layout;

S404: Perform text line detection on each layout;

S406: Perform OCR recognition on the text lines of each layout, and merge the results to obtain the final test paper text;

S408: Extract the text information of the test paper from the text;

S410: Extract the serial numbers of candidate sub-questions according to the information of the big questions, and generate a list of question numbers from the serial number features;

S412: Output all information of the test paper.

The method is as follows:

1. The test paper image (img) obtained by scanning is shown in Figure 5. The layout of img is analyzed to obtain the rectangular area of the gutter (only if a gutter exists) and the rectangular areas of all layouts, as shown in Figures 6 to 8;

1.1 Check the gutter, as shown in Figure 6;

1.1.1 Take from the left side of img the sub-image part_img whose length is 1/5 of the length of img and whose width equals the width of img;

1.1.2 Use the line detection algorithm to detect all the lines in part_img, get the line set line_set, filter the lines with half the width of the img, and get the line set line_set2;

1.1.3 Sort the line set line_set2 in descending order by x coordinate;

1.1.4 Traverse the line set line_set2; if the line line_set2[i] meets the following conditions, the line is the binding line binding_line;

1) The width of the straight line is greater than 3/4 of the width of img;

2) The area between the upper end of the straight line and the upper edge of the image, and the area between the lower end of the straight line and the lower edge of the image, are both blank;

1.1.5 If no straight line meets the conditions, take from the right side of img a sub-image part_img with 1/5 of the length of img and the same width as img, and repeat steps 1.1.2, 1.1.3, and 1.1.4.

1.2 Check the layout separator;

Layout separators can be: blank areas, dashed lines, and straight lines that exceed the specified size and width.

1.2.1 Take the text area image img2 of the test paper: if there is a left gutter, take the area to the right of the gutter as img2; if there is a right gutter, take the area to the left of the gutter as img2; if there is no gutter, take the test paper image img itself as img2;

1.2.2 First analyze layouts with an even number of columns;

1.2.2.1 Take the central area image middle_img of the text area image img2 of the test paper, the length is 1/5 of the length of the text area image img2, and the width is the width of img2;

1.2.2.2 To detect the layout separator in middle_img of the central area image, the method is as follows:

1.2.2.2.1 Method of detecting blank area: Binarize the image middle_img in the central area to obtain the image binary_img, and perform vertical projection on the binary_img (count 0 in the vertical direction) to obtain the projection result array. If the width of the array is greater than The interval of the preset value, the position of the interval is the layout_line;

1.2.2.2.2 Method of detecting lines (straight lines, dashed lines): apply Gaussian blur denoising to the central area image middle_img to obtain the image img3, detect lines with the hough_lines function of opencv, and filter out lines whose inclination angle is not within the range [70, 110] or whose length is less than 50, to obtain the line set line_set3. Traverse line_set3; if a line line_set3[i] satisfies either of the following conditions, the position of that line is the layout separator layout_line:

1) The length of the line is greater than 4/5 of the width of the image img3;

2) The length of the line is greater than 2/3 of the width of the image img3, the area below the lower end of the line is entirely blank, the area above the upper end of the line is the title, and the sum of the width of the title, the length of the line, and the width of the blank area below the line is greater than 4/5 of the width of the image img3;
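The angle and length filter of step 1.2.2.2.2 can be sketched as follows. The sketch assumes each detected line is a segment (x1, y1, x2, y2), the form a probabilistic Hough transform typically returns, and that the angle is measured in degrees in [0, 180); the function name filter_segments is hypothetical.

```python
import math


def filter_segments(segments, min_angle=70.0, max_angle=110.0, min_length=50.0):
    """Keep near-vertical segments that are long enough to be separator candidates."""
    kept = []
    for x1, y1, x2, y2 in segments:
        length = math.hypot(x2 - x1, y2 - y1)
        # Fold the direction into [0, 180) so endpoint order does not matter.
        angle = math.degrees(math.atan2(y2 - y1, x2 - x1)) % 180.0
        if min_angle <= angle <= max_angle and length >= min_length:
            kept.append((x1, y1, x2, y2))
    return kept
```

The remaining segments play the role of line_set3; conditions 1) and 2) would then be checked on each of them.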

1.2.2.3 If the layout separator layout_line is detected, divide the text area image img2 at layout_line to obtain two regions, rect1 and rect2. Repeat steps 1.2.2.1 and 1.2.2.2 on region rect1 to detect the separator layout_line1, and repeat them on region rect2 to detect the separator layout_line2. If layout_line1 and layout_line2 are both detected, divide img2 into four columns using layout_line, layout_line1 and layout_line2; if layout_line1 is not detected, divide img2 into two columns using layout_line, namely layout area 1 (Figure 7) and layout area 2 (Figure 8);

1.2.3 If no separator is detected in step 1.2.2, analyze a three-column layout;

1.2.3.1 Take the image left_img at 1/3 of the length of the text area image img2, its length is 1/5 of img2, and its width is the width of img2;

1.2.3.2 Repeat step 1.2.2.2 to detect the separator layout_line1. If it is detected, take the image right_img at 2/3 of the image img2, whose length is 1/5 of img2 and whose width is the width of img2, and repeat step 1.2.2.2 again to detect the separator layout_line2; if that is also detected, use layout_line1 and layout_line2 to divide the image img2 into three columns. If layout_line1 is not detected, the entire img2 is one column.

2. Perform text line detection on each layout, as shown in Figure 9 and Figure 10;

Record the detected layout area as layout_rects, traverse layout_rects, and perform text line analysis on layout_rects[i];

2.1 Binarize the picture img2 to get the picture binary_img2;

2.2 Use opencv's findContours function to obtain the outer edge contour collection contours of the image binary_img2;

2.3 Traverse contours, take the largest circumscribed rectangle of contours[i], and get rectangular box rects;

2.4 Merge rectangles: if the vertical distance between the center points of two rectangles is less than 8, and the horizontal distance between the center point of one rectangle and the center point of the other is within the preset range, merge the two rectangles;

2.5 Calculate the width of the text line: take the widths and heights of all rectangular boxes in rects, remove the abnormal maximum and minimum values, and for each height count the number of heights that fall within the range [height, height+C]. The height with the largest count is the width of the text line (C is an empirically chosen constant);
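A small sketch of the counting rule in step 2.5 follows. How the abnormal maximum and minimum values are identified is not stated, so the sketch simply drops the single smallest and single largest height as a stand-in; the function name estimate_line_height and the default c=3 are hypothetical.

```python
def estimate_line_height(heights, c=3):
    """Return the height h whose window [h, h + c] contains the most box heights."""
    if len(heights) < 3:
        raise ValueError("need at least three heights")
    # Assumption: outlier removal is approximated by dropping one extreme on each side.
    trimmed = sorted(heights)[1:-1]

    best_height, best_count = trimmed[0], 0
    for h in sorted(set(trimmed)):
        count = sum(1 for v in trimmed if h <= v <= h + c)
        if count > best_count:             # ties keep the smaller height
            best_height, best_count = h, count
    return best_height
```

For example, with box heights 2, 30, 31, 32, 30, 29, 80 and 12, the window starting at 29 covers five boxes, so 29 is returned as the text line width.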

2.6 Take text line width × F (F is a constant greater than 1, such as 1.4) as the benchmark and remove the rectangular boxes that exceed text line width × F; the remaining rectangular boxes are text boxes that may contain text;

2.7 Merge rectangles using text line width × 2 as the benchmark: if the vertical distance between two rectangles is less than 8 and the horizontal distance is less than text line width × 2, merge the two rectangles;

2.8 Traverse the rectangular boxes rects from left to right; if the center point of the current text box and the center point of the previous text box are roughly on a straight line, merge the current text box into the previous text box to obtain a small text line;

In one embodiment, the text line detection result of layout area 1 (FIG. 7) is shown in FIG. 9;

2.9 Repeat step 2.8 recursively to obtain the complete text lines text_lines.
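Steps 2.8 and 2.9 can be illustrated with the following sketch. It assumes each box is [x, y, w, h] with the origin at the top-left corner, and it treats "roughly on a straight line" as a vertical center distance below a tolerance y_tol; the names group_into_lines, bounding_rect and y_tol are hypothetical.

```python
def group_into_lines(rects, y_tol=8):
    """Group [x, y, w, h] boxes into text lines by scanning from left to right."""
    lines = []  # each line is a list of boxes, ordered left to right
    for rect in sorted(rects, key=lambda r: r[0]):
        cy = rect[1] + rect[3] / 2
        for line in lines:
            last = line[-1]
            # Join the line whose most recent box has a similar vertical center.
            if abs(cy - (last[1] + last[3] / 2)) < y_tol:
                line.append(rect)
                break
        else:
            lines.append([rect])           # no matching line: start a new one
    return sorted(lines, key=lambda line: line[0][1])


def bounding_rect(line):
    """Smallest [x, y, w, h] rectangle that encloses every box of a text line."""
    x1 = min(r[0] for r in line)
    y1 = min(r[1] for r in line)
    x2 = max(r[0] + r[2] for r in line)
    y2 = max(r[1] + r[3] for r in line)
    return [x1, y1, x2 - x1, y2 - y1]
```

The rectangle returned by bounding_rect corresponds to the circumscribed rectangle that is later cut out of img2 for recognition.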

3. Perform OCR recognition on the text lines of each layout, and merge to obtain the final test paper text paper_text;

3.1 Traverse the text_lines[i] of each layout, take the largest circumscribed rectangle area max_line_rect, and cut out the corresponding text image part_img from img2;

3.2 Input part_img into the pre-trained OCR model to generate text information;

3.3 Combine the above text information and text lines to get the final test paper text paper_text, as shown in Figure 10;

3.4 Build an OCR model;

3.4.1 Model data;

3.4.1.1 Use an existing text corpus to generate 4 million text samples text_data, each consisting of 10 characters;

3.4.1.2 For the above text data text_data, exclude duplicate characters and get the dictionary dict1;

3.4.1.3 Take the intersection of the dictionary dict1 and the GB 2312 (national standard) character set as the OCR recognition dictionary ocr_dict, and add Arabic numerals, English letters, punctuation marks, and special characters, keeping the total number of characters at around 4000, which effectively reduces the size of the model;

3.4.1.4 Sort the recognition dictionary ocr_dict in ascending order and encode each character, starting from 1;

3.4.1.5 Convert the text data text_data to the code representation ocr_index_data corresponding to the recognition dictionary ocr_dict;

3.4.1.6 Use PIL's drawtext function for the text data text_data to draw the text content on the 280*32 image to get the image set ocr_img_data;

3.4.1.7 Randomly take 1/3 of the image set ocr_img_data, add Gaussian noise, or image blur or image tilt;

3.4.1.8 Finally get the training data set ocr_img_data, ocr_index_data;
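Steps 3.4.1.2 to 3.4.1.5 can be sketched as below. The allowed character set is passed in as a plain string rather than loaded from GB 2312, and characters missing from the dictionary are skipped during encoding; both choices, and the names build_dictionary and encode, are assumptions made for illustration.

```python
def build_dictionary(text_data, allowed_chars, extra_chars):
    """Map each recognizable character to an integer code starting from 1."""
    seen = set("".join(text_data))                 # de-duplicated corpus characters (dict1)
    chars = sorted((seen & set(allowed_chars)) | set(extra_chars))
    # Codes follow ascending character order and start at 1.
    return {ch: index for index, ch in enumerate(chars, start=1)}


def encode(text, ocr_dict):
    """Convert a text sample to its code sequence (unknown characters are skipped)."""
    return [ocr_dict[ch] for ch in text if ch in ocr_dict]


ocr_dict = build_dictionary(["ab", "bc"], allowed_chars="abx", extra_chars="1")
ocr_index_sample = encode("abc1", ocr_dict)
```

Here "c" is dropped because it is outside the allowed set, while "1" is kept because it was added explicitly, mirroring the way digits and punctuation are appended in step 3.4.1.3.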

3.4.2 Model network;

Use DenseNet+CTC (dense convolutional network + connectionist temporal classification) to build the network, where DenseNet consists of 5 DenseBlocks (network blocks) with growth rate k = 4, as shown in Figure 11;

3.4.3 Model training;

The data generated in step 3.4.1 is divided into a training set and a validation set at a ratio of 9:1; the maximum number of training rounds is epochs=50, and training stops if the loss fails to improve for more than 3 rounds; the final training accuracy reached 0.993, and the accuracy on the validation set reached 0.986.
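The split and stopping rule of step 3.4.3 can be sketched without any deep learning framework. The stopping condition is stated only briefly, so the sketch assumes it means that training halts once the loss has gone more than 3 consecutive rounds without improving; the names split_dataset, stopping_epoch and patience are hypothetical.

```python
import random


def split_dataset(samples, train_parts=9, total_parts=10, seed=0):
    """Shuffle and split samples into training and validation sets (9:1 by default)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_parts // total_parts
    return shuffled[:cut], shuffled[cut:]


def stopping_epoch(losses, patience=3, max_epochs=50):
    """Return the 1-based epoch at which training would stop.

    Assumption: stop when the loss has not improved for more than `patience` epochs.
    """
    best = float("inf")
    stale = 0
    for epoch, loss in enumerate(losses[:max_epochs], start=1):
        if loss < best:
            best, stale = loss, 0
        else:
            stale += 1
            if stale > patience:
                return epoch
    return min(len(losses), max_epochs)
```

In a real training loop the list of losses would be produced epoch by epoch; passing a precomputed list keeps the rule easy to check in isolation.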

4. Extract the text information of the test paper from the text of the test paper;

Definition of test paper text information: test paper name, subject, unit, test type, test number area, name area, big question information (serial number, question type, score information, area, etc.);

4.1 The name of the test paper is extracted, as shown in Figure 10;

Traverse the first 5 lines of the test paper text. If there is one of the test paper name keywords in the first 5 lines of the test paper text, that line will be used as the test paper name. The test paper name keywords include: exam, test paper, test, test, simulation, etc.;

4.2 Subject extraction;

Traverse the first 5 lines of the test paper text. If there is one of the subject keywords in the first 5 lines of the test paper text, the keyword will be used as the subject. Subject keywords include: mathematics, Chinese, English, physics, chemistry, biology, geography, politics, history;

4.3 Unit extraction;

Traverse the first 5 lines of the text of the test paper, if there is an expression (unit *) in the first 5 lines of the text of the test paper, then this line is used as a unit;

4.4 Examination type extraction;

Traverse the first 5 lines of the test paper text. If there is one of the test type keywords in the first 5 lines of the test paper text, this keyword is used as the test type. The test type keywords include: mid-term, final, simulation, competition, etc.
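The keyword lookups in steps 4.1 to 4.4 share one pattern, sketched below. The keyword lists are the English renderings given in the text (the keywords actually used on the test papers are presumably Chinese), matching is case-insensitive by assumption, and the name first_keyword is hypothetical.

```python
SUBJECT_KEYWORDS = ["mathematics", "Chinese", "English", "physics", "chemistry",
                    "biology", "geography", "politics", "history"]
EXAM_TYPE_KEYWORDS = ["mid-term", "final", "simulation", "competition"]


def first_keyword(lines, keywords, head=5):
    """Return (keyword, line) for the first keyword found in the first `head` lines."""
    for line in lines[:head]:
        lowered = line.lower()
        for keyword in keywords:
            if keyword.lower() in lowered:
                return keyword, line
    return None
```

For the test paper name and the unit, the matched line is the result; for the subject and the examination type, the matched keyword is the result.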

4.5 Extraction of test number area;

4.5.1 If the test paper has a gutter, traverse the text in the gutter area. If one of the following exam number keywords is present, the text line area where the keyword is located is the start of the exam number area; expanding that area upwards gives the exam number area. The exam number keywords include: exam number, student number, admission ticket number, etc.;

4.5.2 If the test paper has no gutter, traverse the first 5 lines of the test paper text. If one of the exam number keywords in 4.5.1 is present, the text line area where the keyword is located is the start of the exam number area; expanding that area to the right gives the exam number area.

4.6 Name area extraction;

4.6.1 If the test paper has a gutter, traverse the text in the gutter area. If the keyword (name) is present, the text line area where the keyword is located is the start of the name area; expanding that area upwards gives the name area;

4.6.2 If the test paper has no gutter, traverse the first 5 lines of the test paper text. If the keyword (name) in 4.6.1 is present, the text line area where the keyword is located is the start of the name area; expanding that area to the right gives the name area;

4.7 Extraction of big question information;

Preset types of big questions: multiple-choice questions, calculation questions, application questions, fill-in-the-blank questions, answer questions, single-choice questions, multiple-choice questions, essay questions, non-choice questions, experimental questions, optional questions, selective examination questions, etc.;

4.7.1 Identify the position of the text line of the big question;

Traverse the text lines; if the current text matches the key characters of a big question, such as "Chinese number" + "big question type", "(" + "big question type" + ")", or "big question type", then that text line is the text line where the big question is located;

4.7.2 Take the text line identified in step 4.7.1 as the starting position of the area of the big question;

4.7.3 Take the matched "Chinese number" as the serial number of the big question;

4.7.4 Take the matched "big question type" as the question type of the big question;

4.7.5 Take the text of the big question and the next line of text, and match the following score rules as the score information of the big question;

Scoring rule 1:

1) This big question has a total of (\d{1,3}) small questions.*Each small question (\d{1,3}) points.*(Total|Full score)(\d{1,3}) points;

2) This big question has a total of (\d{1,3}) small questions.*Each small question (\d{1,3}\.\d) points.*(Total|Full score)(\d{1,3}\.\d) points;

The matched value is used as the number of small questions, the score of each small question, and the total score of the big question in turn;

Scoring Rule 2:

This big question has a total of (\d{1,3}) small questions.*(Total|Full score)(\d{1,3}) points;

The matched values are used, in turn, as the number of small questions and the total score of the big question.
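The two scoring rules can be tried out with Python's re module, as in the sketch below. The patterns follow the English rendering of the rules given here, so they are illustrative only (the patterns applied to real test papers would be written against the Chinese text); the integer and one-decimal variants are merged with an optional fractional part, and the function name parse_big_question_score is hypothetical. The result keys reuse the names that appear in the example output.

```python
import re

# Rule 1: number of small questions, score per small question, total score.
RULE_1 = re.compile(
    r"This big question has a total of (\d{1,3}) small questions"
    r".*Each small question (\d{1,3}(?:\.\d)?) points"
    r".*(?:Total|Full score) ?(\d{1,3}(?:\.\d)?) points"
)
# Rule 2: number of small questions and total score only.
RULE_2 = re.compile(
    r"This big question has a total of (\d{1,3}) small questions"
    r".*(?:Total|Full score) ?(\d{1,3}(?:\.\d)?) points"
)


def parse_big_question_score(text):
    """Apply rule 1 first, then rule 2; return None when neither matches."""
    match = RULE_1.search(text)
    if match:
        return {"number": match.group(1),
                "each_question_score": match.group(2),
                "total_score": match.group(3)}
    match = RULE_2.search(text)
    if match:
        return {"number": match.group(1), "total_score": match.group(2)}
    return None
```

Rule 1 is tried first because any text that satisfies it also satisfies rule 2, which would otherwise discard the per-question score.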

5. Extract sub-question information from the text;

Definition of sub-question information: serial number, question type, score information, area, etc.;

5.1 According to the position of the big question, get the text big_question_texts of each big question;

5.2 Traverse the text big_question_texts and take out the text lines that match the following rule as the starting positions of the candidate sub-questions;

Key characters of the sub-question: "Arabic numerals" + ", |.";

5.3 Filter the candidate questions through the following features;

1) Let the abscissa of the starting position of the big question be big_coordinate_x; if the abscissa of a candidate sub-question lies outside the preset range around big_coordinate_x (too far to the right or to the left), delete that sub-question;

2) If the abscissa of the sub-question coordinates does not satisfy the increasing sequence number, the sub-question will be deleted;

5.3.1 The remaining serial number is the serial number of the sub-topic below the main question, and the corresponding text line area is used as the starting position of the sub-topic area;

5.3.2 The ending position of each question is the starting position of the next question. If it reaches the end of the layout, the end of the layout is the ending position of the question area;
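Steps 5.2 and 5.3 can be sketched as a single pass over the text lines of one big question. The sketch assumes each text line is a (text, [x, y, w, h]) pair, treats feature 1) as a tolerance x_tol around big_coordinate_x, and reads feature 2) as requiring strictly increasing serial numbers; these readings, the separator set in the pattern, and the name filter_sub_questions are assumptions.

```python
import re

# Candidate rule: Arabic numerals followed by a separator such as "." or ",".
SUB_QUESTION = re.compile(r"^\s*(\d{1,3})\s*[.,、．]")


def filter_sub_questions(lines, big_x, x_tol=20):
    """Return (serial number, box) pairs for the lines that start a sub-question."""
    kept = []
    last_number = 0
    for text, (x, y, w, h) in lines:
        match = SUB_QUESTION.match(text)
        if not match:
            continue                        # not a candidate at all
        number = int(match.group(1))
        if abs(x - big_x) > x_tol:
            continue                        # feature 1: too far from the big question's x
        if number <= last_number:
            continue                        # feature 2: serial numbers must increase
        kept.append((number, [x, y, w, h]))
        last_number = number
    return kept
```

Following 5.3.2, the area of each kept sub-question would then run from its own starting line to the starting line of the next kept sub-question, or to the end of the layout.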

5.4 Extract sub-question score information;

If the text of the sub-question can match the following rules, take the corresponding result as the score information;

Rule 1: ((\d{1,3}) points);

Rule 2: ((\d{1,3}\.\d) points);

Rule 3: This sub-question ((total|full score)?)(\d{1,3}) points;

Rule 4: This sub-question ((Total|Full score)?)(\d{1,3}\.\d) points.

6. Output all test paper information.

Taking Figure 6 as an example, the layout analysis results are as follows: the gutter area is represented as [5,5,214,2330]; the layout areas are represented as [235,5,1505,2330] and [1746,5,1559,2330];

Among them, the text line detection result of layout area 1 (Figure 7) is shown in Figure 9. The text line detection result of the test paper image is subjected to OCR recognition to obtain the text information, and the text information and the text line are merged to obtain the following result:

[['______ School 2013-2014 school year midterm self-examination papers for the first semester', [104,120,363,52]]]

[['Seventh grade_______', [104, 120, 363, 52]]]

[['(Exam time minutes, full marks', [104,410,363,32]]]

[['', [211,467,1172,230]]]

[['Caution: Use blue and black steel', [85,713,846,33]]]

[['One. Multiple choice questions (9 sub-questions in this big question, 45.0 points in total)', [77,759,561,30]]]

[['1. Set A={xr\\^2-4x-3<0}, B={x[X-3>0}, then A∩B=)', [87,802,906, 34]]]

[['(-,-', [251,857,90,51]]]

[['2. Function V=2x\\^2-e\\^-in [-2,',[84,932,634,34]]]

[['A.', [142,985,31,276]]]

[['', [193,1291,307,272]],['D',[784,1291,24,272]]]

[['3. Known arithmetic sequence {a-first 9 items', [84,1584,817,32]]]

[['A.100', [142, 1629, 85, 26]]]

[['4. Put the function V=2sin(X=)', [83, 1683, 1041, 46]], ['', [628, 1683, 38, 46]]]

[['A.', [141, 1770, 32, 46]]]

[['5.⊿ABC's inner angle A, B, Ci', [84, 1856, 1196, 40]]]

[['', [1096,1890,12,17]]]

[['v', [250, 1930, 34, 32]]]

[['', [1173, 1976, 295, 346]]]

[['6. Function y = part of Asim(ox-p)', [84,1974,689,32]]]

[['A.', [141, 2032, 31, 44]]]

[['ν=2s such as /x-', [244,2119,198,44]]]

[['Page 1/A total of 4 pages', [703, 2205, 141, 25]]]

[['C.', [79,209,28,44]]]

[['V=2snr-', [182,295,145,44]]]

[['7. Known even function) is in the interval [①,',[20,382,1114,51]],['',[841,382,28,51]]]

[['I2', [200,469,38,17]]]

[['', [505,469,69,51]],['C.',[720,468,29,52]]]

[['', [186,484,68,36]]]

[['8. Let the line l pass through one of the ellipses', [21,556,1388,47]]]

[['',[504,643,10,17]],['',[823,643,14,16]],['',[1146,642,13,17]]]

[['A.', [78,654,31,26]]]

[['',[504,677,12,16]],['',[824,677,12,16]],['',[1146,676,13,13]]]

[['',[1041,719,162,219]],['',[1306,718,97,220]]]

[['9. The figure is a combination of cylinder and cone', [20,717,961,29]]]

[['is ()', [79,754,93,36]]]

[['20π', [185,849,46,24]]]

[['24π', [78,893,29,26]]]

[['28π', [183,937,46,23]]]

[['32π', [185,958,46,99]],['',[1114,958,98,99]]]

[['2. Fill in the blanks (4 sub-topics in this big question, 20.0 points in total)', [13,1116,560,30]],['',[24,1116,21,30]]]

[['10.⊿ABC's inner angles A, B, and C are opposite sides of a, b, c, respectively, if cosA=-, cosC=', [23,1171,917,49]],['',[816,1172,21,48]],['',[950,1171,32,49]],['a=1, then b=',[1002,1171,261,49]]]

[['11. Known hyperbola C:', [23,1259,277,52]], ['5--=(a>0,b>) the right term', [280,1259,1119,52]], ['', [329, 1259, 25, 52]]]

[['An asymptote intersects at two points M and N.', [80,1332,883,33]]]

[['12. If the straight line y=x-b is the curve p', [23,1376,1153,32]]]

[['13. Curve V=x\\^2-at point/], 2', [23,1432,637,51]],['',[243,1432,38,51]]]

[['Three, answer questions (this big question has 10 sub-questions, a total of 120.0 points)', [13,1506,588,30]],['',[23,1506,23,30]],['', [25, 1506, 19, 30]]]

[['14.⊿The opposite sides of the inner angles A, B, and C of ABC are a, b, and c, respectively. It is known that 2cosC(acosB-bcosA)=c.', [23,1550,1057,32]]]

[['(I seek C;', [80,1594,111,30]]]

[['lⅡIf c=, ⊿Area of ABC', [80,1650,674,40]]]

[['', [487,1686,12,16]]]

[['15.⊿ABC's inner corners A, B, C', [23,2041,978,50]]]

[['ll) seek cosB;', [79, 2115, 128, 32]]]

[['2) If a-c=6, ⊿Area of ABC', [79, 2158, 511, 32]]]

[['Page 2/A total of 4 pages', [648, 2205, 142, 25]]]

As shown in Figure 10, after processing the coordinate information of the text of the target test paper image, the following results are obtained:

Test paper name:'______ School 2013-2014 midterm self-examination paper for the first semester of the school year'

Subject: ''

Unit: ''

Exam type:'midterm'

Exam number area: [60, 150, 60, 200]

Name area: [60, 1640, 60, 200]

['______ School 2013-2014 midterm self-examination paper for the first semester of the school year']

['One, multiple-choice questions','big', ['One, multiple-choice questions (9 sub-questions in this big question, 45.0 points in total)'], [312,756,1428,43], {'total_score':'45.0','number':'9','each_question_score':'5.0'}]

['1.','small', ['1. Set set A={xr\\^2-4x-3<0}, B={x[X-3>0}, then A∩B=) ','(-,-'], [322,807,1418,114], {'score':'5.0'}]

['2.','small',['2. Function V=2x\\^2-e\\^-in [-2,','A.','D'], [319,929,1421,645], {'score':'5.0'}]

['3.','small', ['3. Known arithmetic sequence {a-first 9 items','A.100'], [319,1583,1421,87], {'score':'5.0'}]

['4.','small', ['4. Change the function V=2sm (X=÷)','A.'], [318,1678,1422,159], {'score':'5.0'}]

['5.','small',['5.⊿ABC's inner corners A, B, Ci', '','v'], [319,1845,1421,124], {'score':'5.0'}]

['6.','small', ['6. Function y=Asin/ox-p) part', '','A.','ν=2s such as /x-',' page 1 / total 4 pages'], [[319,1977,1421,350],[1766,214,1523,157]], {'score':'5.0'}]

['7','small', ['7. Even function is known) in the interval [①,','C.','I2', ''], [1766,379,1539,162], {'score':'5.0'}]

['8','small',['8. Let the line l pass through one of the ellipses','','A.',''], [1767,550,1538,156], {'score':'5.0'}]

['9.','small',['9. As shown in the figure is a combination of cylinder and cone', '',' is:)', '20 尔','B.', '28T', '327'], [1766,714,1539,399], {'score':'5.0'}]

['Two, fill in the blanks','big', ['two, fill in the blanks (4 sub-questions in this big question, 20.0 points in total)'], [1759,1121,1546,38], {'total_score':'20.0','number':'4','each_question_score':'5.0'}]

['10.','small',['10.⊿ABC's inner angles A, B, and C are opposite sides of a, b, c, respectively, if cosA=-, cosC=a=l, then b='], [1769,1168,1536,72], {'score':'5.0'}]

['11.','small', ['11. Knowing hyperbola C: 5--= (a>0, b>0) right term',' an asymptote intersects at two points M and N.'], [1769,1249,1536,122], {'score':'5.0'}]

['12.','small', ['12. If the straight line y=x-b is the curve p'], [1769,1380,1536,41], {'score':'5.0'}]

['13.','small',['13.Curve V=x\\^2-at point/],2'], [1769,1429,1536,66], {'score':'5.0'}]

['Three, answer questions','big', ['three, answer questions (this big question has 10 sub-questions, a total of 120.0 points)'], [1759,1504,1546,40], {'total_score':'120.0','number':'10','each_question_score':'12.0'}]

['14.','small',['14.⊿The opposite sides of the inner angles A, B, and C of ABC are a, b, and c respectively. It is known that 2cosCracosB-bcos4)=c.','(I find C ;','LⅡIf c=, the area of dABC',''], [1769,1552,1536,486], {'score':'12.0'}]

['15.','small',['15.⊿ABC's inner angles A, B, C','ll) find cosB;','2) If ac=6, ⊿area of ABC','2 Page/A total of 4 pages'], [1769,2046,1536,189], {'score':'12.0'}]

The results of extracting information about the type of questions are as follows:

['One','multiple choice question', [312,756,1428,43], {'total_score':'45','number':'9','each_question_score':'5'}]

['1','Multiple choice question', [322,807,1418,114], {'score':'5'}]

['2','Multiple choice question', [319,929,1421,645], {'score':'5'}]

['3','Multiple choice question', [319,1583,1421,87], {'score':'5'}]

['4','multiple choice question', [318, 1678, 1422, 159], {'score':'5'}]

['5','Multiple choice', [319,1845,1421,124], {'score':'5'}]

['6','Multiple choice question', [[319,1977,1421,350], [1766,214,1523,157]], {'score':'5'}]

['7','Multiple choice question', [1766,379,1539,162], {'score':'5'}]

['8','Multiple choice question', [1767,550,1538,156], {'score':'5'}]

['9','Multiple choice', [1766,714,1539,399], {'score':'5'}]

['Two','fill in the blanks', [1759,1121,1546,38], {'total_score':'20','number':'4','each_question_score':'5'}]

['10','fill in the blanks', [1769, 1168, 1536, 72], {'score':'5'}]

['11','fill in the blanks', [1769, 1249, 1536, 122], {'score':'5'}]

['12','fill in the blanks', [1769, 1380, 1536, 41], {'score':'5'}]

['13','fill in the blanks', [1769, 1429, 1536, 66], {'score':'5'}]

['Three','Question for solution', [1759, 1504, 1546, 40], {'total_score': '120','number': '10','each_question_score': '12'}]

['14','Problem to solve', [1769, 1552, 1536, 486], {'score':'12'}]

['15','Problem to solve', [1769, 2046, 1536, 189], {'score':'12'}]

In this embodiment, line detection and blank area detection are used to analyze the test paper layout, so that the typesetting information of the test paper can be identified automatically. An OCR method based on a deep learning convolutional neural network performs accurate text recognition on the test paper image, and natural language processing technology extracts multiple types of text information from the test paper. This not only enables efficient and accurate automatic scoring but also broadens the scope of application of the system, thereby effectively reducing the workload of educators and meeting the varied needs of users.

According to an embodiment of the second aspect of the present application, a test paper information extraction system 500 is proposed. As shown in FIG. 12, the system includes a memory 502, a processor 504, and a computer program stored in the memory 502 and executable on the processor 504. When the processor 504 executes the computer program, the method for extracting test paper information in any of the foregoing embodiments is implemented.

According to an embodiment of the third aspect of the present application, a computer-readable storage medium is provided, on which a computer program is stored, and when the computer program is executed by a processor, the steps of the test paper information extraction method as in any of the above embodiments are implemented.

In the description of this specification, the terms "first" and "second" are used for descriptive purposes only and cannot be understood as indicating or implying relative importance, unless expressly stipulated and limited otherwise; the terms "connected", "installed" and "fixed" should be understood in a broad sense. For example, a "connection" can be a fixed connection, a detachable connection, or an integral connection; it can be a direct connection or an indirect connection through an intermediate medium. For those of ordinary skill in the art, the meaning of the above-mentioned terms in this application can be understood according to the specific situation.

In the description of this specification, the terms "one embodiment", "some embodiments", "specific embodiments", etc. mean that the features, structures, materials or characteristics described in conjunction with the embodiment or example are included in at least one embodiment or example of the application. In this specification, the schematic representations of the aforementioned terms do not necessarily refer to the same embodiment or example. Moreover, the described features, structures, materials or characteristics can be combined in any one or more embodiments or examples in a suitable manner.

Claims

1. A method for extracting test paper information, comprising:

Preprocess the test paper image to obtain a binary image;

Determining the layout area of the binary image;

Obtaining the text line of the test paper image according to the layout area;

Extract a text image according to the text line;

Input the text image into a text recognition model to obtain the text information of the test paper image;

Correspondingly merge the text information and the text line to obtain a target test paper image;

Extract the test paper information of the target test paper image according to the classification label.
2. The method for extracting test paper information according to claim 1, wherein determining the layout area of the binary image comprises:

Determining the sub-image of the binary image according to the first preset size;

Detecting the lines of the sub-image;

In a case where the length of a line of the sub-image falls within the preset length range, and both the area between the first end of the line and the first edge of the sub-image and the area between the second end of the line and the second edge of the sub-image are blank areas, using the line of the sub-image as a binding line;

Determining the text area of the binary image according to the binding line;

Determine the central area of the text area according to the second preset size;

Detecting a separation symbol in the central area;

In response to the detection result of detecting a separator in the central area, the layout area is determined according to the separator.
3. The method for extracting test paper information according to claim 2, wherein determining the layout area of the binary image further comprises:

In response to the detection result that the separation symbol is not detected in the central area, determine the segmentation area of the text area according to a third preset size;

Detecting the separation symbol in the segmentation area;

In response to a detection result of detecting the separation symbol in the divided region, determining the layout area according to the separation symbol;

In response to the detection result that the separation symbol is not detected in the divided area, the text area is taken as the layout area.
4. The method for extracting test paper information according to claim 3, wherein detecting the separation symbol comprises:

Performing projection processing on the central area or the segmented area to obtain the blank area of the binary image;

In a case where the width of the blank area is greater than the width threshold, the blank area is used as the separation symbol.
5. The method for extracting test paper information according to claim 3, wherein detecting the separation symbol comprises:

Performing at least one of blurring and denoising processing on the central area or the segmented area to obtain lines of the binary image;

Filtering the lines of the binary image according to a preset angle range and a preset length threshold to obtain a target line;

In a case where the length of the target line is greater than a first preset length, or where the length of the target line is greater than a second preset length and the sum of the width of the title area between the first end of the target line and the first edge of the binary image, the width of the blank area between the second end of the target line and the second edge of the binary image, and the length of the target line is greater than the first preset length, using the target line as the separation symbol.
6. The method for extracting test paper information according to claim 1, wherein obtaining the text line of the test paper image according to the layout area comprises:

Identifying the rectangular frame in the layout area;

Determine the width of the text line according to the width of the rectangular frame;

Determining a text box according to the width of the text line;

In a case where the vertical distance between the center point of the current text box and the center point of the previous text box is less than a first distance threshold, and the horizontal distance between the center point of the current text box and the center point of the previous text box is less than a second distance threshold, merging the current text box and the previous text box to obtain a text line;

In a case where the vertical distance between the center point of the current text box and the center point of the previous text box is greater than or equal to the first distance threshold, or the horizontal distance between the center point of the current text box and the center point of the previous text box is greater than or equal to the second distance threshold, treating the current text box and the previous text box each as one text line.
7. The method for extracting test paper information according to claim 6, wherein determining the width of the text line according to the width of the rectangular frame comprises:

Filter the widths of all the rectangular frames according to the preset width range to obtain multiple target widths;

Counting the number of rectangular frames corresponding to each target width in the plurality of target widths;

The target width corresponding to the largest number of rectangular frames is selected as the text line width.
8. The method for extracting test paper information according to claim 6, wherein before determining the width of the text line according to the width of the rectangular frame, the method further comprises:

In a case where the vertical distance between the center point of the current rectangular frame and the center point of the previous rectangular frame is less than a third distance threshold, and the horizontal distance between the center point of the current rectangular frame and the center point of the previous rectangular frame is less than a fourth distance threshold, merging the current rectangular frame and the previous rectangular frame.
9. The method for extracting test paper information according to claim 1, wherein before inputting the text image into a character recognition model to obtain the text information of the test paper image, the method further comprises:

Get text data and character data;

Encoding the text data and the character data to obtain a recognition dictionary; determining a text image set according to the text data;

According to the recognition dictionary and the text image set, the text recognition model is constructed.
10. The method for extracting test paper information according to claim 1, wherein the classification label includes a title, a big question, and a small question; and extracting the test paper information of the target test paper image according to the classification label comprises:

Determine the title text line, the big title text line and the small title text line respectively according to the classification key characters corresponding to the title, the big title and the small title respectively;

Extract the test paper information according to the title text line, the big question text line and the small question text line.
11. The method for extracting test paper information according to claim 10, wherein before extracting the test paper information according to the title text line, the big question text line and the small question text line, the method further comprises:

Perform coordinate information processing on the target test paper image;

When the abscissa of the subtitle text line exceeds the preset coordinate range, or the abscissa of the subtitle text line does not satisfy the sequence number increasing rule, the subtitle text line is deleted.
12. A test paper information extraction system, comprising a memory, a processor, and a computer program stored on the memory and capable of running on the processor, wherein the processor, when executing the computer program, implements the method for extracting test paper information according to any one of claims 1 to 11.
13. A computer-readable storage medium that stores a computer program, wherein the computer program, when executed by a processor, implements the method for extracting test paper information according to any one of claims 1 to 11.