CN110765907A - System and method for extracting paper document information of test paper in video based on deep learning


Info

Publication number
CN110765907A
CN110765907A
Authority
CN
China
Prior art keywords
test paper
detection
video
formula
character
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN201910969725.4A
Other languages
Chinese (zh)
Inventor
严军峰
邱英秋
陈家海
叶家鸣
吴波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Seven Days Education Technology Co Ltd
Original Assignee
Anhui Seven Days Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Seven Days Education Technology Co Ltd
Priority to CN201910969725.4A
Publication of CN110765907A
Withdrawn (current)

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06V: IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V 30/00: Character recognition; recognising digital ink; document-oriented image-based pattern recognition
    • G06V 30/40: Document-oriented image-based pattern recognition
    • G06V 30/41: Analysis of document content
    • G06V 30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
    • G06V 30/413: Classification of content, e.g. text, photographs or tables
    • G06V 30/10: Character recognition
    • G06V 10/00: Arrangements for image or video recognition or understanding
    • G06V 10/20: Image preprocessing
    • G06V 10/30: Noise filtering
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00: Computing arrangements based on biological models
    • G06N 3/02: Neural networks
    • G06N 3/04: Architecture, e.g. interconnection topology
    • G06N 3/045: Combinations of networks

Abstract

The invention relates to the technical field of image object detection and recognition, and discloses a system and method for extracting the paper document information of test papers in a video based on deep learning. The method mainly comprises document page extraction, chart detection, text region detection, text line detection, formula detection, OCR recognition, and post-processing. The system provides a new way of extracting paper document information from video for test papers, fills a gap in extracting test paper text information from video, and makes transcription of test paper documents convenient. The method mainly targets filmed video data covering common test papers such as mathematics, Chinese, and English, and automates information extraction from video analysis through to an electronic version of the test paper content. Extracting the paper document information of test papers in a video means filming a video of several ordinary test papers; the method then automatically extracts the document information of every test paper in the video, thereby converting test paper video data into an electronic version automatically.

Description

System and method for extracting paper document information of test paper in video based on deep learning
Technical Field
The invention relates to the technical field of image object detection and recognition, and in particular to a system and method for extracting the paper document information of test papers in a video based on deep learning.
Background
In recent years, with the continuous development of artificial intelligence, deep learning based on convolutional neural networks has produced a wide range of algorithms in fields such as image and speech recognition, image classification, video analysis, and object tracking, in some cases surpassing human performance and greatly advancing image processing technology. Many research results built on deep learning are now widely applied in scenarios such as face recognition, video surveillance, autonomous driving, and unattended stores. As the technology matures, more artificial intelligence products with deep learning at their core are landing in industries such as transportation, healthcare, and education.
At present, artificial intelligence products based on deep learning are gradually entering the education industry, mainly in scenarios such as campus security, separation of text and graphics on test papers, and test paper text recognition. A test paper document carries important information: the test questions themselves, the knowledge points they cover, the distribution of difficulty, and the distribution of examination points. Extracting and analyzing this information supports statistical analysis across large numbers of similar papers, helps teachers select and recombine questions, and eases the transmission and storage of the information. The invention offers a scheme for these problems: a large batch of test paper documents is filmed as a video, the video is analyzed with deep learning techniques, and end-to-end batch extraction of test paper document information is achieved, with many papers filmed once and processed in one pass. This greatly simplifies the storage and transmission of test paper document information and provides data support for subsequent large-scale question bank construction, knowledge-point division, question analysis, and the like.
According to the characteristics of test paper documents, the method integrates several existing deep learning techniques to extract the paper document information of test papers in a video. It can conveniently analyze test papers captured in video form and completely extract the information in every test paper document in the video, simplifying the storage and transmission of that information and achieving end-to-end batch extraction: multiple test papers, filmed once, processed once.
Disclosure of Invention
Technical problem to be solved
Aiming at the problems of extracting test paper document information from video and of storing and transmitting test paper documents, the invention provides a method for extracting paper document information in a video based on deep learning. Deep learning is introduced into video analysis for test papers so that test paper document information can be extracted directly from video. This addresses both the extraction itself and the transmission and storage of the information, and greatly improves the efficiency of automatic test paper information extraction.
(II) technical scheme
In order to achieve the above purpose, the invention provides the following technical scheme: a method for extracting the paper document information of test papers in a video based on deep learning, characterized in that the system is based on deep learning and mainly comprises the steps of document page extraction, chart detection, text region detection, text line detection, formula detection, OCR recognition, and post-processing.
Preferably, the main features are described as follows. The document page extraction algorithm analyzes the filmed page-turning video and extracts every distinct test paper page from it; the number of output pages matches the number of pages the user filmed. The chart detection step detects pictures and tables on each extracted page, locating chart regions so that their content can be filtered out during post-processing. The text region detection step locates the text region within each extracted page, filtering out noise regions caused by varying camera distance during filming and keeping only the document region to be extracted. The text line detection step detects every text line on the detected page, covering all line types that commonly appear on test papers, including Chinese and English. The formula detection step checks whether each text line contains a formula, providing the basis for the subsequent split between character recognition and formula recognition. The OCR recognition step is divided into character recognition and formula recognition and is responsible for recognizing the characters or formulas in all input sequences. The post-processing step integrates the chart detection and OCR results and reassembles the extracted document information for output.
Preferably, document page extraction is described as follows: the filmed test paper video is decoded into individual frames, and a lightweight MobileNetV2 classification network judges whether each frame is a document page. A non-document frame is labeled 0 and a document frame 1. While filming, the photographer captures several different test papers, with stable shots of each page separated by non-page content; the stable duration per paper varies, but the camera's frame rate is fixed and produces many frames per second. Document page extraction therefore yields a sequence of consecutive 1s and 0s, such as [1111110000011111000000], giving the MobileNetV2 classification of every frame; from this example sequence one can tell the video contains 2 test paper pages. Finally, the sharpest frame within each run is selected and output as the image of the current test paper. This step recovers the frame images of all filmed test paper pages in the video, which are then analyzed and mined for information, as in the sketch below.
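For illustration only, a minimal sketch of turning such a per-frame 0/1 sequence into page segments might look as follows. The function name is an assumption, and the minimum run length of 5 is taken from the detailed embodiment later in this description; this is not the patent's published code.

```python
# Sketch under stated assumptions: group per-frame 0/1 classifications
# into runs of consecutive 1s; each long-enough run is one filmed page.
from itertools import groupby

def split_into_pages(frame_flags, min_run=5):
    """Return (start_index, end_index) pairs, one per stable page run."""
    pages, pos = [], 0
    for value, run in groupby(frame_flags):
        length = len(list(run))
        if value == 1 and length >= min_run:
            pages.append((pos, pos + length - 1))
        pos += length
    return pages

# The [1111110000011111000000] sequence from the text yields 2 pages:
flags = [1,1,1,1,1,1,0,0,0,0,0,1,1,1,1,1,0,0,0,0,0,0]
print(split_into_pages(flags))  # [(0, 5), (11, 15)]
```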
Preferably, chart detection is described as follows: this step analyzes the extracted images and uses a Faster R-CNN network to detect pictures and tables, obtaining the coordinates of chart regions within the test paper page and providing a reference for post-processing of the later recognition results.
Preferably, text region detection is described as follows: an SSD detector locates the text region in each extracted image in order to filter out noise from non-text areas. Because varying camera distance during filming introduces background clutter around the paper, detecting the text region narrows the image area processed by later steps, reduces the influence of noise data, and improves the accuracy of test paper document information extraction.
Preferably, text line detection is described as follows: the natural-scene text detector PixelLink serves as the text line detection network. Test paper images in video are often tilted because of unstable filming, which makes them hard to localize with conventional axis-aligned detectors. The method therefore uses PixelLink's four-point localization: even when the text region is tilted, four corner coordinates are detected for each text line, and the line can be aligned by perspective transform into standard input for the subsequent OCR step (a sketch of this alignment follows). This step takes the text region detected in the previous step as input and outputs all text lines detected by PixelLink.
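As a rough illustration of the alignment step, the four detected corner points can be warped to an axis-aligned rectangle with OpenCV. The helper below is a sketch written for this description, not the patent's code; the corner ordering is an assumption.

```python
# Sketch: rectify a tilted text line from its four corner points.
import cv2
import numpy as np

def rectify_text_line(frame, quad):
    """quad: 4x2 float32 array of corners ordered TL, TR, BR, BL."""
    tl, tr, br, bl = quad
    w = int(max(np.linalg.norm(tr - tl), np.linalg.norm(br - bl)))
    h = int(max(np.linalg.norm(bl - tl), np.linalg.norm(br - tr)))
    dst = np.array([[0, 0], [w - 1, 0], [w - 1, h - 1], [0, h - 1]],
                   dtype=np.float32)
    M = cv2.getPerspectiveTransform(quad, dst)
    return cv2.warpPerspective(frame, M, (w, h))
```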
Preferably, formula detection is described as follows: this step uses the EAST algorithm as the formula detection network. Its goal is to analyze the image of each text line detected in the previous step and decide whether the line contains a formula, so that the formula area can be cropped out of the line for separate recognition. The step relies on EAST's multi-feature joint prediction: a formula is a small target, and predictions from feature maps at several scales are needed to ensure every formula region is detected. Because the EAST network combines 4 feature maps of different scales when detecting targets, it can locate formulas within text lines accurately.
Preferably, OCR recognition is described as follows: the goal of this step is to recognize all characters in the text lines and detected formulas, completing the extraction of document information from the test paper images. OCR recognition is split into character recognition and formula recognition; since the two use different algorithms, they must be recognized separately. The position coordinates of the text and formula areas within each line are known from text line detection and formula detection, so the corresponding regions are cropped from the original image; text regions are fed to the character recognition engine and formula regions to the formula engine, and together the two branches recognize all characters and formulas on the paper.
Preferably, post-processing is described as follows: based on the results of chart detection, character recognition, and formula recognition, the recognition output is rearranged and emitted according to the layout of the original test paper, yielding the information finally extracted from the paper document.
Preferably, the method comprises the following specific steps:
step one, simulating training data: the method aims at all detection and identification models related to the process of extracting paper document information of test paper in a video, 7 different models need to be trained independently, each model needs a large amount of training data as support, all training data are generated by adopting a program in the method, the training data of the detection model select a background picture at random, then lines, charts and formulas are added according to different models, and the information of adding position coordinates is recorded. The simulated picture and the label file have the same name, the label file records the position coordinates of a formula, a character line or a chart in the corresponding picture, the storage form is [ xmin ymin xmax ], and a plurality of detection targets are sequentially added according to the form. The OCR training data simulation process corresponds to indexes of characters written in pictures in a dictionary table in a label file, the label of the training data is identified to be in a latex format by a formula, and the data preparation process simulates more than 100 pieces of samples.
Step two, data preprocessing: rich data augmentation greatly improves model generalization, so during training all data undergo varying degrees of random cropping, rotation, blurring, and similar operations. In the training data, OCR input images are 32 pixels high; document page extraction and chart detection inputs are 224x224; text line detection inputs are 1024x768; text region detection inputs are 608x608; and formula detection inputs are 1280x192. All images are normalized to the range -1 to 1. Training proceeds batch by batch, each batch sampled at random from the original images, and OCR input data are uniformly converted to grayscale (see the sketch below).
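The stated sizes and normalization can be summarized in a small preprocessing helper; the dictionary keys and function names below are assumptions for illustration.

```python
# Sketch: resize to each model's input size and normalize to [-1, 1];
# OCR input is additionally converted to grayscale at height 32.
import cv2
import numpy as np

INPUT_SIZES = {                      # (width, height), as listed above
    "page": (224, 224), "chart": (224, 224), "region": (608, 608),
    "line": (1024, 768), "formula": (1280, 192),
}

def preprocess(image, model):
    resized = cv2.resize(image, INPUT_SIZES[model])
    return resized.astype(np.float32) / 127.5 - 1.0   # [0,255] -> [-1,1]

def preprocess_ocr(image, target_height=32):
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    scale = target_height / gray.shape[0]
    new_w = max(1, int(gray.shape[1] * scale))
    gray = cv2.resize(gray, (new_w, target_height))
    return gray.astype(np.float32) / 127.5 - 1.0
```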
Step three, training the neural networks: following the steps above, the MobileNetV2, Faster R-CNN, SSD, PixelLink, EAST, and OCR recognition models are trained in sequence. Training is end-to-end throughout, with the network hyper-parameters set as follows (a configuration sketch appears after the list):
(1) learning rate: the initial learning rate is set to 0.01 and reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen per model according to how training behaves);
(3) other: the batch size is set to 8, varying with available GPU memory; the total number of training rounds is 200.
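A minimal sketch of these hyper-parameters in Keras terms follows. The model, dataset, and loss are placeholders; only the schedule (0.01 initial, minus 10% every 10 rounds), the choice of Adam or SGD, batch size 8, and 200 rounds come from the text.

```python
# Sketch: training configuration matching the stated hyper-parameters.
import tensorflow as tf

def lr_schedule(epoch, lr):
    # absolute schedule: 0.01 reduced by 10% every 10 training rounds
    return 0.01 * (0.9 ** (epoch // 10))

def train(model, dataset, use_adam=True):
    optimizer = (tf.keras.optimizers.Adam(0.01) if use_adam
                 else tf.keras.optimizers.SGD(0.01))
    model.compile(optimizer=optimizer, loss="categorical_crossentropy")
    model.fit(dataset.batch(8), epochs=200,
              callbacks=[tf.keras.callbacks.LearningRateScheduler(lr_schedule)])
```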
Step four, post-processing: each model is converted to a pb file and the pb files are chained in sequence, the output of one model serving as the input of the next; recognition results from duplicate test paper pages are filtered out and the rest are output verbatim. Through this series of steps, a test paper document video goes in and the extracted document information comes out; a loading sketch follows.
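Chaining frozen pb graphs is typically done in the TensorFlow 1.x style sketched below; the file and tensor names are assumptions, since the patent does not publish its export details.

```python
# Sketch: load a frozen .pb graph and run it; one stage's output feeds
# the next stage's preprocessing and cropping.
import numpy as np
import tensorflow.compat.v1 as tf

def load_frozen_graph(pb_path):
    graph = tf.Graph()
    with graph.as_default():
        graph_def = tf.GraphDef()
        with tf.gfile.GFile(pb_path, "rb") as f:
            graph_def.ParseFromString(f.read())
        tf.import_graph_def(graph_def, name="")
    return graph

page_graph = load_frozen_graph("page_classifier.pb")   # hypothetical name
frames = np.zeros((1, 224, 224, 3), np.float32)        # placeholder batch
with tf.Session(graph=page_graph) as sess:
    # tensor names depend on how the model was exported; assumed here
    flags = sess.run("output:0", feed_dict={"input:0": frames})
```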
(III) advantageous effects
The invention provides a method for extracting the paper document information of test papers in a video based on deep learning, with the following beneficial effects:
(1) In view of the current situation, the invention provides a method for extracting the paper document information of test papers in a video based on deep learning. This means filming a video of one or more ordinary test paper documents; the method automatically extracts the document information of every test paper in the video, converting test paper video data into an electronic version automatically. Introducing deep learning into the extraction of test paper document information from video solves both the extraction itself and the transmission and storage of the information, and greatly improves the efficiency of automatic extraction.
(2) The invention introduces deep learning into test paper document information extraction and realizes it for video. Tailored to the characteristics of test paper video data, it creatively integrates existing object detection and OCR methods into a complete pipeline for extracting paper document information from video, capable of handling various complex papers including mathematics papers, and provides solid data support for subsequent large-scale test paper recombination analysis and knowledge-point assessment.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention;
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example (b):
as shown in fig. 1, the present invention provides a technical solution: a method for extracting the paper document information of test papers in a video based on deep learning, comprising the following parts:
the document page extraction section: firstly, reading a shot video test paper containing one or more test papers by adopting a video Capture in opencv, and storing all the read video frames into a frame _ list. And respectively inputting each frame image into a Mobilenetv2 network for 2 classification judgment of shooting or non-shooting, wherein if the current frame is a stable shooting frame facing to the test paper, the network outputs 1, and otherwise, 0 is output. After the network is carried out, frame _ result _ list with the same length as the frame _ list is obtained, and the list value is 1 or 0, which indicates whether the image frame at the corresponding position in the frame _ list is a stable shooting frame. And finally, analyzing and processing the list, wherein the method takes corresponding images with at least 5 continuous 1 s connected together in the list, and takes the at least 5 continuous 1 s as all frames of a page of data by considering instability and misjudgment of a model prediction process. After obtaining a plurality of frames of a plurality of test paper pages, determining the most clear and stable frame to be shot by calculating the definition of a plurality of image frames corresponding to the same test paper, and finally obtaining all the shot paper document image data of the test paper in the video. The partial model input image size is 224x224 and the training process randomly rotates and scales the image.
The chart detection section: this part describes how picture and table areas are detected in the page images obtained above. The goal is to determine the position of chart areas in the test paper image so that this content can be filtered when the OCR results are recombined and the layout restored. The detection model is a Faster R-CNN network with good detection performance; the original frame is resized to 224x224 as network input, and the network outputs the coordinates and class of every detected chart area. In post-processing, all coordinates in the method are converted back to coordinates relative to the original frame (as sketched below), so the needed regions can be cropped from the original image without distortion. During training, noise data are added to images at random to improve the model's generalization.
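The mapping back to original-frame coordinates is a plain rescaling; a sketch under the stated 224x224 input size, with the function name as an assumption:

```python
# Sketch: map a box predicted on the resized network input back to the
# original frame so the crop is taken without distortion.
def to_original_coords(box, frame_w, frame_h, net_size=224):
    sx, sy = frame_w / net_size, frame_h / net_size
    xmin, ymin, xmax, ymax = box
    return (xmin * sx, ymin * sy, xmax * sx, ymax * sy)
```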
A text region detection section: this part detects the extent of all text areas in the original frames of a document page, in order to narrow the search range for subsequent text line detection and to remove noise introduced by filming. The method uses the SSD algorithm; the training input size is 608x608 on original video frames, with random rotations of plus or minus 5 degrees during training. After the VGG backbone produces several feature maps, each is convolved with 1x1 and 3x3 kernels in turn to generate new maps; text regions are predicted independently on the 5 resulting feature maps, and a global NMS outputs the final predicted text region (a standard NMS sketch follows). In post-processing, the original frame is cropped according to the predicted coordinates, and the crop becomes the input of the next part, text line detection. During training, inputs are randomly shrunk and enlarged and subjected to blurring, contrast changes, and similar operations.
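The global NMS mentioned here is the standard IoU-based suppression; the sketch below is that textbook procedure, not the authors' implementation.

```python
# Sketch: standard non-maximum suppression over [xmin, ymin, xmax, ymax]
# boxes; keeps the highest-scoring box and drops heavy overlaps.
import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    order = np.argsort(scores)[::-1]
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        rest = order[1:]
        xx1 = np.maximum(boxes[i, 0], boxes[rest, 0])
        yy1 = np.maximum(boxes[i, 1], boxes[rest, 1])
        xx2 = np.minimum(boxes[i, 2], boxes[rest, 2])
        yy2 = np.minimum(boxes[i, 3], boxes[rest, 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = ((boxes[rest, 2] - boxes[rest, 0]) *
                 (boxes[rest, 3] - boxes[rest, 1]))
        iou = inter / (area_i + areas - inter + 1e-9)
        order = rest[iou <= iou_thresh]
    return keep
```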
The text line detection section: this part detects all text lines in the test paper image, taking the previous part's output as input. The method uses PixelLink, a pixel-segmentation algorithm for natural scenes that localizes targets by four points, so that even when the text area is tilted the detected four corner coordinates still pinpoint each line. Considering the aspect ratio of real test papers, the input image is resized to 1024x768. The backbone uses a basic VGG framework; after feature extraction, 1x1 convolution and upsampling are applied from back to front until each map matches the scale of the previous feature map, and continuous fusion yields the fused feature map used for text versus non-text prediction. On these features the network predicts text versus non-text per pixel and whether each pixel connects to its 8 neighbors; connected-component operations (OpenCV's MinAreaRect) then produce text connected domains of various sizes, noise is filtered out, and merging finally yields the text boxes. Once the position coordinates of the text line boxes are known, they are mapped back to the original video frame corresponding to the input image, and the line images are cropped from the frame as input for formula detection. If the text area is tilted, the four output corners form a general quadrilateral, and cropping by the axis-aligned bounding rectangle alone would distort the line image; this step therefore first applies a perspective transform to the four detected corners within the original frame, guaranteeing horizontally aligned line crops that serve as input to the formula detection and OCR recognition parts.
The formula detection part: this part detects whether a formula is present in each cropped text line picture. The goal is to crop formulas out of the line and recognize them separately, since formula recognition differs from conventional OCR and a formula cannot be recognized directly by a character recognition method. Formula detection uses an EAST network with an input size of 1280x192. Four convolution stages yield four feature maps, downsampled to 1/4, 1/8, 1/16, and 1/32 of the input size; upsampling, concatenation, and convolution then produce four fused maps in turn. After the last fusion, a 3x3x32 convolution yields the final feature map, from which a 1x1x1 convolution gives the text rotation angle and a 1x1x4 convolution gives the text blocks (a fusion-head sketch follows). The output is the four corner coordinates of each formula within the text line; post-processing crops the formula image at that position, the remainder of the line becomes the character recognition image, and both are fed to OCR recognition in the next step.
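A structural sketch of such a fusion head in Keras follows. The channel counts and activations are assumptions; the four scales, the 3x3x32 final convolution, and the 1x1x1 angle and 1x1x4 geometry heads follow the text above.

```python
# Sketch: EAST-style fusion of four backbone feature maps (1/4..1/32 of
# the input), ending in a 3x3x32 map with score, geometry and angle heads.
from tensorflow.keras import layers

def east_head(features):      # features: [f4, f8, f16, f32], fine to coarse
    x = features[-1]
    for f in reversed(features[:-1]):
        x = layers.UpSampling2D(2)(x)
        x = layers.Concatenate()([x, f])
        x = layers.Conv2D(64, 1, padding="same", activation="relu")(x)
        x = layers.Conv2D(64, 3, padding="same", activation="relu")(x)
    x = layers.Conv2D(32, 3, padding="same", activation="relu")(x)
    score = layers.Conv2D(1, 1, activation="sigmoid")(x)   # text/non-text
    geo   = layers.Conv2D(4, 1)(x)                         # text blocks
    angle = layers.Conv2D(1, 1)(x)                         # rotation angle
    return score, geo, angle
```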
An OCR recognition section: this part recognizes the characters in the text areas and formula areas. It contains two recognition engines: a conventional OCR engine for characters, digits, and the like, and an engine specialized for formulas. Both adopt a CNN + LSTM framework; formula recognition additionally uses an attention mechanism, while character recognition computes its loss with CTC (a sketch follows). In this method the character recognition model takes text line images of size 32x280 (height 32), with 5 to 15 characters per line. The image size fed from formula detection into formula recognition is not fixed, and formula images are taller than ordinary text lines. Formula recognition outputs LaTeX, which is rendered for display by the post-processing part.
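A skeletal CNN + LSTM recognizer with CTC loss, the framework named above for the character branch, could be sketched as follows; the layer sizes and the character-set size are assumptions, not the actual engine.

```python
# Sketch: CNN feature extractor, bidirectional LSTM over the width axis,
# and CTC loss (one extra class for the CTC blank symbol).
import tensorflow as tf
from tensorflow.keras import layers, Model

def build_recognizer(num_classes, height=32):
    inp = layers.Input(shape=(height, None, 1))       # grayscale text line
    x = layers.Conv2D(64, 3, padding="same", activation="relu")(inp)
    x = layers.MaxPooling2D((2, 2))(x)
    x = layers.Conv2D(128, 3, padding="same", activation="relu")(x)
    x = layers.MaxPooling2D((2, 1))(x)                # keep width steps
    x = layers.Lambda(lambda t: tf.reduce_mean(t, axis=1))(x)  # drop height
    x = layers.Bidirectional(layers.LSTM(256, return_sequences=True))(x)
    logits = layers.Dense(num_classes + 1)(x)         # +1 for CTC blank
    return Model(inp, logits)

def ctc_loss(labels, logits, label_len, logit_len):
    return tf.nn.ctc_loss(labels, logits, label_len, logit_len,
                          logits_time_major=False, blank_index=-1)
```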
Post-processing: the aim is to extract the paper document information of the test papers in the video, not merely transcribing the page content into electronic form but outputting the extraction result in the original layout of the page. Post-processing therefore takes the coordinate information from chart detection, text line detection, formula detection, and the OCR results, sorts the detected targets by Y coordinate first and X coordinate second (sketched below), inserts each formula recognition result at the position of its coordinates within its text line, and performs global optimization processing. Post-processing also accounts for duplicate test paper pages: OCR results are compared frame by frame, and the current page is filtered out when an identical recognition result appears.
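The Y-first, X-second ordering can be sketched as a naive row grouping; the tolerance and data layout below are assumptions for illustration.

```python
# Sketch: order recognized items top-to-bottom, then left-to-right
# within rows, to approximate the original page layout.
def reading_order(items, row_tol=10):
    """items: list of (box, text), box = (xmin, ymin, xmax, ymax)."""
    items = sorted(items, key=lambda it: it[0][1])        # by ymin (Y first)
    rows, current = [], []
    for it in items:
        if current and it[0][1] - current[-1][0][1] > row_tol:
            rows.append(current)                          # start a new row
            current = []
        current.append(it)
    if current:
        rows.append(current)
    ordered = []
    for row in rows:
        ordered.extend(sorted(row, key=lambda it: it[0][0]))  # X second
    return ordered
```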
A method for extracting paper document information of test paper in a video based on deep learning comprises the following steps:
step one, simulating training data: the extraction pipeline involves 7 different detection and recognition models, each trained separately and each requiring a large amount of training data. All training data are generated programmatically: for the detection models a background picture is selected at random, then text lines, charts, and formulas are drawn onto it according to the model in question, and the position coordinates of each addition are recorded. Each simulated picture shares its name with a label file that records the position coordinates of every formula, text line, or chart in the picture, stored in the form [xmin ymin xmax ymax], with multiple detection targets appended in the same form. For OCR training data, the label file stores the dictionary-table indices of the characters written into the picture; formula recognition labels are in LaTeX format. The data preparation process simulates more than 100 samples for each model.
Step two, data preprocessing: rich data augmentation greatly improves model generalization, so during training all data undergo varying degrees of random cropping, rotation, blurring, and similar operations. In the training data, OCR input images are 32 pixels high; document page extraction and chart detection inputs are 224x224; text line detection inputs are 1024x768; text region detection inputs are 608x608; and formula detection inputs are 1280x192. All images are normalized to the range -1 to 1. Training proceeds batch by batch, each batch sampled at random from the original images, and OCR input data are uniformly converted to grayscale.
Step three, training the neural networks: following the steps above, the MobileNetV2, Faster R-CNN, SSD, PixelLink, EAST, and OCR recognition models are trained in sequence. Training is end-to-end throughout, with the network hyper-parameters set as follows:
(1) learning rate: the initial learning rate is set to 0.01 and reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen per model according to how training behaves);
(3) other: the batch size is set to 8, varying with available GPU memory; the total number of training rounds is 200.
Step four, post-processing: each model is converted to a pb file and the pb files are chained in sequence, the output of one model serving as the input of the next; recognition results from duplicate test paper pages are filtered out and the rest are output verbatim. Through this series of steps, a test paper document video goes in and the extracted document information comes out.
Aimed at test paper images, the invention extracts the paper document information of test papers in a video by deep learning, automatically extracting document information from video-format test paper data and laying a foundation for building large-scale test paper databases.
In summary, the invention provides a method for extracting paper document information in a video based on deep learning. It mainly targets filmed video data covering common test papers such as mathematics, Chinese, and English, and automates information extraction from video analysis through to an electronic version of the test paper content. Extracting the paper document information of test papers in a video means filming a video of several ordinary test papers; the method automatically extracts the document information of every test paper in the video, converting the video data into an electronic version automatically. Tailored to the characteristics of this task, the method offers an integrated detection and recognition pipeline built from several deep-learning-based image processing techniques, a new one-stop automatic solution for extracting test paper document information from video. It extracts information well across different question types, in particular for papers whose questions contain charts and formulas. By integrating several existing deep learning techniques according to the characteristics of test paper documents, it conveniently analyzes test papers captured in video form and completely extracts the information in every test paper document in the video, simplifying the storage and transmission of that information and achieving end-to-end batch extraction: multiple test papers, filmed once, processed once.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (9)

1. A system and method for extracting the paper document information of test papers in a video based on deep learning, characterized in that: the system is based on deep learning and mainly comprises the steps of document page extraction, chart detection, text region detection, text line detection, formula detection, OCR recognition, and post-processing.
2. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein the main features are described as follows: the document page extraction algorithm analyzes the filmed page-turning video and extracts every distinct test paper page, the number of output pages matching the number of pages the user filmed; the chart detection step detects pictures and tables on each extracted page, locating chart regions so that their content can be filtered out during post-processing; the text region detection step locates the text region within each extracted page, filtering out noise regions caused by varying camera distance during filming and keeping only the document region to be extracted; the text line detection step detects every text line on the detected page, covering all line types that commonly appear on test papers, including Chinese and English; the formula detection step checks whether each text line contains a formula, providing the basis for subsequent character and formula recognition; the OCR recognition step is divided into character recognition and formula recognition and recognizes the characters or formulas in all input sequences; and the post-processing step integrates the chart detection and OCR results and reassembles the extracted document information for output.
3. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein document page extraction is described as follows: the filmed test paper video is decoded into individual frames, and a lightweight MobileNetV2 classification network judges whether each frame is a document page; a non-document frame is labeled 0 and a document frame 1; while filming, the photographer captures several different test papers, with stable shots of each page separated by non-page content, the stable duration per paper varying while the camera's fixed frame rate produces many frames per second; document page extraction therefore yields a sequence of consecutive 1s and 0s, such as [1111110000011111000000], giving the MobileNetV2 classification of every frame, from which it can be seen that this example video contains 2 test paper pages; analyzing the sequence yields the runs of 1s and their corresponding frame numbers, and the sharpest frame within each run is selected and output as the image of the current test paper; this step recovers the frame images of all filmed test paper pages in the video for subsequent analysis and information extraction.
4. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein chart detection is described as follows: this step analyzes the extracted images and uses a Faster R-CNN network to detect pictures and tables, obtaining the coordinates of chart regions within the test paper page and providing a reference for post-processing of the later recognition results.
5. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein text region detection is described as follows: an SSD detector locates the text region in each extracted image in order to filter out noise from non-text areas; because varying camera distance during filming introduces background clutter around the paper, detecting the text region narrows the image area processed by later steps, reduces the influence of noise data, and improves the accuracy of test paper document information extraction.
6. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein text line detection is described as follows: the natural-scene text detector PixelLink serves as the text line detection network, since test paper images in video are often tilted because of unstable filming and are hard to localize with conventional detectors; the method therefore uses PixelLink's four-point localization, so that even when a text region is tilted, four corner coordinates are detected for each text line and the lines can be aligned by perspective transform into standard input for subsequent OCR recognition; the text region detected in the previous step is taken as input, and all text lines detected by PixelLink are output as the result.
7. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein formula detection is described as follows: this step uses the EAST algorithm as the formula detection network, analyzing the image of each text line detected in the previous step to decide whether the line contains a formula, so that the formula area can be cropped out of the line for separate recognition; the step relies on EAST's multi-feature joint prediction, since a formula is a small target and predictions from feature maps at several scales are needed to ensure every formula region is detected; because the EAST network combines 4 feature maps of different scales when detecting targets, it can locate formulas within text lines accurately.
8. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein OCR recognition is described as follows: the goal of this step is to recognize all characters in the text lines and detected formulas, completing the extraction of document information from the test paper images; OCR recognition is split into character recognition and formula recognition, which use different algorithms and must be recognized separately; the position coordinates of the text and formula areas within each line are known from text line detection and formula detection, so the corresponding regions are cropped from the original image, text regions are fed to the character recognition engine and formula regions to the formula engine, and together the two branches recognize all characters and formulas on the paper.
9. The method for extracting paper document information in video based on deep learning as claimed in claim 1, wherein post-processing is described as follows: based on the results of chart detection, character recognition, and formula recognition, the recognition output is rearranged and emitted according to the layout of the original test paper, yielding the information finally extracted from the paper document; the method comprises the following specific steps:
step one, simulating training data: the extraction pipeline involves 7 different detection and recognition models, each trained separately and each requiring a large amount of training data; all training data are generated programmatically, with the detection-model data built by selecting a background picture at random, drawing text lines, charts, and formulas onto it according to the model in question, and recording the position coordinates of each addition; each simulated picture shares its name with a label file that records the position coordinates of every formula, text line, or chart in the picture, stored in the form [xmin ymin xmax ymax], with multiple detection targets appended in the same form; for OCR training data, the label file stores the dictionary-table indices of the characters written into the picture; formula recognition labels are in LaTeX format; the data preparation process simulates more than 100 samples for each model;
step two, data preprocessing: rich data augmentation greatly improves model generalization, so during training all data undergo varying degrees of random cropping, rotation, blurring, and similar operations; in the training data, OCR input images are 32 pixels high, document page extraction and chart detection inputs are 224x224, text line detection inputs are 1024x768, text region detection inputs are 608x608, and formula detection inputs are 1280x192; all images are normalized to the range -1 to 1; training proceeds batch by batch, each batch sampled at random from the original images, and OCR input data are uniformly converted to grayscale;
step three, training the neural networks: following the steps above, the MobileNetV2, Faster R-CNN, SSD, PixelLink, EAST, and OCR recognition models are trained in sequence; training is end-to-end throughout, with the network hyper-parameters set as follows:
(1) learning rate: the initial learning rate is set to 0.01 and reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen per model according to how training behaves);
(3) other: the batch size is set to 8, varying with available GPU memory; the total number of training rounds is 200;
step four, post-processing: each model is converted to a pb file and the pb files are chained in sequence, the output of one model serving as the input of the next; recognition results from duplicate test paper pages are filtered out and the rest are output verbatim; through this series of steps, a test paper document video goes in and the extracted document information comes out.
CN201910969725.4A, filed 2019-10-12 (priority 2019-10-12): System and method for extracting paper document information of test paper in video based on deep learning. Status: Withdrawn. Published as CN110765907A (en).

Priority Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910969725.4A | CN110765907A (en) | 2019-10-12 | 2019-10-12 | System and method for extracting paper document information of test paper in video based on deep learning

Applications Claiming Priority (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910969725.4A | CN110765907A (en) | 2019-10-12 | 2019-10-12 | System and method for extracting paper document information of test paper in video based on deep learning

Publications (1)

Publication Number | Publication Date
CN110765907A (en) | 2020-02-07

Family

ID=69331698

Family Applications (1)

Application Number | Publication | Priority Date | Filing Date | Title
CN201910969725.4A | CN110765907A (en) | 2019-10-12 | 2019-10-12 | System and method for extracting paper document information of test paper in video based on deep learning

Country Status (1)

Country Link
CN (1) CN110765907A (en)

Cited By (20)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111507351A (en) * 2020-04-16 2020-08-07 华南理工大学 Ancient book document digitalization method
CN111507351B (en) * 2020-04-16 2023-05-30 华南理工大学 Ancient book document digitizing method
CN111597956A (en) * 2020-05-12 2020-08-28 四川久远银海软件股份有限公司 Picture character recognition method based on deep learning model and relative orientation calibration
CN111898424A (en) * 2020-06-19 2020-11-06 贝壳技术有限公司 Character recognition model training method and device, electronic equipment and storage medium
CN111898424B (en) * 2020-06-19 2023-07-21 贝壳技术有限公司 Character recognition model training method and device, electronic equipment and storage medium
CN111768170A (en) * 2020-06-23 2020-10-13 北京字节跳动网络技术有限公司 Method and device for displaying operation correction result
CN111814798A (en) * 2020-07-14 2020-10-23 深圳中兴网信科技有限公司 Method for digitizing titles and readable storage medium
CN111860523A (en) * 2020-07-28 2020-10-30 上海兑观信息科技技术有限公司 Intelligent recording system and method for sound image file
CN111860523B (en) * 2020-07-28 2024-04-30 上海兑观信息科技技术有限公司 Intelligent recording system and method for sound image files
CN112149523A (en) * 2020-09-04 2020-12-29 开普云信息科技股份有限公司 Method and device for OCR recognition and picture extraction based on deep learning and co-searching algorithm, electronic equipment and storage medium
CN112149523B (en) * 2020-09-04 2021-05-28 开普云信息科技股份有限公司 Method and device for identifying and extracting pictures based on deep learning and parallel-searching algorithm
CN112418204A (en) * 2020-11-18 2021-02-26 杭州未名信科科技有限公司 Text recognition method, system and computer medium based on paper document
CN112347997A (en) * 2020-11-30 2021-02-09 广东国粒教育技术有限公司 Test question detection and identification method and device, electronic equipment and medium
CN112597878A (en) * 2020-12-21 2021-04-02 安徽七天教育科技有限公司 Sample making and identifying method for scanning test paper layout analysis
CN112651353B (en) * 2020-12-30 2024-04-16 南京红松信息技术有限公司 Target calculation positioning and identifying method based on custom label
CN112818824A (en) * 2021-01-28 2021-05-18 建信览智科技(北京)有限公司 Extraction method of non-fixed format document information based on machine learning
CN113610068B (en) * 2021-10-11 2022-07-08 江西风向标教育科技有限公司 Test question disassembling method, system, storage medium and equipment based on test paper image
CN113610068A (en) * 2021-10-11 2021-11-05 江西风向标教育科技有限公司 Test question disassembling method, system, storage medium and equipment based on test paper image
CN114357174A (en) * 2022-03-18 2022-04-15 北京创新乐知网络技术有限公司 Code classification system and method based on OCR and machine learning
CN115439871A (en) * 2022-09-13 2022-12-06 北京航星永志科技有限公司 Automatic file acquisition method and device and electronic equipment

Similar Documents

Publication Publication Date Title
CN110765907A (en) System and method for extracting paper document information of test paper in video based on deep learning
US10943346B2 (en) Multi-sample whole slide image processing in digital pathology via multi-resolution registration and machine learning
Diem et al. cBAD: ICDAR2017 competition on baseline detection
Burie et al. ICDAR2015 competition on smartphone document capture and OCR (SmartDoc)
CN113065558A (en) Lightweight small target detection method combined with attention mechanism
CN110032998B (en) Method, system, device and storage medium for detecting characters of natural scene picture
CN109960742B (en) Local information searching method and device
CN110781648A (en) Test paper automatic transcription system and method based on deep learning
CN110175609B (en) Interface element detection method, device and equipment
CN105608454A (en) Text structure part detection neural network based text detection method and system
CN113449727A (en) Camouflage target detection and identification method based on deep neural network
CN112819748B (en) Training method and device for strip steel surface defect recognition model
CN110443235B (en) Intelligent paper test paper total score identification method and system
CN112883926B (en) Identification method and device for form medical images
CN110929746A (en) Electronic file title positioning, extracting and classifying method based on deep neural network
CN112347997A (en) Test question detection and identification method and device, electronic equipment and medium
CN114140665A (en) Dense small target detection method based on improved YOLOv5
CN105678301B (en) method, system and device for automatically identifying and segmenting text image
CN112446259A (en) Image processing method, device, terminal and computer readable storage medium
CN116189162A (en) Ship plate detection and identification method and device, electronic equipment and storage medium
CN104484679B (en) Non- standard rifle shooting warhead mark image automatic identifying method
CN114596274A (en) Natural background citrus greening disease detection method based on improved Cascade RCNN network
CN114357206A (en) Education video color subtitle generation method and system based on semantic analysis
WO2023088079A1 (en) Display method and display system for plant disease diagnosis information, and readable storage medium
CN111680691B (en) Text detection method, text detection device, electronic equipment and computer readable storage medium

Legal Events

Code: Description
PB01: Publication
SE01: Entry into force of request for substantive examination
WW01: Invention patent application withdrawn after publication (application publication date: 20200207)