CN110705398A - Mobile-end-oriented test paper layout image-text real-time detection method - Google Patents

Mobile-end-oriented test paper layout image-text real-time detection method

Info

Publication number
CN110705398A
Authority
CN
China
Prior art keywords
mobile terminal
file
model
test paper
image
Prior art date
Legal status
Withdrawn
Application number
CN201910884273.XA
Other languages
Chinese (zh)
Inventor
严军峰
吕达
陈家海
叶家鸣
吴波
Current Assignee
Anhui Seven Days Education Technology Co Ltd
Original Assignee
Anhui Seven Days Education Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by Anhui Seven Days Education Technology Co Ltd filed Critical Anhui Seven Days Education Technology Co Ltd
Priority to CN201910884273.XA
Publication of CN110705398A
Status: Withdrawn

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 - Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 - Document-oriented image-based pattern recognition
    • G06V30/41 - Analysis of document content
    • G06V30/413 - Classification of content, e.g. text, photographs or tables
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 - Pattern recognition
    • G06F18/20 - Analysing
    • G06F18/21 - Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 - Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06V - IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 - Arrangements for image or video recognition or understanding
    • G06V10/40 - Extraction of image or video features

Abstract

The invention relates to the technical field of image target detection, and discloses a test paper layout image-text real-time detection method facing a mobile terminal. The system is designed based on the MobileNetV2 and PeleeNet network architectures and mainly comprises a simulation data generation part, a picture feature extraction part, a ResBlock module, and a joint prediction and post-processing part. According to the invention, test paper image data is obtained from mobile terminal equipment (by camera or photographing, etc.), and the text and picture areas in the image data are detected in real time by a tflite-format target detection model built into the mobile terminal. The whole pipeline, from input data to model output, is completed on the mobile terminal without network transmission, which saves data transmission time, gives the method high speed, low delay and high performance, and greatly improves the experience of mobile terminal users.

Description

Mobile-end-oriented test paper layout image-text real-time detection method
Technical Field
The invention relates to the technical field of image target detection, in particular to a test paper layout image-text real-time detection method facing a mobile terminal.
Background
Target detection is an important application field of image processing and is widely used in intelligent transportation, security, medical treatment, education and other fields. With the popularization of mobile devices, existing target detection models cannot be deployed on a mobile terminal for real-time detection because of their large parameter quantity, time-consuming forward propagation and other defects, so performing real-time target detection on the mobile terminal has become a focus. In test paper layout analysis, completing real-time detection of the images and texts in the test paper layout on mobile devices such as mobile phones has become a pressing demand, owing to the variety of users involved. The method provides a novel network architecture combining MobileNetV2 and PeleeNet; the network model has a small volume and parameter quantity and can be deployed on a mobile terminal for real-time detection of test paper layout images and texts.
At present, most test paper layout image-text detection is based on general target detection frameworks such as YOLO, SSD and Faster R-CNN. However, these models have a large parameter quantity and time-consuming forward propagation, while the volume, storage space and other resources of mobile terminal equipment in real scenes are limited; a deployed model must have few parameters, high precision and as small a resource footprint as possible. The mobile terminal is therefore not suitable for deploying general target detection models, and a target detection algorithm for mobile terminal operation needs to be specially developed and designed.
Disclosure of Invention
Technical problem to be solved
Aiming at the defects of the prior art, the invention provides a test paper layout image-text real-time detection method facing a mobile terminal, which solves the problem that existing test paper layout image-text detection models cannot be deployed on a mobile terminal for real-time detection because of their large parameter quantity, time-consuming forward propagation and other defects.
(II) technical scheme
In order to achieve the purpose, the invention provides the following technical scheme: a test paper layout image-text real-time detection method facing a mobile terminal is designed based on the MobileNetV2 and PeleeNet network architectures and mainly comprises a simulation data generation part, a picture feature extraction part, a ResBlock module, and a joint prediction and post-processing part.
Preferably, in the automatic generation of simulation data, a simulation program generates the required batch training data; the generated simulation data is in principle highly similar to real sample data, and by specifying the required total sample size the simulation program can automatically generate a variety of layout test papers containing various common styles.
Preferably, the picture feature extraction uses the lightweight MobileNetV2 to extract picture features for the real-time detection model deployed on the mobile terminal. Since MobileNetV2 is a lightweight network aimed at mobile terminals and the model can perform classification and detection tasks at the same time, using MobileNetV2 to extract picture features reduces the parameter quantity at the level of the network's main structure. The method improves on it by removing the last two conv2d and avgpool layers of the original network and keeping the output feature maps of the 3rd, 4th, 5th, 6th and 7th bottlenecks for subsequent fusion. The 5 retained feature maps have different sizes, correspond to predictions of objects of different sizes, and carry different texture, edge and other information. Meanwhile, the number of output channels of the last bottleneck block is reduced to 16, which reduces the computation of the last layer.
Preferably, the ResBlock draws on the Inception V3 and PeleeNet networks: before prediction, a ResBlock is constructed on each of the 5 extracted feature maps used for detection, in which the 3x3 convolution is replaced by two series-connected 1x3 and 3x1 convolutions. Following the Inception V3 idea, splitting a large two-dimensional convolution into two small one-dimensional convolutions saves a large number of parameters, accelerates the computation and mitigates overfitting; at the same time it adds a layer of nonlinearity that extends the model's expressive power, allowing it to process more and richer spatial features and increasing feature diversity. In addition, a MAX-POOL layer is added before the original 1x1 convolution (following the Inception V3 idea, adding the MAX-POOL layer can improve the detection effect), the original 1x1 convolution is used to reduce the number of channels, and a 5x5 convolution branch is added, bringing the block close to an Inception V3 network submodule. Finally, the 3 branch feature maps are concatenated along the channel dimension to output the feature map used for detection and classification.
Preferably, the joint prediction part performs the classification and detection tasks independently on the feature map behind each ResBlock, and a final global NMS outputs the detection results and target classes.
Preferably, the post-processing process is as follows. The network is implemented on TensorFlow, and training produces several model files ending in ckpt; these files are usually large and cannot be deployed directly on a mobile device, so the finally trained model is first converted into a pb file, which is only about 1/2 the size of the original model file, and the pb file is used to test the model's effect on pictures. Since the mobile terminal model uses tflite, the pb file is converted into a tflite file in the last step; this file is about 1/4 the size of the original model and is therefore very suitable for deployment on the mobile terminal. Before deployment, it is necessary to verify whether the converted tflite file has lost performance. The verification method, also called consistency verification, is as follows: input the same picture into the pb and tflite models for prediction (ensuring identical input) and compare whether the values output by the two models are consistent. Generally, the first six digits of the pb and tflite outputs agree and differences appear only from the seventh digit onwards, meaning the difference in the final result is negligible; therefore, in consistency verification, the two models are considered to have consistent performance as long as the first six digits of the compared values agree. With consistent performance and a smaller size, the tflite file model is suitable for deployment on the mobile terminal.
A test paper layout image-text real-time detection method facing a mobile terminal comprises the following specific steps:
S1: simulation training data: the method aims at real-time detection of images and texts in the test paper layout on a mobile terminal. In image-text detection, the text areas and picture areas in the test paper layout (in the invention, pictures and tables appearing in the test paper layout are collectively referred to as pictures) must be detected at the same time, and each detected area must be given a classification label marking it as a text or picture area. Therefore, when the simulation program generates a training sample, the label information of the data is recorded in a txt file with the same name as the picture. Each line of the txt file is stored in the form [xmin, ymin, xmax, ymax, label], where label is 0 or 1: 0 denotes a text area and 1 denotes a picture area. When simulating pictures, different test paper layouts are generated strictly according to the test paper layout standard, double-column layout image data is simulated with probability greater than 0.8, and the possible distribution ranges of pictures and text are taken into account.
S2: data preprocessing: integrate the image data of each simulated test paper layout and the corresponding label files into train.txt, test.txt and val.txt, randomly split and stored in the ratio 8:1:1. Each file contains, in order, the picture path, coordinate information and label information; each line represents one layout image together with the coordinates of all text and picture positions in the image and the corresponding labels.
S3: training a neural network: assemble the network structure according to the framework described above to obtain a new mobile-end-oriented test paper layout image-text detection algorithm, train it end to end, and set the network hyper-parameters as follows:
(1) learning rate: the initial learning rate is set to 0.01 and is reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen according to how model training progresses);
(3) other: the batch size is set to 8 (limited by GPU memory), and the total number of training rounds is 200;
S4: model prediction output: select the optimal model and convert it into a pb-format file, test its effect on the validation set with the pb model, convert the pb file into a tflite model file once the effect reaches the standard, verify the consistency of the pb and tflite files, and use the tflite file that passes verification as the model file finally deployed on the mobile terminal.
(III) advantageous effects
The invention provides a test paper layout image-text real-time detection method facing a mobile terminal, which has the following beneficial effects:
According to the invention, test paper image data is obtained from mobile terminal equipment (by camera or photographing, etc.), and the text and picture areas in the image data are detected in real time by a tflite-format target detection model built into the mobile terminal. The whole pipeline, from input data to model output, is completed on the mobile terminal without network transmission, which saves data transmission time, gives the method high speed, low delay and high performance, greatly improves the experience of mobile terminal users, and solves the problem that existing test paper image-text detection models cannot be deployed on a mobile terminal for real-time detection because of their large parameter quantity, time-consuming forward propagation and other defects.
Drawings
FIG. 1 is a flow chart of the overall implementation of the present invention;
FIG. 2 is a ResBlock structure diagram in the overall implementation flow of the present invention;
FIG. 3 is a block diagram of joint prediction in the overall implementation of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1-3, the present invention provides a technical solution: a test paper layout image-text real-time detection method for a mobile terminal, mainly comprising simulation data automatic generation, picture feature extraction, a ResBlock module, joint prediction and post-processing.
Specifically, in the automatic generation of simulation data, a simulation program generates the required batch training data, and the generated simulation data is in principle highly similar to real sample data. By specifying the total sample size required, the simulation program can automatically generate a variety of layout test papers covering various common styles.
Specifically, the simulation data automatic generation part: prepare a number of white background pictures as candidate backgrounds and other required data. First randomly determine a single-column or double-column format according to the test paper layout to be simulated, then write text line information from left to right and from top to bottom in sequence; the text lines are selected from a pre-prepared corpus, and several text lines form a text area. Meanwhile, pictures are filled into the range around the text areas with a certain probability, the filling position coordinates are recorded, and the text area and picture position information is written into the corresponding txt file. After the total number of samples is specified, the program starts to simulate training data; here 100 samples are simulated, of which 5 form the test set and 5 the validation set. The model is finally evaluated on the 5 validation samples with precision, recall and mAP as evaluation indexes, using the COCO API as the evaluation tool.
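A minimal evaluation sketch with the COCO API (pycocotools) is given below; the file names val_gt.json and detections.json are hypothetical placeholders for the validation ground truth and the exported detections in COCO JSON format:

    # Evaluate precision/recall/mAP with pycocotools (the COCO API named above).
    from pycocotools.coco import COCO
    from pycocotools.cocoeval import COCOeval

    coco_gt = COCO('val_gt.json')                 # validation ground truth (assumed path)
    coco_dt = coco_gt.loadRes('detections.json')  # model detections (assumed path)
    evaluator = COCOeval(coco_gt, coco_dt, iouType='bbox')
    evaluator.evaluate()
    evaluator.accumulate()
    evaluator.summarize()   # prints the AP/AR table, including mAP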
Specifically, the picture feature extraction part: feature extraction uses a convolutional neural network to extract from the original image the high-level features, such as textures and edges, that characterize pictures and text. The method extracts features with the MobileNetV2 network, which is lightweight and suitable for mobile terminal deployment. The method improves on it by removing the last two conv2d and avgpool layers of the original network and keeping the output feature maps of the 3rd, 4th, 5th, 6th and 7th bottlenecks for subsequent fusion; the sizes of these 5 bottleneck output feature maps are 28x28, 14x14, 14x14, 7x7 and 7x7. The method applies a 1x1 convolution with stride 2 after the 5th, 6th and 7th bottlenecks, so that the sizes of the 5 feature maps used for prediction become 28x28, 14x14, 7x7, 4x4 and 2x2, and the number of channels is reduced at the same time to cut the computation.
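A minimal sketch of such a truncated extractor with tf.keras is given below; the tap layer names come from the Keras MobileNetV2 implementation and only approximate the patent's bottleneck taps, and the 224x224 input size is an assumption:

    # Truncated MobileNetV2 backbone producing a 28/14/7/4/2 feature pyramid.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=(224, 224, 3), include_top=False, weights=None)
    taps = ['block_6_expand_relu',   # 28x28
            'block_13_expand_relu',  # 14x14
            'out_relu']              # 7x7
    feats = [base.get_layer(name).output for name in taps]
    # Stride-2 1x1 convolutions shrink the last map to 4x4 and then 2x2,
    # also reducing the channel count as described above.
    x = feats[-1]
    for _ in range(2):
        x = tf.keras.layers.Conv2D(16, 1, strides=2, padding='same',
                                   activation='relu')(x)
        feats.append(x)
    extractor = tf.keras.Model(base.input, feats)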
Specifically, the ResBlock part: drawing on the Inception V3 and PeleeNet networks, a ResBlock is constructed on each of the 5 feature maps extracted above for detection before prediction is performed, in which the 3x3 convolution is replaced by two series-connected 1x3 and 3x1 convolutions. Following the Inception V3 idea, splitting a large two-dimensional convolution into two small one-dimensional convolutions saves a large number of parameters, accelerates the computation and mitigates overfitting; at the same time it adds a layer of nonlinearity that extends the model's expressive power, allowing it to process more and richer spatial features and increasing feature diversity. In addition, a MAX-POOL layer is added before the original 1x1 convolution (following the Inception V3 idea, adding the MAX-POOL layer can improve the detection effect), the original 1x1 convolution is used to reduce the number of channels, and a 5x5 convolution branch is added, bringing the block close to an Inception V3 network submodule. Finally, the 3 branch feature maps are concatenated along the channel dimension to output the feature maps used for detection and classification. A minimal sketch of this block follows.
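A minimal Keras sketch of the three-branch block described above; the filter counts are illustrative assumptions:

    # ResBlock sketch: factorized 3x3, MAX-POOL + 1x1, and 5x5 branches.
    import tensorflow as tf
    from tensorflow.keras import layers

    def res_block(x, filters=64):
        # Branch 1: 3x3 factorized into series-connected 1x3 and 3x1 convolutions.
        b1 = layers.Conv2D(filters, (1, 3), padding='same', activation='relu')(x)
        b1 = layers.Conv2D(filters, (3, 1), padding='same', activation='relu')(b1)
        # Branch 2: MAX-POOL placed before the original channel-reducing 1x1 convolution.
        b2 = layers.MaxPooling2D(pool_size=3, strides=1, padding='same')(x)
        b2 = layers.Conv2D(filters, 1, padding='same', activation='relu')(b2)
        # Branch 3: the added 5x5 convolution branch.
        b3 = layers.Conv2D(filters, 5, padding='same', activation='relu')(x)
        # Concatenate the 3 branch feature maps along the channel dimension.
        return layers.Concatenate(axis=-1)([b1, b2, b3])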
Specifically, the joint prediction part takes the 5 feature maps obtained for predicting classes and regression boxes and, with reference to the SSD network, runs prediction on each of the 5 feature maps separately; a final global NMS then outputs the final detection results and target classes, as sketched below.
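A minimal sketch of the global NMS step; the thresholds are assumptions, and with the two classes here (text = 0, picture = 1) it would be applied per class and the kept boxes merged:

    # Global non-maximum suppression over the boxes pooled from all 5 feature maps.
    import tensorflow as tf

    def global_nms(boxes, scores, max_out=100, iou_thr=0.5, score_thr=0.3):
        # boxes: [N, 4] as (ymin, xmin, ymax, xmax); scores: [N] for one class.
        keep = tf.image.non_max_suppression(
            boxes, scores, max_output_size=max_out,
            iou_threshold=iou_thr, score_threshold=score_thr)
        return tf.gather(boxes, keep), tf.gather(scores, keep)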
Specifically, the post-processing process: the network is implemented on TensorFlow, and training produces several model files ending in ckpt; these files are usually large and cannot be deployed directly on a mobile device, so post-processing first converts the finally trained model into a pb file, which is only about 1/2 the size of the original model file. The pb file is then used to test the model's effect on pictures. Since the mobile terminal model uses tflite, the pb file is converted into a tflite file in the last step; this file is about 1/4 the size of the original model and is therefore very suitable for deployment on the mobile terminal. Before deployment, it is necessary to verify whether the converted tflite file has lost performance. The verification method, also called consistency verification, is as follows: input the same picture into the pb and tflite models for prediction (ensuring identical input) and compare whether the values output by the two models are consistent. Generally, the first six digits of the pb and tflite outputs agree and differences appear only from the seventh digit onwards, meaning the difference in the final result is negligible; therefore, in consistency verification, the two models are considered to have consistent performance as long as the first six digits of the compared values agree. With consistent performance and a smaller size, the tflite file model is suitable for deployment on the mobile terminal. A conversion and verification sketch follows.
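A minimal sketch of the pb-to-tflite conversion and the consistency check, in TF 1.x style (tf.compat.v1.lite in TF 2); the tensor names, file names and the random stand-in image are assumptions:

    # Convert the frozen pb graph to tflite, then run the tflite model on a picture.
    import numpy as np
    import tensorflow as tf

    converter = tf.lite.TFLiteConverter.from_frozen_graph(
        'model.pb', input_arrays=['input'], output_arrays=['detections'])
    open('model.tflite', 'wb').write(converter.convert())

    interp = tf.lite.Interpreter(model_path='model.tflite')
    interp.allocate_tensors()
    inp = interp.get_input_details()[0]
    out = interp.get_output_details()[0]
    image = np.random.rand(*inp['shape']).astype(np.float32)  # stand-in for a real test picture
    interp.set_tensor(inp['index'], image)
    interp.invoke()
    tflite_out = interp.get_tensor(out['index'])
    # pb_out would come from running the frozen graph on the same picture;
    # agreement in the first six digits is treated as consistent performance:
    # assert np.allclose(pb_out, tflite_out, rtol=1e-6)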
A test paper layout image-text real-time detection method facing a mobile terminal comprises the following specific steps:
S1: simulation training data: the method aims at real-time detection of images and texts in the test paper layout on a mobile terminal. In image-text detection, the text areas and picture areas in the test paper layout (in the invention, pictures and tables appearing in the test paper layout are collectively referred to as pictures) must be detected at the same time, and each detected area must be given a classification label marking it as a text or picture area. Therefore, when the simulation program generates a training sample, the label information of the data is recorded in a txt file with the same name as the picture. Each line of the txt file is stored in the form [xmin, ymin, xmax, ymax, label], where label is 0 or 1: 0 denotes a text area and 1 denotes a picture area (a sketch of writing such a label file follows). When simulating pictures, different test paper layouts are generated strictly according to the test paper layout standard, double-column layout image data is simulated with probability greater than 0.8, and the possible distribution ranges of pictures and text are taken into account.
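A minimal sketch of writing one such label file; the file names, box coordinates and the comma separator are illustrative assumptions (the source fixes only the field order [xmin, ymin, xmax, ymax, label]):

    # Write the label file that accompanies a simulated page image.
    # page_0001.txt shares its base name with page_0001.jpg (assumed names).
    boxes = [(35, 40, 560, 120, 0),   # text region   (label 0)
             (60, 150, 300, 330, 1)]  # picture region (label 1)
    with open('page_0001.txt', 'w') as f:
        for xmin, ymin, xmax, ymax, label in boxes:
            f.write(f'{xmin},{ymin},{xmax},{ymax},{label}\n')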
S2: data preprocessing: integrate the image data of each simulated test paper layout and the corresponding label files into train.txt, test.txt and val.txt, randomly split and stored in the ratio 8:1:1. Each file contains, in order, the picture path, coordinate information and label information; each line represents one layout image together with the coordinates of all text and picture positions in the image and the corresponding labels. A sketch of this split is given below.
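A minimal sketch of the 8:1:1 split, assuming an intermediate file all_samples.txt that lists every simulated sample one per line:

    # Randomly split the simulated samples into train / test / val at 8:1:1.
    import random

    with open('all_samples.txt') as f:   # assumed aggregate list of all samples
        lines = f.readlines()
    random.shuffle(lines)
    n = len(lines)
    splits = {'train.txt': lines[:int(0.8 * n)],
              'test.txt':  lines[int(0.8 * n):int(0.9 * n)],
              'val.txt':   lines[int(0.9 * n):]}
    for name, part in splits.items():
        with open(name, 'w') as f:
            f.writelines(part)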
S3: training a neural network: assemble the network structure according to the framework described above to obtain a new mobile-end-oriented test paper layout image-text detection algorithm, train it end to end, and set the network hyper-parameters as follows (see the sketch after this list):
(1) learning rate: the initial learning rate is set to 0.01 and is reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen according to how model training progresses);
(3) other: the batch size is set to 8 (limited by GPU memory), and the total number of training rounds is 200;
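A minimal sketch of this hyper-parameter setup in TensorFlow/Keras; steps_per_epoch is an assumption that depends on the dataset size and batch size:

    # Learning-rate schedule: 0.01 initially, reduced by 10% every 10 epochs.
    import tensorflow as tf

    steps_per_epoch = 1000   # assumed; dataset size / batch size (8)
    schedule = tf.keras.optimizers.schedules.ExponentialDecay(
        initial_learning_rate=0.01,
        decay_steps=10 * steps_per_epoch,  # every 10 training rounds
        decay_rate=0.9,                    # a 10% reduction
        staircase=True)
    optimizer = tf.keras.optimizers.Adam(learning_rate=schedule)  # or SGD
    # model.fit(train_ds, epochs=200, steps_per_epoch=steps_per_epoch, ...)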
S4: model prediction output: select the optimal model and convert it into a pb-format file, test its effect on the validation set with the pb model, convert the pb file into a tflite model file once the effect reaches the standard, verify the consistency of the pb and tflite files, and use the tflite file that passes verification as the model file finally deployed on the mobile terminal.
In use, training data are simulated: the method aims at real-time detection of images and texts in the test paper layout on a mobile terminal. In image-text detection, the text areas and picture areas in the test paper layout (in the invention, pictures and tables appearing in the test paper layout are collectively referred to as pictures) must be detected at the same time, and each detected area must be given a classification label marking it as a text or picture area. Therefore, when the simulation program generates a training sample, the label information of the data is recorded in a txt file with the same name as the picture; each line of the txt file is stored in the form [xmin, ymin, xmax, ymax, label], where label is 0 or 1: 0 denotes a text area and 1 denotes a picture area. When simulating pictures, different test paper layouts are generated strictly according to the test paper layout standard, double-column layout image data is simulated with probability greater than 0.8, and the possible distribution ranges of pictures and text are taken into account. Data preprocessing: integrate the image data of each simulated test paper layout and the corresponding label files into train.txt, test.txt and val.txt, randomly split and stored in the ratio 8:1:1; each file contains, in order, the picture path, coordinate information and label information, and each line represents one layout image together with the coordinates of all text and picture positions in the image and the corresponding labels. Training the neural network: assemble the network structure according to the framework described above to obtain a new mobile-end-oriented test paper layout image-text detection algorithm, train it end to end, and set the network hyper-parameters as follows. Learning rate: the initial learning rate is set to 0.01 and is reduced by 10% every 10 training rounds. Optimizer: Adam or SGD (chosen according to how model training progresses). Other: the batch size is set to 8 (limited by GPU memory), and the total number of training rounds is 200. Model prediction output: select the optimal model and convert it into a pb-format file, test its effect on the validation set with the pb model, convert the pb file into a tflite model file once the effect reaches the standard, verify the consistency of the pb and tflite files, and use the tflite file that passes verification as the model file finally deployed on the mobile terminal.
In summary, test paper image data is acquired from the mobile terminal device (by camera or photographing, etc.), and the text and picture areas in the image data are detected in real time by the tflite-format target detection model built into the mobile terminal. The whole pipeline, from input data to model output, is completed on the mobile terminal without network transmission, which saves data transmission time, gives the method high speed, low delay and high performance, greatly improves the experience of mobile terminal users, and solves the problem that existing test paper image-text detection models cannot be deployed on a mobile terminal for real-time detection because of their large parameter quantity, time-consuming forward propagation and other defects.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be appreciated by those skilled in the art that changes, modifications, substitutions and alterations can be made in these embodiments without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (7)

1. A test paper layout image-text real-time detection method facing a mobile terminal, characterized in that: the system is designed based on the MobileNetV2 and PeleeNet network architectures and mainly comprises a simulation data generation part, a picture feature extraction part, a ResBlock module, and a joint prediction and post-processing part.
2. The mobile-terminal-oriented test paper layout image-text real-time detection method according to claim 1, characterized in that: in the automatic generation of simulation data, a simulation program generates the required batch training data; the generated simulation data is in principle highly similar to real sample data, and by specifying the required total sample size the simulation program can automatically generate a variety of layout test papers containing various common styles.
3. The mobile-terminal-oriented test paper layout image-text real-time detection method according to claim 1, characterized in that: the picture feature extraction uses the lightweight MobileNetV2 to extract picture features for the real-time detection model deployed on the mobile terminal. Since MobileNetV2 is a lightweight network aimed at mobile terminals and the model can perform classification and detection tasks at the same time, using MobileNetV2 to extract picture features reduces the parameter quantity at the level of the network's main structure. The method improves on it by removing the last two conv2d and avgpool layers of the original network and keeping the output feature maps of the 3rd, 4th, 5th, 6th and 7th bottlenecks for subsequent fusion; the 5 retained feature maps have different sizes, correspond respectively to predictions of objects of different sizes, and carry different texture, edge and other information. Meanwhile, the number of output channels of the last bottleneck block is reduced to 16, which reduces the computation of the last layer.
4. The mobile-terminal-oriented test paper layout image-text real-time detection method according to claim 1, characterized in that: the ResBlock draws on the Inception V3 and PeleeNet networks; on each of the 5 feature maps extracted above for detection, a ResBlock is constructed before prediction is performed, in which the 3x3 convolution is replaced by two series-connected 1x3 and 3x1 convolutions. Following the Inception V3 idea, splitting a large two-dimensional convolution into two small one-dimensional convolutions saves a large number of parameters, accelerates the computation and mitigates overfitting, while also adding a layer of nonlinearity that extends the model's expressive power, allowing it to process more and richer spatial features and increasing feature diversity. Meanwhile, a MAX-POOL layer is newly added before the original 1x1 convolution (following the Inception V3 idea, adding the MAX-POOL layer can improve the detection effect), the original 1x1 convolution is used to reduce the number of channels, and a 5x5 convolution branch is newly added, bringing the block close to an Inception V3 network submodule; finally, the 3 branch feature maps are concatenated along the channel dimension to output the feature maps used for detection and classification.
5. The mobile-terminal-oriented test paper layout image-text real-time detection method according to claim 1, characterized in that: the joint prediction part performs the classification and detection tasks independently on the feature map behind each ResBlock block, and a final global NMS outputs the final detection results and target classes.
6. The mobile-terminal-oriented test paper layout image-text real-time detection method according to claim 1, characterized in that: the post-processing process is as follows: the network is implemented on TensorFlow, and training produces several model files ending in ckpt; these files are usually large and cannot be deployed directly on a mobile device. Post-processing first converts the finally trained model into a pb file, which at this point is only about 1/2 the size of the original model file, and the pb file is used to test the model's effect on pictures. Since the mobile terminal model uses tflite, the pb file is converted into a tflite file in the last step; this file is about 1/4 the size of the original model and is therefore very suitable for deployment on the mobile terminal. Before deployment, it is necessary to verify whether the converted tflite file has lost performance. The verification method, also called consistency verification, is as follows: input the same picture into the pb and tflite models for prediction (ensuring identical input) and compare whether the values output by the two models are consistent. Generally, the first six digits of the pb and tflite outputs agree and differences appear only from the seventh digit onwards, meaning the difference in the final result is negligible; therefore, in consistency verification, the two models are considered to have consistent performance as long as the first six digits of the compared values agree, and with consistent performance and a smaller size the tflite file model is suitable for deployment on the mobile terminal.
7. A test paper layout image-text real-time detection method facing a mobile terminal is characterized in that: the method comprises the following specific steps:
S1: simulation training data: the method aims at real-time detection of images and texts in the test paper layout on a mobile terminal; therefore, when the simulation program generates a training sample, the label information of the data is recorded in a txt file with the same name as the picture. Each line of the txt file is stored in the form [xmin, ymin, xmax, ymax, label], where label is 0 or 1: 0 denotes a text area and 1 denotes a picture area. When simulating pictures, different test paper layouts are generated strictly according to the test paper layout standard, double-column layout image data is simulated with probability greater than 0.8, and the possible distribution ranges of pictures and text are taken into account;
S2: data preprocessing: integrate the image data of each simulated test paper layout and the corresponding label files into train.txt, test.txt and val.txt, randomly split and stored in the ratio 8:1:1; each file contains, in order, the picture path, coordinate information and label information, and each line represents one layout image together with the coordinates of all text and picture positions in the image and the corresponding labels;
S3: training a neural network: assemble the network structure according to the framework described above to obtain a new mobile-end-oriented test paper layout image-text detection algorithm, train it end to end, and set the network hyper-parameters as follows:
(1) learning rate: the initial learning rate is set to 0.01 and is reduced by 10% every 10 training rounds;
(2) optimizer: Adam or SGD (chosen according to how model training progresses);
(3) other: the batch size is set to 8 (limited by GPU memory), and the total number of training rounds is 200;
S4: model prediction output: select the optimal model and convert it into a pb-format file, test its effect on the validation set with the pb model, convert the pb file into a tflite model file once the effect reaches the standard, verify the consistency of the pb and tflite files, and use the tflite file that passes verification as the model file finally deployed on the mobile terminal.
CN201910884273.XA 2019-09-19 2019-09-19 Mobile-end-oriented test paper layout image-text real-time detection method Withdrawn CN110705398A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910884273.XA CN110705398A (en) 2019-09-19 2019-09-19 Mobile-end-oriented test paper layout image-text real-time detection method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910884273.XA CN110705398A (en) 2019-09-19 2019-09-19 Mobile-end-oriented test paper layout image-text real-time detection method

Publications (1)

Publication Number Publication Date
CN110705398A true CN110705398A (en) 2020-01-17

Family

ID=69194859

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910884273.XA Withdrawn CN110705398A (en) 2019-09-19 2019-09-19 Mobile-end-oriented test paper layout image-text real-time detection method

Country Status (1)

Country Link
CN (1) CN110705398A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428656A (en) * 2020-03-27 2020-07-17 信雅达系统工程股份有限公司 Mobile terminal identity card identification method based on deep learning and mobile device
CN113610068A (en) * 2021-10-11 2021-11-05 江西风向标教育科技有限公司 Test question disassembling method, system, storage medium and equipment based on test paper image

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20200117