CN109948609A - Intelligent exam-grading localization method based on deep learning - Google Patents
Intelligent exam-grading localization method based on deep learning
- Publication number
- CN109948609A (application CN201910168207.2A)
- Authority
- CN
- China
- Prior art keywords
- bounding box
- picture
- examination question
- deep learning
- localization method
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Landscapes
- Image Analysis (AREA)
Abstract
The present invention discloses an intelligent exam-grading localization method based on deep learning. Taking the localization of oral arithmetic exercises as an example, the method comprises: photographing a number of oral-arithmetic exercise sheets with a mobile phone and dividing the pictures into a training set and a test set; marking the position of every oral arithmetic exercise in each training picture with a Bounding Box using the labelImg annotation tool; generating xml files and converting them into txt files; modifying the YOLOv3 algorithm by adding a class for primary-school oral arithmetic exercises, and training on the data set; saving the weights after training, testing the pictures in the test set, and regressing the Bounding Box of each oral arithmetic exercise in every picture, thereby localizing the exercises. The present invention localizes the several oral arithmetic exercises within a picture with high accuracy, reducing the workload of manual grading.
Description
Technical field
The present invention relates to an intelligent exam-grading localization method based on deep learning, and belongs to the field of computer vision and image processing.
Background art
With the continuous development of information technology, the informatization of education has made remarkable progress. For teachers, grading large numbers of exam papers is extremely tedious and time-consuming work, and complete correctness cannot be guaranteed. With the wide application of artificial intelligence across industries, AI can help teachers grade papers, reducing their workload while guaranteeing grading accuracy. At present, picture recognition in computer vision usually relies on OCR text-recognition systems, but OCR cannot recognize a picture region by region. When facing a paper containing many questions, OCR can recognize the text accurately, yet with a large amount of text in one picture the recognized text is easily jumbled, because OCR can only recognize the whole picture at once and thus cannot tell the questions apart. Therefore, before applying OCR text recognition, the questions on a paper must first be localized.
Summary of the invention
The purpose of the present invention is to provide an intelligent exam-grading localization method based on deep learning, solving the problem that OCR text-recognition systems cannot automatically recognize a picture region by region.
The intelligent exam-grading localization method based on deep learning comprises the following steps:
1) Create the data set, as follows:
11) Obtain jpg-format pictures of several test papers by photographing, each picture containing several questions; divide the pictures into a training set and a test set.
12) Using the labelImg annotation tool, annotate every picture in the training set with Bounding Boxes, marking the specific location of each question in the picture.
13) After annotation, generate an xml-format file for each jpg-format picture.
14) Convert the xml files into txt-format label files. Each label file stores five values for the bounding box of each question in the picture: the first value is the class number of the question; the second is the normalized x coordinate of the bounding-box center; the third is the normalized y coordinate of the center; the fourth is the normalized width; and the fifth is the normalized height. The bounding box is the box of the Bounding Box annotation that encloses the question.
2) Localize each question with the deep-learning-based YOLOv3 algorithm, as follows:
21) Predict the bounding box of each question.
22) Classify the predicted bounding boxes with multi-label classification into two classes, 0 and 1; boxes labeled 0 are removed from the picture and boxes labeled 1 are kept.
23) Predict boxes at three different scales with the YOLOv3 algorithm.
24) Build the Darknet-53 network and extract features.
3) Feed the training set into the Darknet-53 network for training, output the location of each question, and automatically generate the weight file when training finishes.
4) Test the pictures in the test set and output the Bounding Box that precisely localizes each question in each test picture.
In the aforesaid step 11), the training set and test set are divided according to a set ratio.
In the aforesaid step 14), the format conversion is performed with the voc_label.py file of the YOLOv3 algorithm.
After the aforesaid format conversion, the label file of each picture is placed in the same folder as the original jpg-format picture.
In the aforesaid step 21), the bounding-box predictions are as follows:
bx = σ(tx) + cx (1)
by = σ(ty) + cy (2)
bw = pw·e^tw (3)
bh = ph·e^th (4)
where (bx, by) is the center coordinate of the predicted bounding box, bw is its width and bh its height; σ is the logistic (sigmoid) function, and the coordinate predictions are trained with a squared-error loss; (tx, ty) are the predicted center offsets and tw, th the predicted width and height terms; (cx, cy) is the offset of the grid cell from the top-left corner of the picture; and pw, ph are the width and height of the prior bounding box.
During the aforesaid bounding-box prediction, logistic regression predicts an objectness score for each bounding box; the score is expressed as the overlap ratio, and if the overlap ratio does not reach the set threshold, the predicted bounding box is ignored. The overlap ratio is the ratio of overlap between the predicted bounding box and the ground-truth bounding box.
The aforesaid threshold is taken as 0.5.
In the aforesaid step 3), during training, the weights are saved every 100 iterations while the iteration count is below 1000, and every 10000 iterations once it exceeds 1000.
The beneficial effects obtained by the present invention are as follows:
The present invention creatively establishes a deep-learning model and applies it to AI-assisted grading and localization. The features of oral arithmetic exercises are extracted by training the YOLOv3 algorithm, achieving their precise localization.
The present invention needs only a photograph of an exercise sheet to precisely localize the several oral arithmetic exercises in the picture, overcoming the inability of OCR text-recognition systems to recognize region by region. The test results have high accuracy and robustness, laying a good foundation for intelligent exam grading.
Brief description of the drawings
Fig. 1 is the flow chart of the method of the present invention;
Fig. 2 is the YOLOv3 network structure used in the present invention;
Fig. 3 is an oral-arithmetic exercise picture from the test set;
Fig. 4 is the result of localizing the oral arithmetic exercises of Fig. 3 after testing.
Specific embodiment
The invention is further described below. The following embodiments are only intended to clearly illustrate the technical solution of the present invention and are not intended to limit its protection scope.
The present invention is implemented on the Python platform. Referring to Fig. 1, the method mainly includes the following steps:
Step 1: Create the data set
The invention proposes an intelligent exam-grading localization method based on deep learning that adopts supervised learning. In every field, deep learning cannot advance without the development of data sets, so building a good data set is a prerequisite for deep learning. To better train the deep-learning network, the present invention takes the localization of oral arithmetic exercises as an example and creates a small oral-arithmetic data set. The creation process is as follows:
11) Photograph several primary-school oral-arithmetic exercise sheets and divide them into a training set and a test set.
For example, 120 high-definition pictures of primary-school oral-arithmetic exercise sheets were taken with a mobile phone, 90 of which served as the training set and the remaining 30 as the test set.
12) Using the labelImg annotation tool, annotate every picture in the training set with Bounding Boxes, marking the specific location of each oral arithmetic exercise so that each exercise is enclosed in one Bounding Box, which facilitates the training of the YOLOv3 network. When annotating the positions of the exercises in a picture, the exercises lie close together, so their positions must be outlined precisely; this provides a good precondition for training YOLOv3 and, as testing shows, yields good accuracy and robustness.
13) After each oral arithmetic exercise in a picture has been precisely annotated, one xml file is generated per picture: the jpg-format picture produces an xml-format file, e.g. the original picture "88.jpg" produces the file "88.xml".
The present invention realizes the localization with the deep-learning-based YOLOv3 algorithm (You Only Look Once). Running the voc_label.py file of the YOLOv3 algorithm converts the generated xml files into txt-format label files suitable for YOLOv3 network training. The label file (txt format) of every picture is placed in the same folder as the original picture (jpg format). Each label file (txt format) stores five values for the bounding box of every oral arithmetic exercise on the original picture, e.g.:
(1, 0.314706855353, 0.318333223444, 0.273583234244, 0.12).
A bounding box is the box enclosing one target (one oral arithmetic exercise). Suppose the bottom-left corner of a bounding box has coordinates (x1, y1), the top-right corner (x2, y2), and the width and height are w and h. Of the five values, the first is the class number of the picture, which may be user-defined; the second is the normalized x coordinate of the box center; the third is the normalized y coordinate of the center; the fourth is the normalized width of the target box; and the fifth is the normalized height of the target box, the center being the center of the bounding box.
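The normalization just described can be sketched as a small helper, modeled on what the voc_label.py script does (the function name and argument order here are illustrative, not the script's actual API):

```python
# Illustrative sketch of the xml -> txt label conversion described above:
# map absolute corner coordinates of an annotated box to the five normalized
# values (class, x_center, y_center, width, height) stored in the txt label.
def voc_box_to_yolo(img_w, img_h, xmin, ymin, xmax, ymax, class_id=1):
    """Normalize a box given by its corners against the picture size."""
    x_center = (xmin + xmax) / 2.0 / img_w
    y_center = (ymin + ymax) / 2.0 / img_h
    width = (xmax - xmin) / img_w
    height = (ymax - ymin) / img_h
    return (class_id, x_center, y_center, width, height)
```

For a 100 × 200 picture, a box with corners (10, 20) and (50, 120) yields the line (1, 0.3, 0.35, 0.4, 0.5), matching the five-value format shown above.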
Step 2: Localize each oral arithmetic exercise with the deep-learning-based YOLOv3 algorithm
The present invention modifies the deep-learning-based YOLOv3 algorithm by adding a class for primary-school oral arithmetic exercises.
21) Predict the bounding boxes
The YOLOv3 algorithm obtains anchor boxes by clustering. For each bounding box, YOLOv3 predicts four coordinate values tx, ty, tw, th, where (tx, ty) gives the center offset and tw, th the width and height terms.
For the cell making the prediction (the picture is divided into an S × S grid of cells), given the offset (cx, cy) of that cell from the top-left corner of the picture and the width and height pw, ph of the prior bounding box, the bounding box is predicted according to the following formulas:
bx = σ(tx) + cx (1)
by = σ(ty) + cy (2)
bw = pw·e^tw (3)
bh = ph·e^th (4)
bx, by, bw, bh are the center coordinates and size of the predicted bounding box; σ is the logistic (sigmoid) function.
During training, bx, by, bw, bh use a sum-of-squared-error loss, so the error can be computed quickly.
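The box equations above can be checked with a minimal, framework-free decode function (a sketch; the function names are illustrative):

```python
import math

def sigmoid(t):
    """Logistic function sigma used in equations (1) and (2)."""
    return 1.0 / (1.0 + math.exp(-t))

def decode_box(tx, ty, tw, th, cx, cy, pw, ph):
    """Map raw network outputs (tx, ty, tw, th) to a box (bx, by, bw, bh)
    in grid units, given the cell offset (cx, cy) and prior size (pw, ph)."""
    bx = sigmoid(tx) + cx   # center x, constrained to lie inside the cell
    by = sigmoid(ty) + cy   # center y
    bw = pw * math.exp(tw)  # width scales the prior box
    bh = ph * math.exp(th)  # height scales the prior box
    return bx, by, bw, bh
```

Because σ maps tx, ty into (0, 1), the predicted center can never leave the cell responsible for the prediction, which is the point of equations (1) and (2).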
YOLOv3 predicts an objectness score for each bounding box by logistic regression; the score is the overlap ratio between the predicted box and the ground-truth box. If a predicted bounding box overlaps the ground-truth box over a large area and does so better than every other predicted box, its target value is 1. If the overlap ratio does not reach a threshold (set here to 0.5), the predicted bounding box is ignored and contributes no loss value.
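The overlap ratio referred to above is the standard intersection-over-union (IoU); a minimal sketch for axis-aligned boxes given as corner coordinates (x1, y1, x2, y2):

```python
# Intersection-over-union between two axis-aligned boxes, each given as
# (x1, y1, x2, y2) with (x1, y1) the top-left and (x2, y2) the bottom-right.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Intersection rectangle (empty if the boxes do not overlap)
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    return inter / (area_a + area_b - inter)
```

A prediction whose IoU with the ground-truth box falls below the 0.5 threshold would be ignored under the rule described above.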
22) Classify the bounding boxes
Each bounding box receives a class prediction (class 0 or 1): boxes predicted as 0 are removed from the picture and boxes predicted as 1 are kept. The boxes use multi-label classification: the YOLOv3 algorithm classifies with simple logistic regressions, trained with a binary cross-entropy loss function.
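The binary cross-entropy loss mentioned above can be written out for a single prediction (a scalar sketch; YOLOv3 itself applies it per class over whole batches):

```python
import math

# Binary cross-entropy for one logistic classifier output:
# p is the predicted probability, y the 0/1 ground-truth label.
def binary_cross_entropy(p, y, eps=1e-12):
    p = min(max(p, eps), 1.0 - eps)  # clamp for numerical stability
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))
```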
23) Prediction across scales
The network predicts boxes at three different scales. YOLOv3 extracts features with a feature pyramid network: several convolutional layers are added to the base feature extractor, and the last of them predicts a three-dimensional tensor encoding the Bounding Box, objectness, and class predictions. The bounding box priors are determined by k-means clustering: nine clusters and three scales are selected, and the clusters are then divided evenly across the scales.
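The prior selection above can be sketched with a plain k-means over the annotated box sizes. Note an assumption: YOLOv3 actually clusters with a 1 − IoU distance, while this sketch uses squared Euclidean distance on (width, height) pairs to stay short:

```python
import random

def kmeans_priors(sizes, k=9, iters=50, seed=0):
    """Cluster (w, h) box sizes into k priors, then split them evenly
    across three scales, smallest-area priors on the finest scale."""
    rng = random.Random(seed)
    centers = rng.sample(sizes, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for w, h in sizes:
            # Assign each box to the nearest center (squared Euclidean)
            i = min(range(k), key=lambda j: (w - centers[j][0]) ** 2 + (h - centers[j][1]) ** 2)
            groups[i].append((w, h))
        # Recompute each center as the mean of its group (keep empty ones)
        centers = [
            (sum(w for w, _ in g) / len(g), sum(h for _, h in g) / len(g)) if g else c
            for g, c in zip(groups, centers)
        ]
    centers.sort(key=lambda c: c[0] * c[1])  # order by area
    return [centers[i:i + k // 3] for i in range(0, k, k // 3)]
```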
24) Feature extraction
The feature-extraction model of YOLOv3 blends several models: it draws on YOLOv2 and Darknet-19 and adds residual networks. The model uses the well-performing 3 × 3 and 1 × 1 convolutional layers and adds shortcut-connection structures. It ends up with 53 convolutional layers and is therefore named the Darknet-53 network. Referring to Fig. 2, Darknet-53 denotes the 53 convolutional layers; in practice the network spans 74 layers in total (counted from the network structure printed at run time). The detection layers are responsible for prediction at a given scale: layers 82, 94, and 106 predict on 13 × 13, 26 × 26, and 52 × 52 grids respectively, and each grid cell predicts the regression values of 3 boxes, comprising coordinates, objectness, and classes, for 3 × (4 + 1 + 20) = 75 values in total. A route layer with two parameters concatenates the two named layers (e.g. layer 86 concatenates layers 85 and 61); with one parameter, the route layer is identical to the named layer (e.g. layer 83 matches layer 79). A shortcut layer adds the output of its parameter layer to the current layer. The Darknet-53 network applies the skip-connection style of residual networks; its performance exceeds that of the ResNet-152 and ResNet-101 networks because the basic network unit differs: it has fewer layers, fewer parameters, and requires less computation. Darknet-53 achieves the highest floating-point operations per second, which shows that the network structure uses the GPU more effectively.
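The shortcut-connection pattern of the Darknet-53 residual unit can be shown structurally; plain functions stand in for the 1 × 1 and 3 × 3 convolutions (an illustrative sketch, not the actual Darknet implementation):

```python
# Structural sketch of a Darknet-53 residual unit: a 1x1 convolution
# followed by a 3x3 convolution, with a shortcut (skip) connection adding
# the block input back onto the block output. Plain callables stand in
# for the convolutions so the sketch stays framework-free.
def residual_block(x, conv1x1, conv3x3):
    """Apply the two stand-in convolutions and add the shortcut connection."""
    out = conv3x3(conv1x1(x))
    return [a + b for a, b in zip(x, out)]  # shortcut: elementwise sum
```

The elementwise sum is what a Darknet `shortcut` layer computes between its parameter layer and the current layer.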
Step 4: Feed the training set into the Darknet-53 network for training and output the location of every oral arithmetic exercise; the weight file is generated automatically when training finishes. While the iteration count is below 1000, the weights are saved every 100 iterations; once it exceeds 1000, they are saved every 10000 iterations.
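The checkpoint schedule above amounts to a simple rule (an illustrative helper, not part of the original code):

```python
# Save every 100 iterations below 1000, every 10000 iterations afterwards,
# mirroring the checkpoint schedule described in the text.
def should_save(iteration):
    if iteration < 1000:
        return iteration % 100 == 0
    return iteration % 10000 == 0
```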
Step 5: Test the pictures in the test set and output the Bounding Box that precisely localizes each oral arithmetic exercise in each test picture, completing the localization of the multiple exercises on one picture. Fig. 3 is a test picture of oral arithmetic exercises, and Fig. 4 is the localization result it yields after testing.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make several improvements and variations without departing from the technical principles of the invention, and such improvements and variations shall also fall within the protection scope of the present invention.
Claims (8)
1. An intelligent exam-grading localization method based on deep learning, characterized by comprising the following steps:
1) create the data set, as follows:
11) obtain jpg-format pictures of several test papers by photographing, each picture containing several questions, and divide the pictures into a training set and a test set;
12) using the labelImg annotation tool, annotate every picture in the training set with Bounding Boxes, marking the specific location of each question in the picture;
13) after annotation, generate an xml-format file for each jpg-format picture;
14) convert the xml files into txt-format label files, each label file storing five values for the bounding box of each question in the picture: the first value is the class number of the question, the second is the normalized x coordinate of the bounding-box center, the third is the normalized y coordinate of the center, the fourth is the normalized width, and the fifth is the normalized height, the bounding box being the box of the Bounding Box annotation enclosing the question;
2) localize each question with the deep-learning-based YOLOv3 algorithm, as follows:
21) predict the bounding box of each question;
22) classify the predicted bounding boxes with multi-label classification into two classes, 0 and 1, boxes labeled 0 being removed from the picture and boxes labeled 1 being kept;
23) predict boxes at three different scales with the YOLOv3 algorithm;
24) build the Darknet-53 network and extract features;
3) feed the training set into the Darknet-53 network for training, output the location of each question, and automatically generate the weight file when training finishes;
4) test the pictures in the test set, and output the Bounding Box that precisely localizes each question in each test picture.
2. The intelligent exam-grading localization method based on deep learning according to claim 1, characterized in that in step 11), the training set and test set are divided according to a set ratio.
3. The intelligent exam-grading localization method based on deep learning according to claim 1, characterized in that in step 14), the format conversion is performed with the voc_label.py file of the YOLOv3 algorithm.
4. The intelligent exam-grading localization method based on deep learning according to claim 3, characterized in that after the format conversion, the label file of each picture is placed in the same folder as the original jpg-format picture.
5. The intelligent exam-grading localization method based on deep learning according to claim 1, characterized in that in step 21), the bounding-box predictions are as follows:
bx = σ(tx) + cx (1)
by = σ(ty) + cy (2)
bw = pw·e^tw (3)
bh = ph·e^th (4)
where (bx, by) is the center coordinate of the predicted bounding box, bw is its width and bh its height; σ is the logistic (sigmoid) function, and the coordinate predictions are trained with a squared-error loss; (tx, ty) are the predicted center offsets and tw, th the predicted width and height terms; (cx, cy) is the offset of the grid cell from the top-left corner of the picture; and pw, ph are the width and height of the prior bounding box.
6. The intelligent exam-grading localization method based on deep learning according to claim 5, characterized in that during the bounding-box prediction, logistic regression predicts an objectness score for each bounding box, the score being expressed as the overlap ratio; if the overlap ratio does not reach the set threshold, the predicted bounding box is ignored; the overlap ratio is the ratio of overlap between the predicted bounding box and the ground-truth bounding box.
7. The intelligent exam-grading localization method based on deep learning according to claim 6, characterized in that the threshold is taken as 0.5.
8. The intelligent exam-grading localization method based on deep learning according to claim 1, characterized in that in step 3), during training, the weights are saved every 100 iterations while the iteration count is below 1000, and every 10000 iterations once it exceeds 1000.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910168207.2A CN109948609A (en) | 2019-03-06 | 2019-03-06 | Intelligently reading localization method based on deep learning |
Publications (1)
Publication Number | Publication Date |
---|---|
CN109948609A true CN109948609A (en) | 2019-06-28 |
Family
ID=67009181
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910168207.2A Withdrawn CN109948609A (en) | 2019-03-06 | 2019-03-06 | Intelligently reading localization method based on deep learning |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109948609A (en) |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110751138A (en) * | 2019-09-09 | 2020-02-04 | 浙江工业大学 | Pan head identification method based on yolov3 and CNN |
CN110781888A (en) * | 2019-10-25 | 2020-02-11 | 北京字节跳动网络技术有限公司 | Method and device for regressing screen in video picture, readable medium and electronic equipment |
CN111242131A (en) * | 2020-01-06 | 2020-06-05 | 北京十六进制科技有限公司 | Method, storage medium and device for image recognition in intelligent marking |
CN111242131B (en) * | 2020-01-06 | 2024-05-10 | 北京十六进制科技有限公司 | Method, storage medium and device for identifying images in intelligent paper reading |
CN112132143A (en) * | 2020-11-23 | 2020-12-25 | 北京易真学思教育科技有限公司 | Data processing method, electronic device and computer readable medium |
CN112597878A (en) * | 2020-12-21 | 2021-04-02 | 安徽七天教育科技有限公司 | Sample making and identifying method for scanning test paper layout analysis |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109948609A (en) | Intelligently reading localization method based on deep learning | |
CN110399905B (en) | Method for detecting and describing wearing condition of safety helmet in construction scene | |
CN107391703B (en) | The method for building up and system of image library, image library and image classification method | |
CN104573669B (en) | Image object detection method | |
CN104463101B (en) | Answer recognition methods and system for character property examination question | |
Gu et al. | A new deep learning method based on AlexNet model and SSD model for tennis ball recognition | |
CN108520273A (en) | A kind of quick detection recognition method of dense small item based on target detection | |
CN110472642A (en) | Fine granularity Image Description Methods and system based on multistage attention | |
CN107945153A (en) | A kind of road surface crack detection method based on deep learning | |
CN108629367A (en) | A method of clothes Attribute Recognition precision is enhanced based on depth network | |
CN107563439A (en) | A kind of model for identifying cleaning food materials picture and identification food materials class method for distinguishing | |
CN110532920A (en) | Smallest number data set face identification method based on FaceNet method | |
CN105260738A (en) | Method and system for detecting change of high-resolution remote sensing image based on active learning | |
CN110096711A (en) | The natural language semantic matching method of the concern of the sequence overall situation and local dynamic station concern | |
CN105701502A (en) | Image automatic marking method based on Monte Carlo data balance | |
CN104680144A (en) | Lip language recognition method and device based on projection extreme learning machine | |
CN104794455B (en) | A kind of Dongba pictograph recognition methods | |
CN106570521A (en) | Multi-language scene character recognition method and recognition system | |
CN105205449A (en) | Sign language recognition method based on deep learning | |
CN105095863A (en) | Similarity-weight-semi-supervised-dictionary-learning-based human behavior identification method | |
CN105184298A (en) | Image classification method through fast and locality-constrained low-rank coding process | |
CN104298974A (en) | Human body behavior recognition method based on depth video sequence | |
CN110287806A (en) | A kind of traffic sign recognition method based on improvement SSD network | |
CN105740908B (en) | Classifier design method based on kernel space self-explanatory sparse representation | |
CN114842208A (en) | Power grid harmful bird species target detection method based on deep learning |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| WW01 | Invention patent application withdrawn after publication | Application publication date: 20190628 |