CN114387490A - Backbone design of end-side OCR recognition system based on NAS search - Google Patents


Publication number
CN114387490A
Authority
CN
China
Prior art keywords
network
search
architecture
ocr
backbone
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111471433.1A
Other languages
Chinese (zh)
Inventor
方徐伟
张帅
徐小龙
谢巍盛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tianyi Electronic Commerce Co Ltd
Original Assignee
Tianyi Electronic Commerce Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2021-12-04
Filing date
2021-12-04
Publication date
2022-04-22
Application filed by Tianyi Electronic Commerce Co Ltd
Priority to CN202111471433.1A
Publication of CN114387490A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2415 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on parametric or probabilistic models, e.g. based on likelihood ratio or false acceptance rate versus a false rejection rate
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/047 Probabilistic or stochastic networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G06N3/082 Learning methods modifying the architecture, e.g. adding, deleting or silencing nodes or connections

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Backbone design for an end-side OCR recognition system based on NAS search. In the overall OCR architecture, the OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head. The detection head and the recognition head can be replaced by any common lightweight detection and recognition architecture and are not discussed; the focus is on constructing a lightweight Backbone. The invention designs the Backbone architecture for the end-side OCR system through multi-task architecture search. Drawing on the experience of predecessors, it designs the overall architecture of the OCR Backbone and four searchable OPs, jointly optimizes the latency and parameter count of the network architecture with the detection and recognition losses through differentiable search, and seeks the best trade-off among model accuracy, model parameter count and model latency. The method can replace a manually designed Backbone in finding the optimal architecture for deployment.

Description

Backbone design of end-side OCR recognition system based on NAS search
Technical Field
The invention relates to the fields of OCR, AutoML and NAS, and in particular to a Backbone design for an end-side OCR recognition system based on NAS search.
Background
OCR, optical character recognition, refers to the process of converting characters in a picture into computer text by a character recognition method. It is widely applied to the recognition of documents, bills, certificates and the like, and is one of the few deep-learning-based technologies that have truly been put into production. OCR is generally divided into two steps: text detection and recognition, followed by post-processing. Text detection and recognition are generally done in one of two ways: two-stage text detection plus text recognition, or single-stage end-to-end detection and recognition. Post-processing can be roughly divided into two types: post-processing based on prior knowledge and post-processing based on deep learning.
Since 2016, AutoML technology has developed continuously; since 2018 in particular, papers on automatic hyperparameter tuning and automatic search have appeared at the top conferences. NAS, one of the branches of AutoML, has also attracted the attention of students and leading experts, and companies and universities have invested in its research. NAS stands for Neural Architecture Search: a neural architecture is searched automatically by defining a search space and a search algorithm, which reduces the reliance on human prior knowledge and human bias, in the hope of finding a better neural network architecture.
Current OCR recognition can be deployed in two ways: on a server or on the end side. Deploying the model on a server has the advantage that a large model can be used, so the recognition rate is higher. Its drawbacks are that data must be transmitted between the two ends, which adds transmission time and the risk of transmission failure, and that pictures usually have to be compressed for transmission, which with some probability distorts them and thus affects recognition accuracy. Deploying the model on the end side directly avoids the image loss caused by data transmission and compression. Its drawbacks are that a large model cannot be deployed on the end side, so the model has to be shrunk by compression and pruning, which causes some loss of accuracy, and that the computing power of the end side is limited, so the model must still take computing power and latency into account. The main limitation on deploying OCR on the end side lies in the Backbone. The invention therefore aims to use NAS technology to find a Backbone for the OCR framework that performs better on the end side, reducing the bias of manually designed Backbones, optimizing recognition accuracy and speed, and making the model better suited to end-side deployment.
Disclosure of Invention
The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a Backbone design for an end-side OCR recognition system based on NAS search.
The invention provides the following technical scheme:
the invention provides a Backbone design for an end-side OCR recognition system based on NAS search, comprising the following:
firstly, designing the overall OCR architecture:
the OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head. The detection head and the recognition head can be replaced by any common lightweight detection and recognition architecture and are not discussed here; the focus is on constructing a lightweight Backbone;
secondly, the architecture design of the Backbone:
first, the overall architecture of the OCR recognition Backbone needs to be designed; here the architecture of the image classification network in NASNet is optimized:
N denotes the number of layers and S denotes the factor by which the picture or feature map is downsampled. The structure uses a downsampling scale of 16, which greatly enlarges the receptive field of the network and considerably improves the detection of objects with a large aspect ratio, such as text;
thirdly, designing a pooling cell:
according to the results of earlier NAS searches, whether or not the pooling cell is searched contributes little to network performance. Therefore, to reduce the search time and to account for limited resources (the search here runs on a single GPU only), the pooling cell is designed by hand:
the pooling cell has the following advantages: first, it widens the network, and, as GoogLeNet shows, a wider network can collect different kinds of information and thereby improve accuracy; second, following the idea of the residual network, it allows shallow information to be combined; finally, the information is integrated through a summation operation. Introducing the pooling cell reduces the search space;
fourthly, designing the search space of the convolution cell:
the connection pattern is not searched; only the OP type is searched, and 4 types of OPs are defined;
based on the depthwise (dw) convolution proposed in MobileNet, four end-side-oriented OPs are designed, which together form a convolution cell;
the way the OPs are combined in the convolution cell is calculated as shown in formula 1:
[Formula 1: image not reproduced]
formula 1 is used to calculate the convolution cell of each layer, where X denotes the input feature map, X' denotes the output feature map, and w_i denotes the architecture parameters of this layer;
fifthly, differentiable design:
because the architecture parameters are discrete, they cannot be differentiated directly. The network architecture parameters are therefore reparameterized by combining a probability distribution with softmax, so that they can be differentiated together with the network. The specific procedure is as follows:
step 1: assuming that the network output is an n-dimensional vector a, generate independent samples [b_1, b_2, ..., b_n], of the same dimension as a, from the uniform distribution on (0, 1);
step 2: compute c_i from b_i by the formula c_i = -log(-log(b_i));
step 3: add the corresponding elements to obtain a new vector a' = [a_1 + c_1, ..., a_n + c_n];
step 4: compute the final result with the softmax formula shown in formula 2:
[Formula 2: image not reproduced]
where τ denotes the temperature, whose value decreases as the number of training epochs increases;
sixthly, designing the time delay and the model parameter quantity:
because the OCR model is deployed on the end side, the size and the latency of the model must be taken into account during the search. A scheme is therefore designed in which both are optimized differentiably along with model training. The specific steps are as follows:
step 1: measure the run time and the parameter size of each designed op individually, denoted as
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
where i is the index of the op, i ≤ 4, and l indicates which convolution cell the op is located in, l ≤ 6;
step 2: multiply the reparameterized network architecture parameter of each op in each layer by the corresponding
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
and sum over all layers, which gives the computation latency and the parameter quantity of the designed Backbone network;
step 3: optimize the calculated network latency and network parameter quantity jointly with the detection and recognition losses of the network as a multi-task objective; the calculation formula is:
l_total = l_det + l_recog + α * l_t + β * l_m
where α and β are the weights of the latency loss l_t and the parameter loss l_m; the larger these weights, the lighter the searched network. A balance must be struck between accuracy on the one hand and model size and latency on the other, and the weights can be adjusted according to experimental results. After the search is finished, for each layer the op with the largest softmax value of the architecture parameters is selected, and the selected ops are combined into the final Backbone.
Compared with the prior art, the invention has the following beneficial effects:
the invention designs a Backbone architecture for an end-side OCR system through multi-task architecture search. Drawing on the experience of predecessors, it designs the overall architecture of the OCR Backbone and four searchable OPs, jointly optimizes the latency and parameter count of the network architecture with the detection and recognition losses through differentiable search, and seeks the best trade-off among model accuracy, model parameter count and model latency. The method can replace a manually designed Backbone in finding the optimal architecture for deployment.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:
FIG. 1 is a diagram of a prior art NAS search process;
FIG. 2 is a schematic diagram of the overall OCR architecture of the present invention;
FIG. 3 is a schematic diagram of the underlying network of the present invention;
FIG. 4 is a schematic diagram of a pooled cell of the present invention;
FIG. 5 is a schematic diagram of 4 ops according to the invention;
FIG. 6 is a schematic diagram of the 4 op combinations of the present invention.
Detailed Description
The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. Wherein like reference numerals refer to like parts throughout.
Example 1
Referring to fig. 1 to 6, the present invention provides a Backbone design for an end-side OCR recognition system based on NAS search, comprising the following:
firstly, designing the overall OCR architecture:
as shown in fig. 2, the OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head. The detection head and the recognition head can be replaced by any lightweight detection and recognition architecture and are not discussed here; we mainly aim to construct a lightweight Backbone;
secondly, the architecture design of the Backbone:
first, the overall architecture of the OCR recognition Backbone needs to be designed. Here we make some architectural optimizations to the image classification network in NASNet; the overall architecture of the Backbone is shown in fig. 3:
as shown in fig. 3, N denotes the number of layers and S denotes the factor by which the picture or feature map is downsampled. The structure of the invention uses a downsampling scale of 16, which greatly enlarges the receptive field of the network and considerably improves the detection of objects with a large aspect ratio, such as text;
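As a rough illustration of what a total downsampling factor of 16 means for a wide text image, the following sketch computes the resulting feature-map size. The function name `feature_map_size`, the 64 x 640 input and the use of ceiling division (that is, "same"-style padding) are assumptions made for this example and are not taken from the patent.

```python
from typing import Tuple


def feature_map_size(height: int, width: int, stride: int = 16) -> Tuple[int, int]:
    """Spatial size after a total downsampling by `stride`.

    Ceiling division is an assumption ("same"-style padding); a network
    without padding would produce slightly smaller maps.
    """
    return (-(-height // stride), -(-width // stride))


# Hypothetical wide text-line image: 64 pixels high, 640 pixels wide.
size = feature_map_size(64, 640)
print(size)
```

Each position of the resulting map summarizes a 16 x 16 pixel region of the input, which is one way to see why a larger downsampling factor widens the receptive field.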
thirdly, designing a pooling cell:
according to the results of earlier NAS searches, whether or not the pooling cell is searched contributes little to network performance. Therefore, to reduce the search time and to account for limited resources (here we search on a single GPU only), we design the pooling cell by hand, as shown in fig. 4:
the designed pooling cell has the following advantages: first, it widens the network, and, as GoogLeNet shows, a wider network can collect different kinds of information and thereby improve accuracy; second, following the idea of the residual network, it allows shallow information to be combined; finally, the information is integrated through a summation operation. Introducing the pooling cell shown in fig. 4 reduces the search space;
fourthly, designing the search space of the convolution cell:
here we do not search the connection pattern; we search only the OP type, and we define the 4 types of OPs shown in fig. 5;
based on the depthwise (dw) convolution proposed in MobileNet, we design four end-side-oriented OPs that together form a convolution cell;
the way the OPs are combined in a convolution cell is shown in fig. 6, and the specific calculation is shown in formula 1:
[Formula 1: image not reproduced]
formula 1 is used to calculate the convolution cell of each layer, where X denotes the input feature map, X' denotes the output feature map, and w_i denotes the architecture parameters of this layer;
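Formula 1 is available only as an image in this text. One common way to combine candidate OPs in a differentiable search, and a plausible reading of the description above, is a weighted sum X' = Σ_i w_i · OP_i(X). The sketch below implements that assumed form on plain Python lists. The name `mixed_op` and the four toy operations are hypothetical stand-ins chosen for illustration; they are not the four OPs of fig. 5.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def mixed_op(
    x: Vector,
    ops: Sequence[Callable[[Vector], Vector]],
    weights: Sequence[float],
) -> Vector:
    """Assumed form of formula 1: X' = sum_i w_i * OP_i(X)."""
    if len(ops) != len(weights):
        raise ValueError("one weight per candidate op is required")
    out = [0.0] * len(x)
    for w, op in zip(weights, ops):
        y = op(x)
        for k, v in enumerate(y):
            out[k] += w * v
    return out


# Four toy candidate ops (hypothetical, for illustration only).
toy_ops = [
    lambda x: list(x),                   # identity
    lambda x: [2.0 * v for v in x],      # scaling
    lambda x: [max(v, 0.0) for v in x],  # ReLU
    lambda x: [0.0 for _ in x],          # zero
]
weights = [0.1, 0.2, 0.3, 0.4]

y = mixed_op([1.0, -1.0], toy_ops, weights)
print(y)
```

Because the output is a smooth function of the weights, gradients of the task loss can flow back to the architecture parameters, which is what makes a search of this kind differentiable.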
fifthly, differentiable design:
because the architecture parameters are discrete, they cannot be differentiated directly. We therefore reparameterize the network architecture parameters by combining a probability distribution with softmax, so that they can be differentiated together with the network. The specific procedure is as follows:
step 1: assuming that the network output is an n-dimensional vector a, generate independent samples [b_1, b_2, ..., b_n], of the same dimension as a, from the uniform distribution on (0, 1);
step 2: compute c_i from b_i by the formula c_i = -log(-log(b_i));
step 3: add the corresponding elements to obtain a new vector a' = [a_1 + c_1, ..., a_n + c_n];
step 4: compute the final result with the softmax formula shown in formula 2:
[Formula 2: image not reproduced]
where τ denotes the temperature, whose value decreases as the number of training epochs increases;
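Steps 1 to 4 match the procedure usually called Gumbel-softmax. The sketch below follows them under two assumptions: the samples b_i are drawn uniformly from (0, 1), and formula 2 is the standard temperature softmax exp(a'_i / τ) / Σ_j exp(a'_j / τ). The exponential decay schedule in `temperature` and all of its constants are hypothetical; the text only states that τ decreases as training proceeds.

```python
import math
import random
from typing import List, Optional, Sequence


def gumbel_softmax(
    logits: Sequence[float],
    tau: float,
    rng: Optional[random.Random] = None,
) -> List[float]:
    """Relaxed one-hot sample over the candidate ops (steps 1 to 4)."""
    if rng is None:
        rng = random.Random()
    eps = 1e-20  # guards the logarithms against log(0)
    noisy = []
    for a in logits:
        b = rng.random()                          # step 1: uniform sample
        c = -math.log(-math.log(b + eps) + eps)   # step 2: c = -log(-log(b))
        noisy.append((a + c) / tau)               # step 3, scaled by 1/tau
    m = max(noisy)                                # subtract max for stability
    exps = [math.exp(v - m) for v in noisy]       # step 4: softmax
    total = sum(exps)
    return [e / total for e in exps]


def temperature(epoch: int, tau_start: float = 5.0,
                tau_min: float = 0.1, decay: float = 0.9) -> float:
    """Hypothetical annealing schedule: tau shrinks as epochs increase."""
    return max(tau_min, tau_start * decay ** epoch)


rng = random.Random(0)
for epoch in (0, 10, 50):
    tau = temperature(epoch)
    probs = gumbel_softmax([1.0, 2.0, 0.5, -1.0], tau, rng)
    print(epoch, round(tau, 3), [round(p, 3) for p in probs])
```

At a high temperature the output is close to uniform; as τ shrinks it approaches a one-hot vector, so the relaxed choice gradually hardens into a discrete selection of one op.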
sixthly, designing the time delay and the model parameter quantity:
because the OCR model is deployed on the end side, we need to take the size and the latency of the model into account during the search. We therefore design a scheme in which both are optimized differentiably along with model training. The specific steps are as follows:
step 1: measure the run time and the parameter size of each designed op individually, denoted as
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
where i is the index of the op, i ≤ 4, and l indicates which convolution cell the op is located in, l ≤ 6;
step 2: multiply the reparameterized network architecture parameter of each op in each layer by the corresponding
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
and sum over all layers, which gives the computation latency and the parameter quantity of the designed Backbone network;
step 3: optimize the calculated network latency and network parameter quantity jointly with the detection and recognition losses of the network as a multi-task objective; the calculation formula is:
l_total = l_det + l_recog + α * l_t + β * l_m
where α and β are the weights of the latency loss l_t and the parameter loss l_m; the larger these weights, the lighter the searched network. A balance must be struck between accuracy on the one hand and model size and latency on the other, and the weights can be adjusted according to experimental results. After the search is finished, for each layer the op with the largest softmax value of the architecture parameters is selected, and the selected ops are combined into the final Backbone.
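The three steps above can be sketched as follows. The code treats the latency and parameter terms as probability-weighted sums of per-op costs over all cells, adds them to the task losses with weights α and β, and finally picks the highest-probability op in each cell. All function names and all numbers (two cells instead of six, latencies in milliseconds, parameter counts in thousands, the loss values) are hypothetical and serve only to make the arithmetic concrete.

```python
from typing import List, Sequence


def expected_cost(
    probs: Sequence[Sequence[float]],
    costs: Sequence[Sequence[float]],
) -> float:
    """Step 2: sum over cells l and ops i of probs[l][i] * costs[l][i]."""
    total = 0.0
    for p_cell, c_cell in zip(probs, costs):
        total += sum(p * c for p, c in zip(p_cell, c_cell))
    return total


def total_loss(l_det: float, l_recog: float, l_t: float, l_m: float,
               alpha: float, beta: float) -> float:
    """Step 3: l_total = l_det + l_recog + alpha * l_t + beta * l_m."""
    return l_det + l_recog + alpha * l_t + beta * l_m


def select_ops(probs: Sequence[Sequence[float]]) -> List[int]:
    """After the search: index of the largest softmax value in each cell."""
    return [max(range(len(p)), key=lambda i: p[i]) for p in probs]


# Hypothetical relaxed architecture probabilities for 2 cells x 4 ops.
probs = [[0.25, 0.25, 0.25, 0.25], [0.7, 0.1, 0.1, 0.1]]
# Step 1 (assumed measurements): per-op latency and parameter size.
latency_ms = [[1.0, 2.0, 3.0, 4.0], [1.0, 2.0, 3.0, 4.0]]
params_k = [[10.0, 20.0, 30.0, 40.0], [10.0, 20.0, 30.0, 40.0]]

l_t = expected_cost(probs, latency_ms)
l_m = expected_cost(probs, params_k)
loss = total_loss(l_det=1.2, l_recog=0.8, l_t=l_t, l_m=l_m,
                  alpha=0.1, beta=0.01)
chosen = select_ops(probs)
print(l_t, l_m, loss, chosen)
```

Since both cost terms are linear in the relaxed probabilities, they are differentiable with respect to the architecture parameters, and raising α or β pushes probability mass toward cheaper ops.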
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (1)

1. A Backbone design for an end-side OCR recognition system based on NAS search, characterized by comprising the following:
firstly, designing the overall OCR architecture:
the OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head. The detection head and the recognition head can be replaced by any common lightweight detection and recognition architecture and are not discussed here; the focus is on constructing a lightweight Backbone;
secondly, the architecture design of the Backbone:
first, the overall architecture of the OCR recognition Backbone needs to be designed; here the architecture of the image classification network in NASNet is optimized:
N denotes the number of layers and S denotes the factor by which the picture or feature map is downsampled. The structure uses a downsampling scale of 16, which greatly enlarges the receptive field of the network and considerably improves the detection of objects with a large aspect ratio, such as text;
thirdly, designing a pooling cell:
according to the results of earlier NAS searches, whether or not the pooling cell is searched contributes little to network performance. Therefore, to reduce the search time and to account for limited resources (the search here runs on a single GPU only), the pooling cell is designed by hand:
the pooling cell has the following advantages: first, it widens the network, and, as GoogLeNet shows, a wider network can collect different kinds of information and thereby improve accuracy; second, following the idea of the residual network, it allows shallow information to be combined; finally, the information is integrated through a summation operation. Introducing the pooling cell reduces the search space;
fourthly, designing the search space of the convolution cell:
the connection pattern is not searched; only the OP type is searched, and 4 types of OPs are defined;
based on the depthwise (dw) convolution proposed in MobileNet, four end-side-oriented OPs are designed, which together form a convolution cell;
the way the OPs are combined in the convolution cell is calculated as shown in formula 1:
[Formula 1: image not reproduced]
formula 1 is used to calculate the convolution cell of each layer, where X denotes the input feature map, X' denotes the output feature map, and w_i denotes the architecture parameters of this layer;
fifthly, differentiable design:
because the architecture parameters are discrete, they cannot be differentiated directly. The network architecture parameters are therefore reparameterized by combining a probability distribution with softmax, so that they can be differentiated together with the network. The specific procedure is as follows:
step 1: assuming that the network output is an n-dimensional vector a, generate independent samples [b_1, b_2, ..., b_n], of the same dimension as a, from the uniform distribution on (0, 1);
step 2: compute c_i from b_i by the formula c_i = -log(-log(b_i));
step 3: add the corresponding elements to obtain a new vector a' = [a_1 + c_1, ..., a_n + c_n];
step 4: compute the final result with the softmax formula shown in formula 2:
[Formula 2: image not reproduced]
where τ denotes the temperature, whose value decreases as the number of training epochs increases;
sixthly, designing the time delay and the model parameter quantity:
because the OCR model is deployed on the end side, the size and the latency of the model must be taken into account during the search. A scheme is therefore designed in which both are optimized differentiably along with model training. The specific steps are as follows:
step 1: measure the run time and the parameter size of each designed op individually, denoted as
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
where i is the index of the op, i ≤ 4, and l indicates which convolution cell the op is located in, l ≤ 6;
step 2: multiply the reparameterized network architecture parameter of each op in each layer by the corresponding
[Symbol image not reproduced: run time of op i in convolution cell l]
and
[Symbol image not reproduced: parameter size of op i in convolution cell l]
and sum over all layers, which gives the computation latency and the parameter quantity of the designed Backbone network;
step 3: optimize the calculated network latency and network parameter quantity jointly with the detection and recognition losses of the network as a multi-task objective; the calculation formula is:
l_total = l_det + l_recog + α * l_t + β * l_m
where α and β are the weights of the latency loss l_t and the parameter loss l_m; the larger these weights, the lighter the searched network. A balance must be struck between accuracy on the one hand and model size and latency on the other, and the weights can be adjusted according to experimental results. After the search is finished, for each layer the op with the largest softmax value of the architecture parameters is selected, and the selected ops are combined into the final Backbone.
CN202111471433.1A 2021-12-04 2021-12-04 Backbone design of end-side OCR recognition system based on NAS search Pending CN114387490A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111471433.1A CN114387490A (en) 2021-12-04 2021-12-04 Backbone design of end-side OCR recognition system based on NAS search

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111471433.1A CN114387490A (en) 2021-12-04 2021-12-04 Backbone design of end-side OCR recognition system based on NAS search

Publications (1)

Publication Number Publication Date
CN114387490A true CN114387490A (en) 2022-04-22

Family

ID=81196390

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111471433.1A Pending CN114387490A (en) 2021-12-04 2021-12-04 Backbone design of end-side OCR recognition system based on NAS search

Country Status (1)

Country Link
CN (1) CN114387490A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116522999A (en) * 2023-06-26 2023-08-01 深圳思谋信息科技有限公司 Model searching and time delay predictor training method, device, equipment and storage medium
CN116522999B (en) * 2023-06-26 2023-12-15 深圳思谋信息科技有限公司 Model searching and time delay predictor training method, device, equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication