CN114387490A

CN114387490A - Backbone design of end-side OCR recognition system based on NAS search

Info

Publication number: CN114387490A
Application number: CN202111471433.1A
Authority: CN
Inventors: 方徐伟; 张帅; 徐小龙; 谢巍盛
Original assignee: Tianyi Electronic Commerce Co Ltd
Current assignee: Tianyi Electronic Commerce Co Ltd
Priority date: 2021-12-04
Filing date: 2021-12-04
Publication date: 2022-04-22

Abstract

The invention discloses a Backbone design for an end-side OCR recognition system based on NAS search. The OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head. The detection head and the recognition head can be replaced by common lightweight detection and recognition architectures and are not discussed further; the invention focuses on constructing a lightweight Backbone. The invention designs a Backbone architecture for an end-side OCR system through multi-task architecture search. Drawing on the experience of predecessors, it designs the overall architecture of the OCR Backbone and four searchable OPs, jointly optimizes the latency and parameter count of the network architecture together with the detection and recognition losses through differentiable search, and seeks an optimal balance among model accuracy, model parameter count and model latency. The method can replace manually designed Backbones in finding the optimal architecture for deployment.

Description

Backbone design of end-side OCR recognition system based on NAS search

Technical Field

The invention relates to the fields of OCR, AutoML and NAS, and in particular to a Backbone design of an end-side OCR recognition system based on NAS search.

Background

OCR (optical character recognition) refers to the process of converting the characters in a picture into computer text by a character recognition method. It is widely applied to the recognition of documents, bills, certificates and the like, and is one of the few deep-learning-based technologies that has actually been put into production. OCR is generally divided into two steps: text detection and recognition, followed by post-processing. There are generally two ways to detect and recognize text: two-stage text detection plus text recognition, and single-stage end-to-end detection and recognition. Post-processing can be roughly divided into two types: post-processing based on prior knowledge and post-processing based on deep learning.

AutoML technology has developed continuously since 2016. Since 2018 in particular, papers on automatic hyperparameter tuning and automatic search have appeared at the top conferences. NAS, one of the branches of AutoML, has attracted the attention of researchers and leading experts, and companies and universities have invested in its study. NAS stands for Neural Architecture Search: a neural architecture is searched automatically by defining a search space and a search algorithm, which reduces reliance on human prior knowledge and human bias, with the aim of finding a better neural network architecture.

Current OCR recognition modes can be divided into two types. In the first, the model is deployed on a server. The advantage is that a large model can be used, so the recognition rate is higher. The disadvantages are that data must be transmitted between the two ends, which adds transmission time and the risk of transmission failure, and that pictures usually need to be compressed for transmission, which with some probability distorts them and reduces recognition accuracy. In the second, the model is deployed on the end side, which directly avoids the image loss caused by data transmission and data compression. The disadvantage is that a large model cannot be deployed on the end side: the model must be shrunk by compression and pruning, which costs some accuracy, and because the computing capability of the end side is limited, the model must still take computational cost and latency into account. The deployment limitation of OCR on the end side lies mainly in the Backbone. The invention therefore uses NAS to search for an OCR Backbone that performs better on the end side, reducing the bias of manually designed Backbones, improving recognition accuracy and recognition speed, and making the model more suitable for end-side deployment.

Disclosure of Invention

The technical problem to be solved by the invention is to overcome the defects of the prior art and provide a Backbone design of an end-side OCR recognition system based on NAS search.

The invention provides the following technical scheme:

the invention provides a Backbone design of an end-side OCR recognition system based on NAS search, which comprises the following steps:

firstly, designing an OCR overall architecture:

the OCR system is divided into three modules, namely a differentiable Backbone, a detection head and a recognition head; the detection head and the recognition head can be replaced by common lightweight detection and recognition architectures and are not discussed here; the main goal is to construct a lightweight Backbone;

secondly, the architecture design of the Backbone:

first, the overall architecture of the Backbone for OCR recognition needs to be designed; it is obtained by making architectural optimizations to the image classification network in NASNet:

N represents the number of layers and S represents the downsampling factor applied to the picture or feature map; the structure uses a total downsampling factor of 16, which substantially enlarges the receptive field of the network and improves the detection of targets with a large aspect ratio, such as text;
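As a small arithmetic illustration of the 16x downsampling stated above, the sketch below computes the spatial size of the final feature map for a given input. The function name and the round-up rule at image borders are assumptions made for illustration; the text does not specify padding behaviour.

```python
import math
from typing import Tuple


def feature_map_size(height: int, width: int, total_stride: int = 16) -> Tuple[int, int]:
    """Spatial size of the Backbone output for a given input size.

    Assumes the strided stages round up at the borders ("same"-style
    padding); that rounding rule is an assumption, not taken from the text.
    """
    return math.ceil(height / total_stride), math.ceil(width / total_stride)


# A square page image and a long, thin text-line image.
page = feature_map_size(640, 640)       # (40, 40)
text_line = feature_map_size(32, 320)   # (2, 20)
```

Each output position then summarizes a 16 x 16 input patch at minimum, which is one way to read the receptive-field argument made above.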

thirdly, designing a pooling cell:

according to the results of previous NAS work, whether or not the pooling cell is searched contributes little to network performance; therefore, to reduce the search time and in view of limited resources (the search here runs on a single GPU only), a fixed pooling cell is designed:

the pooling cell has the following advantages: first, it widens the network, and, as GoogLeNet shows, a wider network can collect different kinds of information and thereby improve accuracy; second, following the idea of the residual network, it brings in shallow-layer information; finally, the information is integrated through a summation operation; introducing this pooling cell reduces the search space;

fourthly, designing a search space of the convolution cell:

no search is performed over the connection pattern; only the OP type is searched, and 4 types of OPs are defined;

based on the depthwise (dw) convolution proposed in MobileNet, four end-side-oriented OPs are designed, which together form a convolution cell;

the combination of the OPs within a convolution cell is computed as shown in formula 1:

formula 1 computes the convolution cell of each layer, where X represents the input feature map, X' represents the output feature map, and w_i represents the architecture parameters of this layer;
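Formula 1 itself is not reproduced in this text. A minimal sketch of one plausible reading is given below, assuming the cell output is a weighted sum of the four candidate OPs, X' = sum_i p_i * op_i(X), with p = softmax(w), as is common in differentiable NAS. That form, the names `softmax` and `mixed_cell`, and the four toy OPs are all assumptions for illustration; the real OPs are the depthwise-convolution blocks of fig. 5.

```python
import math
from typing import Callable, List, Sequence

Vector = List[float]


def softmax(logits: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixed_cell(x: Vector, ops: Sequence[Callable[[Vector], Vector]],
               w: Sequence[float]) -> Vector:
    """Assumed form of formula 1: X' = sum_i softmax(w)_i * op_i(X)."""
    probs = softmax(w)
    outputs = [op(x) for op in ops]
    return [sum(p * out[k] for p, out in zip(probs, outputs))
            for k in range(len(x))]


# Four toy stand-ins for the candidate OPs (hypothetical, shape-preserving).
toy_ops = [
    lambda x: list(x),               # identity
    lambda x: [2.0 * v for v in x],  # scale
    lambda x: [v + 1.0 for v in x],  # shift
    lambda x: [0.0 for _ in x],      # zero
]

# Equal architecture parameters give each OP a weight of 0.25.
x_out = mixed_cell([1.0, 2.0], toy_ops, [0.0, 0.0, 0.0, 0.0])
```

Because every OP contributes to X' during the search, the gradient of the loss reaches every w_i, which is what makes the OP choice trainable.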

fifthly, differentiable design:

because the architecture parameters are discrete, they cannot be differentiated directly; the network architecture parameters are therefore re-parameterized by combining a probability distribution with softmax, so that they can be differentiated and trained together with the network; the specific procedure is as follows:

Step 1: assuming that the network output value is an n-dimensional vector a, generate independent samples [b₁, b₂, ..., b_n], with the same dimension as a, drawn from the uniform distribution;

Step 2: compute c_i by the formula c_i = -log(-log(b_i));

Step 3: add the corresponding elements to obtain a new vector a' = [a₁+c₁, ..., a_n+c_n];

Step 4: compute the final result with the softmax formula, shown as formula 2:

where τ represents the temperature, whose value decreases as the number of training epochs increases;
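Steps 1 to 4 can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the samples b_i are taken from Uniform(0, 1), which is what the transform -log(-log(b_i)) needs in order to produce Gumbel noise; formula 2 is not reproduced in this text and is assumed to be the usual temperature softmax p_i = exp(a'_i / τ) / sum_j exp(a'_j / τ); the clamp `eps` and the exponential decay schedule in `temperature` (with its default values) are hypothetical choices.

```python
import math
import random
from typing import List, Sequence


def gumbel_softmax(a: Sequence[float], tau: float, rng: random.Random) -> List[float]:
    """Re-parameterized relaxation of a discrete choice over len(a) options."""
    eps = 1e-12  # assumption: clamp samples so both logarithms stay finite
    # Step 1: independent uniform samples, one per entry of a.
    b = [min(max(rng.random(), eps), 1.0 - eps) for _ in a]
    # Step 2: c_i = -log(-log(b_i)).
    c = [-math.log(-math.log(b_i)) for b_i in b]
    # Step 3: a' = a + c, element by element.
    a_new = [a_i + c_i for a_i, c_i in zip(a, c)]
    # Step 4: softmax with temperature tau (assumed form of formula 2).
    m = max(a_new)
    exps = [math.exp((v - m) / tau) for v in a_new]
    total = sum(exps)
    return [e / total for e in exps]


def temperature(epoch: int, tau0: float = 5.0, decay: float = 0.9,
                tau_min: float = 0.1) -> float:
    """Hypothetical schedule: tau shrinks as the epoch count grows."""
    return max(tau_min, tau0 * decay ** epoch)


rng = random.Random(0)
probs = gumbel_softmax([1.0, 2.0, 0.5, 0.0], tau=temperature(0), rng=rng)
```

A high temperature keeps the four probabilities close to each other so that all OPs receive gradient early on; as τ falls, the output approaches a one-hot selection.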

sixthly, designing time delay and model parameter quantity:

since the OCR model is deployed on the end side, the size and latency of the model must be taken into account during the search; a method is therefore designed in which they are optimized differentiably along with model training; the specific steps are as follows:

Step 1: compute the runtime and the model parameter size of each designed op individually; both quantities are indexed by i and l, where i is the index of the op, i ≤ 4, and l is the index of the convolution cell in which the op is located, l ≤ 6;

Step 2: for each layer, multiply the re-parameterized network architecture parameter of each op by that op's runtime and by its parameter size, and sum over all layers; this gives the computational latency and the network parameter count of the designed Backbone network;
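Steps 1 and 2 amount to a probability-weighted sum over layers and OPs. The sketch below shows that sum for 6 convolution cells with 4 OPs each. The name `expected_cost` and every number in the tables are placeholders chosen for illustration, not measurements from the patent.

```python
from typing import List, Sequence


def expected_cost(probs: Sequence[Sequence[float]],
                  cost: Sequence[Sequence[float]]) -> float:
    """Sum over layers l and ops i of probs[l][i] * cost[l][i]."""
    return sum(p * c
               for layer_p, layer_c in zip(probs, cost)
               for p, c in zip(layer_p, layer_c))


NUM_LAYERS, NUM_OPS = 6, 4

# Placeholder per-op measurements (hypothetical values, same for each layer).
latency_ms: List[List[float]] = [[1.0, 2.0, 3.0, 4.0] for _ in range(NUM_LAYERS)]
params_k: List[List[float]] = [[10.0, 20.0, 30.0, 40.0] for _ in range(NUM_LAYERS)]

# Re-parameterized architecture weights; uniform here for a simple check.
arch_probs: List[List[float]] = [[0.25] * NUM_OPS for _ in range(NUM_LAYERS)]

l_t = expected_cost(arch_probs, latency_ms)  # expected latency of the Backbone
l_m = expected_cost(arch_probs, params_k)    # expected parameter count
```

Since the per-op costs are constants measured once, both sums are linear in the architecture weights and can be back-propagated through like any other loss term.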

Step 3: optimize the computed network latency and network parameter count jointly with the detection and recognition losses of the network as a multi-task objective; the calculation formula is as follows:

l_total = l_det + l_recog + α*l_t + β*l_m

where α and β are the weights of the latency loss l_t and the parameter-count loss l_m; the larger these weights, the lighter the searched network; a balance must be struck between accuracy on the one hand and model size and latency on the other, and the weights can be adjusted according to experimental results; after the search is finished, for each layer the op with the largest softmax value of the architecture parameters is selected, and the selected ops are combined into the final Backbone.
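The multi-task objective and the final selection step can be sketched as below. The loss values, the settings of `alpha` and `beta`, and the function names are illustrative assumptions; in practice l_det and l_recog come from the detection and recognition heads.

```python
from typing import List, Sequence


def total_loss(l_det: float, l_recog: float, l_t: float, l_m: float,
               alpha: float, beta: float) -> float:
    """l_total = l_det + l_recog + alpha * l_t + beta * l_m."""
    return l_det + l_recog + alpha * l_t + beta * l_m


def select_ops(probs: Sequence[Sequence[float]]) -> List[int]:
    """For each layer, pick the index of the op with the largest softmax value."""
    return [max(range(len(layer)), key=lambda i: layer[i]) for layer in probs]


# Illustrative numbers only.
loss = total_loss(l_det=0.8, l_recog=1.2, l_t=15.0, l_m=150.0,
                  alpha=0.01, beta=0.001)

# Softmax values of the architecture parameters for two example layers.
final_ops = select_ops([[0.1, 0.6, 0.2, 0.1],
                        [0.4, 0.3, 0.2, 0.1]])
```

Raising `alpha` or `beta` makes the latency and size terms dominate the gradient on the architecture weights, which pushes the search toward the cheaper OPs at some cost in detection and recognition accuracy.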

Compared with the prior art, the invention has the following beneficial effects:

the invention designs a Backbone architecture for an end-side OCR system through multi-task architecture search; drawing on the experience of predecessors, it designs the overall architecture of the OCR Backbone and four searchable OPs, jointly optimizes the latency and parameter count of the network architecture together with the detection and recognition losses through differentiable search, and seeks an optimal balance among model accuracy, model parameter count and model latency; the method can replace manually designed Backbones in finding the optimal architecture for deployment.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this specification, illustrate embodiments of the invention and together with the description serve to explain the principles of the invention and not to limit the invention. In the drawings:

FIG. 1 is a diagram of a prior art NAS search process;

FIG. 2 is a schematic diagram of the overall OCR architecture of the present invention;

FIG. 3 is a schematic diagram of the underlying network of the present invention;

FIG. 4 is a schematic diagram of a pooled cell of the present invention;

FIG. 5 is a schematic diagram of 4 ops according to the invention;

FIG. 6 is a schematic diagram of the 4 op combinations of the present invention.

Detailed Description

The preferred embodiments of the present invention will be described in conjunction with the accompanying drawings, and it will be understood that they are described herein for the purpose of illustration and explanation and not limitation. Wherein like reference numerals refer to like parts throughout.

Example 1

Referring to fig. 1 to 6, the present invention provides a Backbone design of an end-side OCR recognition system based on NAS search, comprising the following:

firstly, designing an OCR overall architecture:

as shown in fig. 2, the OCR system is divided into three modules: a differentiable Backbone, a detection head and a recognition head; the detection head and the recognition head can be replaced by lightweight detection and recognition architectures, which are not discussed here; the main aim is to construct a lightweight Backbone;

secondly, the architecture design of the Backbone:

first, the overall architecture of the Backbone for OCR recognition needs to be designed; here we make some architectural optimizations to the image classification network in NASNet, and the resulting overall architecture of the Backbone is shown in fig. 3:

as shown in fig. 3, N represents the number of layers and S represents the downsampling factor applied to the picture or feature map; the structure of the invention uses a total downsampling factor of 16, which substantially enlarges the receptive field of the network and improves the detection of targets with a large aspect ratio, such as text;

thirdly, designing a pooling cell:

according to the results of previous NAS work, whether or not the pooling cell is searched contributes little to network performance; therefore, to reduce the search time and in view of limited resources (here we search on a single GPU only), we design a fixed pooling cell as shown in fig. 4:

the designed pooling cell has the following advantages: first, it widens the network, and, as GoogLeNet shows, a wider network can collect different kinds of information and thereby improve accuracy; second, following the idea of the residual network, it brings in shallow-layer information; finally, the information is integrated through a summation operation; introducing the pooling cell shown in fig. 4 reduces the search space;

fourthly, designing a search space of the convolution cell:

here we do not search over the connection pattern; only the OP type is searched, and we define 4 types of OPs, as shown in fig. 5;

based on the depthwise (dw) convolution proposed in MobileNet, we design four end-side-oriented OPs, which together form a convolution cell;

the combination of the OPs within a convolution cell is shown in fig. 6, and the specific calculation is given by formula 1:

fifthly, differentiable design:

because the architecture parameters are discrete, they cannot be differentiated directly; the network architecture parameters are therefore re-parameterized by combining a probability distribution with softmax, so that they can be differentiated and trained together with the network; the specific procedure is as follows:

Step 2: compute c_i by the formula c_i = -log(-log(b_i));

sixthly, designing time delay and model parameter quantity:

since the OCR model is deployed on the end side, the size and latency of the model must be taken into account during the search; we therefore design a method in which they are optimized differentiably along with model training; the specific steps are as follows:


l_total = l_det + l_recog + α*l_t + β*l_m

where α and β are the weights of the latency loss l_t and the parameter-count loss l_m; the larger these weights, the lighter the searched network; a balance must be struck between accuracy on the one hand and model size and latency on the other, and the weights can be adjusted according to experimental results; after the search is finished, for each layer the op with the largest softmax value of the architecture parameters is selected, and the selected ops are combined into the final Backbone.

Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that changes may be made in the embodiments and/or equivalents thereof without departing from the spirit and scope of the invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The Backbone design of the end-side OCR recognition system based on NAS search is characterized by comprising the following steps:

firstly, designing an OCR overall architecture:

secondly, the architecture design of the Backbone:

thirdly, designing a pooling cell:

fourthly, designing a search space of the convolution cell:

fifthly, differentiable design:

Step 2: compute c_i by the formula c_i = -log(-log(b_i));

sixthly, designing time delay and model parameter quantity:


l_total = l_det + l_recog + α*l_t + β*l_m