CN108416349A

CN108416349A - Recognition and correction system and method

Info

Publication number: CN108416349A
Application number: CN201810087635.8A
Authority: CN
Inventors: 王志成; 张玉双; 王亮; 高磊; 邝展豪; 刘志欣; 胡奉平
Original assignee: SF Technology Co Ltd
Current assignee: SF Technology Co Ltd; SF Tech Co Ltd
Priority date: 2018-01-30
Filing date: 2018-01-30
Publication date: 2018-08-17

Abstract

The present invention relates to a recognition and correction system and method for digitizing paper document information. The system comprises: an image acquisition unit for acquiring an image of a paper document; a recognition unit for recognizing the text in the document image to obtain a recognition result data set for the document; an information correction unit for correcting the recognition result data set to obtain corrected document information; and a data storage unit for storing an information database, the document image, the recognition result data set of the document, and the corrected document information. The information database is the training data of the information correction unit. The recognition and correction system and method of the present invention improve on the low accuracy of OCR technology in recognizing text sequences.

Description

Recognition and correction system and method

Technical field

The present invention relates to the technical field of character recognition and is particularly suitable for recognizing text in the handwritten address area of logistics orders.

Background art

In recent years, with the rapid development of e-commerce, the volume of express parcels sent and received has grown explosively, while express waybill address information is still entered mainly by traditional manual typing, one waybill at a time.

Traditional manual entry of express waybill addresses is inefficient. Because handwriting on express waybills is complicated, the waybill itself may also carry an incorrect or incomplete address, which makes delivery difficult.

With the development of character recognition technology, OCR has also been introduced for recognizing handwritten paper express waybills. However, OCR places strict requirements on how regular the handwriting is, and its precision and accuracy in recognizing handwritten text sequences are low. After the recognition result is obtained, correcting and completing it still relies mainly on manual intervention, which is inefficient.

Summary of the invention

To address the low recognition rate of OCR technology on handwritten documents, the present invention provides a recognition and correction system and method that improve on the low accuracy of OCR in recognizing text sequences.

The present invention relates to a recognition and correction system for digitizing paper document information, comprising: an image acquisition unit for acquiring an image of a paper document; a recognition unit for recognizing the text in the document image to obtain a recognized text sequence of the document; and an information correction unit for correcting the recognized text sequence to obtain corrected document information.

Preferably, the recognition and correction system of the present invention further comprises a data storage unit for storing an information database, the document image, the recognized text sequence of the document, and the corrected document information, wherein the information database is the training data of the information correction unit.

Preferably, the information correction unit uses a sequence-to-sequence algorithm model to map the input recognized text sequence to an output text sequence.

Preferably, the sequence-to-sequence algorithm model comprises an encoder and a decoder; the encoder compares the recognized text sequence with the information database, and the decoder maps the comparison result to the corrected document information as output.

Preferably, the encoder obtains a context vector by computing the state-update part of the text sequence through encoding, and the decoder decodes according to the current state and the context vector to obtain the corrected document information.

Preferably, the information correction model uses the sequence-to-sequence algorithm model combined with an attention mechanism to map the input recognized text sequence to the output text sequence.

Preferably, the attention mechanism associates the context vector with the recognized text sequence at each time step to form a weighted context vector, and the decoder decodes to obtain the corrected document information.

Preferably, the recognition unit uses OCR technology to perform text recognition on the document image.

Preferably, the paper document information is express waybill address information.

Preferably, the information database is complete-address training data obtained by integrating big data and geographic information.

Preferably, the information correction unit compares the recognition result data set with the complete-address training data, performs error correction and completion on the express waybill address, and maps it to complete address information as output.

The invention further relates to a recognition and correction method for digitizing paper document information, comprising the following steps: acquiring an image of a paper document with an image acquisition unit; processing the document image by character recognition with a recognition unit to obtain a recognized text sequence of the document; and correcting the recognized text sequence with an information correction unit to obtain corrected document information.

Preferably, the recognition and correction method of the present invention further comprises the steps of storing an information database in a data storage unit in advance and storing the document image, the recognition result data set of the document, and the corrected document information, wherein the information database is the training data of the information correction unit.

The recognition and correction system and method provided by the present invention digitize paper document information and produce corrected and completed document information. This solves the problems of inefficient manual entry and low OCR recognition accuracy, and improves both working efficiency and the accuracy of data entry.

Description of the drawings

Preferred embodiments of the present invention are described below with reference to the accompanying drawings, which illustrate the preferred embodiments and are not intended to limit the invention. In the drawings:

Fig. 1 is an overall flow block diagram of the recognition and correction system according to an embodiment of the present invention;

Fig. 2 is an algorithm block diagram of the sequence-to-sequence algorithm according to an embodiment of the present invention;

Fig. 3 is an algorithm block diagram of the sequence-to-sequence algorithm combined with the attention mechanism according to an embodiment of the present invention;

Fig. 4 is an express waybill image according to an embodiment of the present invention.

Detailed description of the embodiments

The present invention is described in detail below with reference to specific embodiments, but it is not limited to those embodiments.

The preferred embodiment of the present invention is address recognition and correction for express waybills, and the following embodiment describes the invention in detail using this application.

This embodiment digitizes paper waybills in real time through the combined use of artificial intelligence, big data, and geographic information, enabling real-time processing and use of waybill data. To address the low recognition rate of OCR on handwritten waybills, the address recognition and correction system and method of this embodiment improve on the low accuracy of OCR in recognizing text sequences.

Fig. 1 is an overall flow block diagram of the recognition and correction system and method according to an embodiment of the present invention.

The recognition and correction system shown in Fig. 1 digitizes paper document information and comprises: an image acquisition unit S1 for acquiring an image of a paper document; a recognition unit S2 for recognizing the text in the document image to obtain a recognition result data set for the document; an information correction unit S3 for correcting the recognition result data set to obtain corrected document information; and a data storage unit S4 for storing an information database, the document image, the recognition result data set of the document, and the corrected document information. The information database is the training data of the information correction unit.

The address recognition and error-correction method is as follows:

Step 1: An image of a paper waybill is acquired by the image acquisition unit S1;

Step 2: The paper waybill image is processed by character recognition using the recognition unit S2 to obtain the recognition result data set of the waybill address;

Step 3: The information correction unit S3 performs error correction and completion on the recognition result data set to obtain the corrected waybill address information;

Step 4: A complete address information database is stored in advance in the data storage unit S4, and the waybill image, the waybill address recognition result data set, and the corrected waybill address information are stored.
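The four steps can be sketched as a small pipeline. This is only an illustrative outline: the names `WaybillRecord`, `DataStore`, and `process_waybill` are hypothetical, and the three callables stand in for units S1 to S3, whose internals the embodiment describes separately.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class WaybillRecord:
    """One stored record: image, raw recognition result, corrected address."""
    image: bytes
    recognized_text: str
    corrected_text: str


@dataclass
class DataStore:
    """Hypothetical in-memory stand-in for data storage unit S4."""
    address_database: List[str]                      # stored in advance
    records: List[WaybillRecord] = field(default_factory=list)

    def save(self, record: WaybillRecord) -> None:
        self.records.append(record)


def process_waybill(
    acquire_image: Callable[[], bytes],   # S1: image acquisition unit
    recognize: Callable[[bytes], str],    # S2: recognition unit (e.g. OCR)
    correct: Callable[[str], str],        # S3: information correction unit
    store: DataStore,                     # S4: data storage unit
) -> str:
    image = acquire_image()                                  # Step 1
    recognized = recognize(image)                            # Step 2
    corrected = correct(recognized)                          # Step 3
    store.save(WaybillRecord(image, recognized, corrected))  # Step 4
    return corrected
```

Passing the units in as callables keeps the sketch independent of any particular camera, OCR engine, or correction model.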

The recognition unit S2 preferably uses OCR technology to perform text recognition on the waybill image.

A complete address information database is stored in advance in the data storage unit S4. The database contains all local address information obtained by integrating historical waybill big data with geographic information, and it is continually updated as new geographic information becomes available. When the information correction unit S3 corrects the waybill address recognition result data set, it can compare against the complete address information database to complete the corrected waybill address information.
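To make the idea of comparing a partial address against a complete address database concrete, the sketch below uses plain string similarity from Python's standard `difflib` module. This is a deliberately simple stand-in and not the learned model of the embodiment, which uses the database as training data; the function name `complete_address` and the sample addresses are hypothetical.

```python
import difflib
from typing import List, Optional


def complete_address(partial: str, address_database: List[str]) -> Optional[str]:
    """Return the database entry most similar to `partial`, or None if the
    database is empty. Similarity is difflib's character-level ratio, used
    here only as a baseline illustration."""
    matches = difflib.get_close_matches(partial, address_database, n=1, cutoff=0.0)
    return matches[0] if matches else None


# Hypothetical sample data for illustration only.
database = [
    "Guangdong Shenzhen Longhua Guanlan Qinghu Park",
    "Beijing Haidian Zhongguancun Street 1",
    "Shanghai Pudong Century Avenue 100",
]
best = complete_address("Longhua Guanlan Qinghu", database)
```

A similarity lookup like this can only return entries that already exist in the database, which is one reason a trained sequence model is attractive for noisy handwritten input.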

The paper waybill image acquired by the image acquisition unit S1, the waybill address recognition result data set obtained by the recognition unit S2, and the waybill address information corrected by the information correction unit S3 are each also stored in the data storage unit S4.

The information correction unit S3 corrects the address information using the sequence-to-sequence algorithm combined with an attention mechanism.

Fig. 2 is an algorithm block diagram of the sequence-to-sequence algorithm according to an embodiment of the present invention.

The main idea of the sequence-to-sequence algorithm (hereinafter Seq2Seq) is to map an input sequence to an output sequence through a deep neural network model. The process consists of two stages: encoding the input and decoding the output. As shown in Fig. 2, the sequence "A B C <EOS>" is fed into the model one element at a time and mapped to the output "W X Y Z <EOS>".

The information correction unit S3 of this embodiment comprises an encoder Ec and a decoder Dc. The encoder compares the recognition result data set with the information database by computation, and the decoder maps the computation result to the corrected document information as output. The encoder Ec is the input-encoding part of the Seq2Seq algorithm and the decoder Dc is its output-decoding part. In Fig. 2, the left frame Ec belongs to the encoder and the right frame Dc belongs to the decoder. The computation proceeds as follows:

First comes the encoding stage. At each time step the Seq2Seq model receives the input x_t, which enters the model together with the previous output h_{t-1} for the encoding computation, giving the current output h_t:

h_t = f(x_t, h_{t-1})    (1)

where x_t is the input at each time step, corresponding to "A", "B", "C" in Fig. 2;

h_{t-1} is the state of the encoder at the previous time step;

h_t is the state of the encoder at the current time step;

f is the function of the output part of the encoding or decoding neural network.

The encoder Ec obtains the context vector c through the state-update part of the encoding computation. Combining the h_t generated at each step, c is calculated as follows:

c = q({h_1, ..., h_{T_x}})    (2)

where q is the function of the state-update part of the neural network.
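A minimal sketch of the encoding stage in pure Python follows. The text does not fix the form of f or q, so two common choices are assumed here purely for illustration: f is a plain tanh recurrent cell, and q returns the final hidden state. The weight matrices `w_x` and `w_h` and the all-zero initial state are likewise assumptions.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def encoder_step(x_t: Vector, h_prev: Vector, w_x: Matrix, w_h: Matrix) -> Vector:
    """h_t = f(x_t, h_{t-1}), with f assumed to be a plain tanh recurrent cell."""
    from_input = matvec(w_x, x_t)
    from_state = matvec(w_h, h_prev)
    return [math.tanh(a + b) for a, b in zip(from_input, from_state)]


def encode(inputs: List[Vector], w_x: Matrix, w_h: Matrix) -> Tuple[List[Vector], Vector]:
    """Run the encoder over the whole sequence; return (all states, context c)."""
    h = [0.0] * len(w_h)            # initial state assumed to be all zeros
    states: List[Vector] = []
    for x_t in inputs:
        h = encoder_step(x_t, h, w_x, w_h)
        states.append(h)
    context = h                     # q assumed to be "take the final state"
    return states, context
```

Returning every intermediate state alongside c is convenient because attention-based variants need the full list h_1, ..., h_{T_x} rather than a single summary vector.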

The Seq2Seq model then enters the decoding stage. The previous state s_{i-1}, the previous output y_{i-1}, and the context vector c computed by the encoder are combined and decoded to give the current state s_i:

s_i = f(s_{i-1}, y_{i-1}, c)    (3)

where s_{i-1} is the state of the decoder at the previous time step;

s_i is the state of the decoder at the current time step;

y_{i-1} is the decoded output at the previous time step, corresponding to "W", "X", "Y", "Z" in Fig. 2.

From the current state s_i, the context vector c, and the previous output y_{i-1}, the decoder Dc then obtains, given the input x, the probability distribution of the current output y_i over all possible outputs, where y_i is the decoded result:

p(y_i | y_1, ..., y_{i-1}, x) = g(y_{i-1}, s_i, c)    (4)
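One decoding step can be sketched as below. The forms of f and g are not specified in the text, so this sketch assumes a tanh layer over the concatenated inputs for f and a linear layer followed by a softmax for g. Representing y_{i-1} as a one-hot vector and the weight matrices `w_state` and `w_out` are also assumptions made for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def matvec(m: Matrix, v: Vector) -> Vector:
    """Multiply matrix m (a list of rows) by vector v."""
    return [sum(w * x for w, x in zip(row, v)) for row in m]


def softmax(scores: Vector) -> Vector:
    """Turn arbitrary scores into a probability distribution."""
    top = max(scores)               # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decoder_step(
    s_prev: Vector, y_prev: Vector, c: Vector, w_state: Matrix, w_out: Matrix
) -> Tuple[Vector, Vector]:
    """Return (s_i, p(y_i | ...)) for one decoding step.

    s_i = f(s_{i-1}, y_{i-1}, c) is assumed to be tanh(W_state [s; y; c]).
    g(y_{i-1}, s_i, c) is assumed to be softmax(W_out [y; s_i; c]).
    """
    s_i = [math.tanh(v) for v in matvec(w_state, s_prev + y_prev + c)]
    probabilities = softmax(matvec(w_out, y_prev + s_i + c))
    return s_i, probabilities
```

The decoded symbol y_i at this step would then be read off the returned distribution, for example by taking its most probable entry.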

However, if the machine translation of the waybill is performed with the Seq2Seq model alone, one and the same context vector c acts on the entire decoding process, which directly distorts the decoding result.

Therefore, the present invention introduces an attention mechanism (hereinafter the Attention mechanism) on top of the Seq2Seq model.

To make better use of the context vector c, the context vector c is associated with the input recognition result data set at each time step to form a weighted context vector.

Fig. 3 is an algorithm block diagram of the sequence-to-sequence algorithm combined with the attention mechanism according to an embodiment of the present invention.

The context vector c_i is calculated according to the following formula, where α_{ij} is the attention parameter, h_j is the encoder output at each time step, and T_x is the length of the input sequence:

c_i = Σ_{j=1}^{T_x} α_{ij} h_j    (5)

In the formula above, α_{ij} is calculated as:

α_{ij} = exp(e_{ij}) / Σ_{k=1}^{T_x} exp(e_{ik})    (6)

where e_{ij} and e_{ik} are intermediate variables of the attention computation, calculated as:

e_{ij} = a(s_{i-1}, h_j)    (7)

where a is the neural network unit that computes the attention mechanism.
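The weighted context vector can be sketched in a few lines. The text describes a as a neural network unit; to keep the example self-contained, a simple dot product is swapped in as the default scoring function, and the scores are assumed to be normalized with a softmax, as is usual for this kind of attention. The function names are hypothetical.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    """Dot-product score, a stand-in for the learned unit a(s_{i-1}, h_j)."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores: List[float]) -> List[float]:
    """Normalize scores e_ij into attention weights alpha_ij."""
    top = max(scores)               # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_context(
    s_prev: Vector,
    encoder_states: List[Vector],
    score: Callable[[Vector, Vector], float] = dot,
) -> Tuple[Vector, List[float]]:
    """Return (c_i, alpha_i) for one decoding step.

    e_ij     = score(s_{i-1}, h_j)
    alpha_ij = softmax over j of e_ij
    c_i      = sum over j of alpha_ij * h_j
    """
    e = [score(s_prev, h_j) for h_j in encoder_states]
    alpha = softmax(e)
    dim = len(encoder_states[0])
    c_i = [sum(a * h[k] for a, h in zip(alpha, encoder_states)) for k in range(dim)]
    return c_i, alpha
```

Because the weights are recomputed from s_{i-1} at every step, each output position gets its own c_i instead of sharing a single context vector across the whole decoding process.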

The calculated context vector c_i is then substituted into the Seq2Seq algorithm model above to compute the corresponding encoded and decoded outputs. As shown in Fig. 3, X1, X2, ..., Xt are the inputs at each time step, Y1, Y2, ..., Yt are the outputs at each time step, and c1, c2, ..., ct are the corresponding context vectors at each time step.

The address correction algorithm model above, based on the Seq2Seq algorithm combined with the attention mechanism, is applied to a real handwritten waybill recognition scenario as follows:

First, an express waybill is obtained. As shown in Fig. 4, in the original waybill image R1 is the handwritten area of the sender's address and R2 is the handwritten area of the recipient's address.

Then, the image is processed by OCR to obtain the recognition result data set returned by OCR (for example, one OCR result: Longhua Jie Gu power You Hu states east).

The recognition result is passed as an input sequence, element by element, to the Seq2Seq model combined with the attention mechanism. After encoding and decoding by the model, the correction result with the highest confidence is obtained (for example: Guangdong Province, Shenzhen City, Longhua New District, Guanlan Street, Silicon Valley Power Qinghu Park).
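The text does not say how the highest-confidence result is selected. One simple possibility, assumed here purely for illustration, is greedy decoding: at each step the most probable token is kept, and the product of the per-step probabilities serves as a confidence score. The `step_fn` callable is a hypothetical stand-in for one decoder step of the trained model.

```python
from typing import Callable, List, Tuple

# A step function takes the tokens decoded so far and returns candidate
# (token, probability) pairs for the next position. It is a hypothetical
# stand-in for one decoder step of the trained model.
StepFn = Callable[[List[str]], List[Tuple[str, float]]]


def greedy_decode(
    step_fn: StepFn, eos: str = "<EOS>", max_len: int = 50
) -> Tuple[List[str], float]:
    """Pick the most probable token at each step until <EOS> or max_len.

    Returns the decoded tokens and a confidence score, defined here as the
    product of the chosen per-step probabilities.
    """
    tokens: List[str] = []
    confidence = 1.0
    for _ in range(max_len):
        token, prob = max(step_fn(tokens), key=lambda pair: pair[1])
        confidence *= prob
        if token == eos:
            break
        tokens.append(token)
    return tokens, confidence
```

Beam search is a common alternative when several candidate corrections need to be ranked against each other rather than following only the single best token at each step.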

As can be seen, the correction result not only accurately recognizes the original order data but also corrects and completes the address. With the Seq2Seq correction algorithm combined with the attention mechanism, the recognition accuracy of handwritten waybills is greatly improved.

The above embodiment describes the present invention using address recognition and correction for express waybills, but the invention is not limited to that application; it is applicable to any other field that requires text recognition and correction of documents.

The above embodiment is a preferred embodiment of the present invention and is not intended to limit it. All modifications and substitutions made within the spirit and principles of the present invention fall within its scope of protection.

Claims

1. A recognition and correction system for digitizing paper document information, characterized by comprising:

an image acquisition unit for acquiring an image of a paper document;

a recognition unit for recognizing the text in the document image to obtain a recognized text sequence of the document;

an information correction unit for correcting the recognized text sequence to obtain corrected document information.

2. The recognition and correction system according to claim 1, further comprising a data storage unit for storing an information database, the document image, the recognized text sequence of the document, and the corrected document information,

wherein the information database is the training data of the information correction unit.

3. The recognition and correction system according to claim 2, wherein the information correction unit uses a sequence-to-sequence algorithm model to map the input recognized text sequence to an output text sequence.

4. The recognition and correction system according to claim 3, wherein the sequence-to-sequence algorithm model comprises an encoder and a decoder, the encoder compares the recognized text sequence with the information database, and the decoder maps the comparison result to the corrected document information as output.

5. The recognition and correction system according to claim 4, wherein the encoder obtains a context vector by computing the state-update part of the text sequence through encoding, and the decoder decodes according to the current state and the context vector to obtain the corrected document information.

6. The recognition and correction system according to claim 5, wherein the information correction model uses the sequence-to-sequence algorithm model combined with an attention mechanism to map the input recognized text sequence to the output text sequence.

7. The recognition and correction system according to claim 6, wherein the attention mechanism associates the context vector with the recognized text sequence at each time step to form a weighted context vector, and the decoder decodes to obtain the corrected document information.

8. The recognition and correction system according to claim 1, wherein the recognition unit uses OCR technology to perform text recognition on the document image.

9. The recognition and correction system according to any one of claims 1 to 8, wherein the paper document information is express waybill address information.

10. The recognition and correction system according to claim 9, wherein the information database is complete-address training data obtained by integrating big data and geographic information.

11. The recognition and correction system according to claim 10, wherein the information correction unit compares the recognition result data set with the complete-address training data, performs error correction and completion on the express waybill address, and maps it to complete address information as output.

12. A recognition and correction method for digitizing paper document information, characterized by comprising the following steps:

acquiring an image of a paper document with an image acquisition unit;

processing the document image by character recognition with a recognition unit to obtain a recognized text sequence of the document;

correcting the recognized text sequence with an information correction unit to obtain corrected document information.

13. The recognition and correction method according to claim 12, further comprising the steps of storing an information database in a data storage unit in advance, and storing the document image, the recognition result data set of the document, and the corrected document information,