CN107463928A

CN107463928A - Text sequence error correction algorithm, system, and device based on OCR and bidirectional LSTM

Info

Publication number: CN107463928A
Application number: CN201710630581.0A
Authority: CN
Inventors: 王志成; 邝展豪; 高磊; 刘志欣; 王亮
Original assignee: SF Technology Co Ltd
Current assignee: SF Technology Co Ltd; SF Tech Co Ltd
Priority date: 2017-07-28
Filing date: 2017-07-28
Publication date: 2017-12-12

Abstract

A text sequence error correction algorithm, system, and device based on OCR and bidirectional LSTM. The method comprises: S1, acquiring a text image; S2, preprocessing the text image by OCR to obtain a first sequence set X = {x₀, x₁, ..., x_m}; S3, inputting the forward sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} into an encoder built from a bidirectional LSTM to obtain a context vector c; S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y. The system comprises an image acquisition module, an OCR processing module, an encoder built from a bidirectional LSTM, and a decoder built from a bidirectional LSTM. The device carries a program that executes the method.

Description

Text sequence error correction algorithm, system, and device based on OCR and bidirectional LSTM

Technical field

The present invention relates to the field of machine translation in image text recognition, and more particularly to a text sequence error correction algorithm, system, and device based on OCR and bidirectional LSTM.

Background art

In recent years, with the rapid development of machine learning, machine translation algorithms have emerged one after another, and OCR text recognition algorithms are among the most widely used. OCR (Optical Character Recognition) refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting patterns of dark and light, and then uses a character recognition method to translate those shapes into computer text. In other words, for printed characters, the text in a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into a text format that word-processing software can further edit and process.

However, because of factors such as image illumination and shooting angle, the accuracy of OCR text recognition algorithms rarely reaches the expected level.

Summary of the invention

To solve the above technical problem, the present invention proposes a text sequence error correction algorithm, system, and device based on OCR and bidirectional LSTM, which can effectively improve the accuracy of text sequence recognition.

To achieve this goal, the technical solution of the present invention is as follows:

A text sequence error correction algorithm based on OCR and bidirectional LSTM, applicable to recognizing text in images, comprising the steps of:

S1, acquiring a text image;

S2, preprocessing the text image by OCR to obtain a first sequence set X = {x₀, x₁, ..., x_m};

S3, inputting the forward sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} into an encoder built from a bidirectional LSTM to obtain a context vector c;

S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
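The data flow of steps S1 to S4 can be sketched as a simple function chain. This is only an illustration under stated assumptions: the function `correct_text` and the toy stand-ins `fake_ocr`, `fake_encode`, and `fake_decode` are hypothetical names introduced here, and they do not implement real OCR or a real bidirectional LSTM.

```python
from typing import Callable, List, Sequence

Symbols = List[str]


def correct_text(
    image: bytes,
    ocr: Callable[[bytes], Symbols],
    encode: Callable[[Sequence[str], Sequence[str]], List[float]],
    decode: Callable[[List[float]], Symbols],
) -> Symbols:
    """Chain steps S2 to S4 for an image already acquired in S1."""
    x = ocr(image)                    # S2: first sequence set X
    c = encode(x, list(reversed(x)))  # S3: forward and reversed order -> c
    return decode(c)                  # S4: second sequence set Y


# Toy stand-ins (hypothetical) so the data flow can be run end to end.
def fake_ocr(image: bytes) -> Symbols:
    return list(image.decode("ascii"))


def fake_encode(fwd: Sequence[str], bwd: Sequence[str]) -> List[float]:
    return [float(len(fwd)), float(len(bwd))]


def fake_decode(c: List[float]) -> Symbols:
    return ["y"] * int(c[0])


y = correct_text(b"SF123", fake_ocr, fake_encode, fake_decode)
```

Passing the three stages in as callables keeps the sketch independent of any particular OCR engine or network library.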

The context vector c in step S3 is:

c = Φ({h₁, h₂, …, h_TS});

h_t = f(x_t, h_{t-1}).

The second sequence set Y in step S4 is:

Y = (y₀, y₁, …, y_n);

s_t = f(y_{t-1}, s_{t-1}, c);

p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).

The text image in step S1 is an express waybill image.

The threshold of the OCR preprocessing in step S2 is the minimum confidence threshold permitted by the system.

A text sequence error correction system based on OCR and bidirectional LSTM, comprising:

an image acquisition module, for acquiring a text image;

an OCR processing module, for performing OCR preprocessing on the text image to obtain a first sequence set X = {x₀, x₁, ..., x_m};

an encoder built from a bidirectional LSTM, for encoding the forward sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} to obtain a context vector c;

a decoder built from a bidirectional LSTM, for decoding the context vector c to obtain a second sequence set Y.

A text sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program, the program being run to perform:

S1, acquiring a text image;

The beneficial effect of the invention is that the combined use of OCR and a bidirectional LSTM algorithm improves the accuracy of text recognition.

Brief description of the drawings

Fig. 1 shows a flow chart according to an embodiment of the present application;

Fig. 2 shows the operational flow chart of the bidirectional LSTM according to an embodiment of the present application;

Fig. 3 shows the encoding flow chart of the bidirectional LSTM according to an embodiment of the present application.

Detailed description

For a better understanding of the technical solution, the invention is further described below with reference to Figs. 1 to 3.

As shown in Fig. 1, the text sequence error correction algorithm based on OCR and bidirectional LSTM is applicable to recognizing text in images. It combines artificial intelligence and big data to process the input text queue in real time, enabling real-time processing and application of text information. It comprises the following steps:

Acquire a text image and perform OCR preprocessing.

The original input is express waybill image information. It is preprocessed by OCR text recognition to obtain an OCR result queue; the OCR result queue serves as the input to the language model and, combined with a large text dictionary, yields the desired output sequence.

To overcome the relatively low accuracy of OCR in recognizing text sequences (29.65% in sample statistics), the algorithm sets a minimum confidence threshold: values greater than this threshold are taken as the final OCR output character queue and fed into the language model for computation.
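The thresholding step can be sketched as follows. This is a minimal illustration, assuming the OCR stage returns (character, confidence) pairs; the function name `filter_by_confidence`, the sample data, and the threshold value 0.5 are hypothetical and not taken from the patent.

```python
from typing import List, Tuple


def filter_by_confidence(
    ocr_chars: List[Tuple[str, float]], min_confidence: float
) -> List[str]:
    """Keep only characters whose OCR confidence is greater than the threshold."""
    return [ch for ch, conf in ocr_chars if conf > min_confidence]


# Hypothetical OCR result: each character paired with its confidence score.
ocr_result = [("S", 0.98), ("F", 0.95), ("#", 0.12), ("1", 0.88)]
queue = filter_by_confidence(ocr_result, min_confidence=0.5)
```

The comparison is strict (`>`), matching the statement that only values greater than the threshold are kept.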

Because the range of contextual information that a standard recurrent neural network (RNN) can access is limited, the influence of the hidden-layer input on the network output weakens as the network keeps recurring. As shown in Fig. 2, to solve this problem, a bidirectional LSTM model (long short-term memory network) maps one sequence given as input to another sequence produced as output; this process consists of two stages, encoding the input and decoding the output. For example, given the sequence "x₀, x₁, ..., x_m", feeding it into the model in order yields the mapped output "y₀, y₁, …, y_n".
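The weakening influence of early inputs can be illustrated with a toy linear recurrence h_t = w * h_{t-1} + x_t. The weight `w` and the function name are assumptions made for illustration only; this is not the patent's model. In this simplified case the sensitivity of h_t to the first input x₀ is w^t, which shrinks geometrically when |w| < 1.

```python
from typing import List


def influence_of_first_input(w: float, steps: int) -> List[float]:
    """For h_t = w * h_{t-1} + x_t, the derivative of h_t with respect to x_0 is w ** t."""
    return [w ** t for t in range(steps)]


# With an assumed weight of 0.5, the influence halves at every time step.
decay = influence_of_first_input(w=0.5, steps=5)
```

After only four further steps the first input contributes one sixteenth of its original weight, which is the kind of decay that gated cells such as LSTM are designed to mitigate.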

The core framework of the bidirectional LSTM is Encoder-Decoder. Put simply, after the input sequence is fed into the model, the encoder first encodes it into a fixed-length vector, the context vector. Once encoding is complete, the context vector enters the decoder to be decoded; using a locally optimal solution algorithm, a metric is chosen and the dictionary is searched before the device produces its output, so as to obtain the best choice.
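One way to picture the locally optimal choice followed by a dictionary search is the sketch below. It assumes the decoder yields a probability table per step and uses the similarity ratio from Python's standard `difflib` module as the metric; the patent does not name a specific metric, and the function names and sample data are hypothetical.

```python
import difflib
from typing import Dict, List


def greedy_decode(step_probs: List[Dict[str, float]]) -> str:
    """Pick the locally best symbol at every step (no look-ahead)."""
    return "".join(max(p, key=p.get) for p in step_probs)


def snap_to_dictionary(candidate: str, dictionary: List[str]) -> str:
    """Return the dictionary entry most similar to the candidate, if any."""
    matches = difflib.get_close_matches(candidate, dictionary, n=1, cutoff=0.0)
    return matches[0] if matches else candidate


# Hypothetical per-step distributions whose greedy reading is "Shenzen".
step_probs = [{ch: 0.9, "?": 0.1} for ch in "Shenzen"]
raw = greedy_decode(step_probs)
fixed = snap_to_dictionary(raw, ["Shenzhen", "Shanghai", "Beijing"])
```

Here the greedy reading contains a spelling error, and the dictionary search replaces it with the closest known entry.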

Specifically, for a given input first sequence set X, the Encoder-Decoder framework generates the desired target second sequence set Y. X and Y each consist of their own sequence:

X = {x₀, x₁, ..., x_m}, whose order is the order of the character string itself;

Y = (y₀, y₁, …, y_n).

Here m and n are positive integers, where m is the length of the input sequence minus 1 and n is the length of the output sequence minus 1; m and n are not necessarily equal, and output stops when the decoder emits the end symbol. First, as shown in the equations below, the input sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} pass through the encoder built from the bidirectional LSTM, which recursively produces each hidden node h_t one by one; the weighted sum of the hidden nodes h_t is the context vector c. As for the concept of a hidden node: every node in a neural network other than the input and output nodes can be called a hidden node, though here it is more accurately described as "the context vector produced at each time step". Fig. 3 shows the process by which the bidirectional LSTM encoding produces c1 and c2, two context vectors that represent the forward order and the reverse order, respectively.

h_t = f(x_t, h_{t-1})

c = Φ({h₁, h₂, …, h_TS})

Here h denotes the context vector output by the encoder at each time step, and TS denotes the final time step. Φ denotes the process of stacking and fusing the h of all time steps across the encoder. f denotes the function (process) by which the encoder, at a given time step, produces the current context vector from the previous time step's context vector and the current input.

After encoding is complete, the context vectors c1 and c2 produced by forward and reverse encoding are merged (usually by direct concatenation); the result, as the encoder's final context vector, is input to the decoder to obtain the final sequence set Y, which is the required output sequence.
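The encoder recurrence and the merging of c1 and c2 can be sketched in plain Python. This is a toy under several assumptions: `step` is a fixed tanh update standing in for f (it is not an LSTM cell), `phi` uses an equal-weight average as a stand-in for Φ, and the inputs are plain numbers rather than learned character embeddings. All names and constants are hypothetical.

```python
import math
from typing import List, Sequence

Vector = List[float]


def step(x_t: float, h_prev: Vector) -> Vector:
    """Stand-in for h_t = f(x_t, h_{t-1}); a fixed tanh update, not an LSTM cell."""
    return [math.tanh(0.5 * x_t + 0.5 * h) for h in h_prev]


def encode(xs: Sequence[float], hidden_size: int = 2) -> List[Vector]:
    """Run the recurrence over the sequence and return every hidden state h_t."""
    h = [0.0] * hidden_size
    states = []
    for x_t in xs:
        h = step(x_t, h)
        states.append(h)
    return states


def phi(states: List[Vector]) -> Vector:
    """Stand-in for Φ: a weighted sum of all h_t with equal weights 1/len(states)."""
    n = len(states)
    return [sum(h[i] for h in states) / n for i in range(len(states[0]))]


xs = [0.1, 0.4, -0.2]       # toy numeric inputs
c1 = phi(encode(xs))        # context vector for the forward order
c2 = phi(encode(xs[::-1]))  # context vector for the reversed order
c = c1 + c2                 # list concatenation ("direct splicing"), not addition
```

Because the recurrence is order-sensitive, c1 and c2 differ, and concatenating them gives the decoder a summary of the sequence read in both directions.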

s_t = f(y_{t-1}, s_{t-1}, c);

p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).

Here s denotes the context vector produced by the decoder at each time step. f denotes the function (process) by which the decoder builds the current context vector from the previous time step's decoder context vector, the previous output, and the encoder's final context vector. g denotes the process by which the decoder produces the current output from the current decoder context vector, the previous decoder output, and the encoder's final context vector. p denotes the probability of producing the next output given everything that came before it (the previous outputs y_{<t} and the input X); X denotes the input dictionary vectors that the encoder receives at each time step. The parameter t above is the time step: 0 ≤ t ≤ m in the encoder and 0 ≤ t ≤ n in the decoder.
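The decoder equations can be illustrated with a toy loop that stops at an end symbol. This sketch makes several assumptions: `f` and `g` are simple hand-written stand-ins rather than trained LSTM components, the vocabulary `VOCAB` and all constants are hypothetical, and the start symbol reuses index 0 for brevity.

```python
import math
from typing import List

VOCAB = ["<eos>", "a", "b"]  # hypothetical toy vocabulary; "<eos>" is the end symbol


def f(y_prev: int, s_prev: List[float], c: List[float]) -> List[float]:
    """Stand-in for s_t = f(y_{t-1}, s_{t-1}, c); a tanh update, not an LSTM cell."""
    return [math.tanh(0.3 * y_prev + 0.5 * s + 0.2 * ci) for s, ci in zip(s_prev, c)]


def g(y_prev: int, s_t: List[float], c: List[float]) -> List[float]:
    """Stand-in for p(y_t | y_<t, X) = g(y_{t-1}, s_t, c): softmax over toy logits."""
    logits = [
        k * sum(s_t) - 0.1 * k * y_prev + 0.05 * k * sum(c)
        for k in range(len(VOCAB))
    ]
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def decode(c: List[float], max_len: int = 5) -> List[str]:
    """Greedy decoding: stop at the end symbol or after max_len outputs."""
    s = [0.0] * len(c)
    y_prev = 0  # index 0 doubles as the start symbol in this toy
    out: List[str] = []
    for _ in range(max_len):
        s = f(y_prev, s, c)
        probs = g(y_prev, s, c)
        y_t = max(range(len(probs)), key=probs.__getitem__)
        if VOCAB[y_t] == "<eos>":
            break
        out.append(VOCAB[y_t])
        y_prev = y_t
    return out


probs = g(0, [0.1, 0.2], [0.3, 0.4])
y = decode([0.3, 0.4])
```

The `max_len` guard is an added safety limit so that the loop terminates even if the end symbol is never selected.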

Image acquisition module, for acquiring a text image;

S1, acquiring a text image;

The above description is only the preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of the invention involved in this application is not limited to technical solutions formed by the specific combination of the above technical features; it also covers other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features having similar functions disclosed in this application.

Claims

1. A text sequence error correction algorithm based on OCR and bidirectional LSTM, applicable to recognizing text in images, characterized in that it comprises the steps of:

S1, acquiring a text image;

S3, inputting the forward sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} into an encoder built from a bidirectional LSTM to obtain a context vector c;

2. The text sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the context vector c in step S3 is:

c = Φ({h₁, h₂, …, h_TS});

h_t = f(x_t, h_{t-1}).

3. The text sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the second sequence set Y in step S4 is:

Y = (y₀, y₁, …, y_n);

s_t = f(y_{t-1}, s_{t-1}, c);

p(y_t | y_{<t}, X) = g(y_{t-1}, s_t, c).

4. The text sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the text image in step S1 is an express waybill image.

5. The text sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the threshold of the OCR preprocessing in step S2 is the minimum confidence threshold permitted by the system.

6. A text sequence error correction system based on OCR and bidirectional LSTM, characterized in that it comprises:

an image acquisition module, for acquiring a text image;

an OCR processing module, for performing OCR preprocessing on the text image to obtain a first sequence set X = {x₀, x₁, ..., x_m};

an encoder built from a bidirectional LSTM, for encoding the forward sequence {x₀, x₁, …, x_m} and the reversed sequence {x_m, x_{m-1}, ..., x₀} to obtain a context vector c;

a decoder built from a bidirectional LSTM, for decoding the context vector c to obtain a second sequence set Y.

7. A text sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program, characterized in that the program is run to perform:

S1, acquiring a text image;