CN107463928A - Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM - Google Patents

Info

Publication number
CN107463928A
Authority
CN
China
Prior art keywords
ocr
way lstm
linguistic context
error correction
context vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710630581.0A
Other languages
Chinese (zh)
Inventor
王志成
邝展豪
高磊
刘志欣
王亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
SF Technology Co Ltd
SF Tech Co Ltd
Original Assignee
SF Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by SF Technology Co Ltd
Priority to CN201710630581.0A
Publication of CN107463928A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/98 Detection or correction of errors, e.g. by rescanning the pattern or by human intervention; Evaluation of the quality of the acquired patterns
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V20/00 Scenes; Scene-specific elements
    • G06V20/60 Type of objects
    • G06V20/62 Text, e.g. of license plates, overlay texts or captions on TV images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Abstract

A word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM. The method includes: S1, acquiring a text image; S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$; S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c; S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y. The system includes an image acquisition module, an OCR processing module, an encoder built from a bidirectional LSTM, and a decoder built from a bidirectional LSTM. The device carries a program that executes the method.

Description

Word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM
Technical field
The present invention relates to machine translation in the image text recognition process, and in particular to a word sequence error correction algorithm, system and device based on OCR and bidirectional (two-way) LSTM.
Background
In recent years, with the rapid development of machine learning, machine translation algorithms of all kinds have emerged one after another; among the widely used ones are OCR text recognition algorithms. OCR (Optical Character Recognition) refers to the process in which an electronic device (such as a scanner or digital camera) examines characters printed on paper, determines their shapes by detecting dark and light patterns, and then translates the shapes into computer text using a character recognition method. That is, for printed characters, the text in a paper document is optically converted into a black-and-white dot-matrix image file, and recognition software converts the text in the image into text format for further editing and processing by word-processing software.
However, owing to factors such as image illumination and angle, the accuracy of OCR text recognition algorithms rarely meets expectations.
Summary of the invention
To solve the above technical problem, the present invention proposes a word sequence error correction algorithm, system and device based on OCR and bidirectional LSTM, which can effectively improve the accuracy of word sequence recognition.
To achieve the above object, the technical solution of the present invention is as follows:
A word sequence error correction algorithm based on OCR and bidirectional LSTM, applicable to the recognition of text in images, comprises the steps:
S1, acquiring a text image;
S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The context vector c in step S3 is:
$c = \Phi(\{h_1, h_2, \ldots, h_{TS}\})$;
$h_t = f(x_t, h_{t-1})$.
The second sequence set Y in step S4 is:
$Y = (y_0, y_1, \ldots, y_n)$;
$s_t = f(y_{t-1}, s_{t-1}, c)$;
$p(y_t \mid y_{<t}, X) = g(y_{t-1}, s_t, c)$.
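Read together, the encoder and decoder formulas above match the usual encoder-decoder formulation. The block below collects them in one place and adds the product form of the sequence probability. That last line is the standard reading of such a model and is an assumption here, since the source states only the per-step terms.

```latex
\begin{aligned}
h_t &= f(x_t, h_{t-1}), & c &= \Phi(\{h_1, h_2, \ldots, h_{TS}\}),\\
s_t &= f(y_{t-1}, s_{t-1}, c), & p(y_t \mid y_{<t}, X) &= g(y_{t-1}, s_t, c),\\
p(Y \mid X) &= \prod_{t=0}^{n} p(y_t \mid y_{<t}, X). &&
\end{aligned}
```

Note that the source uses the same letter f for the encoder and decoder recurrences; they are separate functions.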
The text image in step S1 is an express waybill image.
The threshold used in the OCR preprocessing of step S2 is the minimum confidence threshold allowed by the system.
A word sequence error correction system based on OCR and bidirectional LSTM comprises:
an image acquisition module for acquiring a text image;
an OCR processing module for performing OCR preprocessing on the text image to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
an encoder built from a bidirectional LSTM for encoding the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
A word sequence error correction device based on OCR and bidirectional LSTM comprises a computer-readable medium storing a computer program, the program being run to perform:
S1, acquiring a text image;
S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The beneficial effect of the invention is that combining OCR with a bidirectional LSTM algorithm improves the accuracy of text recognition.
Brief description of the drawings
Fig. 1 shows a flow chart according to an embodiment of the present application.
Fig. 2 shows an operational flow chart of the bidirectional LSTM according to an embodiment of the present application.
Fig. 3 shows an encoding flow chart of the bidirectional LSTM according to an embodiment of the present application.
Detailed description
For a better understanding of the technical solution, the invention is further described below with reference to Figs. 1 to 3.
As shown in Fig. 1, the word sequence error correction algorithm based on OCR and bidirectional LSTM is applicable to the recognition of text in images. It combines artificial intelligence and big data to operate on the input text queue in real time, enabling real-time processing and application of text information. It includes the following steps:
Acquire a text image and perform OCR preprocessing.
The original input is express waybill image information, which is preprocessed by OCR text recognition to obtain an OCR result queue. The OCR result queue serves as the input of the language model and is combined with a large-scale text dictionary to obtain the desired output sequence.
To address the relatively low accuracy of OCR in recognizing word sequences (29.65% in an exemplary statistic), this algorithm sets a minimum confidence threshold: values above the threshold are taken as the final OCR output character queue, which is fed into the language model for computation.
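The thresholding step described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `(character, confidence)` pair format, the name `filter_by_confidence` and the sample values are all hypothetical, since the source does not specify how the OCR results are represented.

```python
from typing import List, Tuple

# Hypothetical OCR output format (an assumption, not from the source):
# one (character, confidence) pair per recognized character,
# with confidence expressed as a value in [0, 1].
OcrResult = Tuple[str, float]


def filter_by_confidence(results: List[OcrResult], min_confidence: float) -> List[str]:
    """Keep only the characters whose confidence is strictly greater than the threshold."""
    return [char for char, conf in results if conf > min_confidence]


# Illustrative values only.
ocr_queue = [("S", 0.98), ("F", 0.95), ("#", 0.12), ("1", 0.80)]
print(filter_by_confidence(ocr_queue, min_confidence=0.5))  # ['S', 'F', '1']
```

The strict comparison follows the wording "greater than the threshold"; whether values exactly equal to the threshold are kept is not stated in the source.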
Because the range of context a standard recurrent neural network (RNN) can access is limited, the influence of a hidden-layer input on the network output weakens as the network keeps recurring. As shown in Fig. 2, to solve this problem a bidirectional LSTM (long short-term memory network) model maps one sequence given as input to another sequence produced as output; this process consists of two stages, encoding the input and decoding the output. For example, an existing sequence "$x_0, x_1, \ldots, x_m$", after being fed into the model element by element, is mapped to the output "$y_0, y_1, \ldots, y_n$".
The core framework of the bidirectional LSTM is Encoder-Decoder. Put simply, after the input sequence is fed into the model, the encoder first compiles it into a fixed-length vector, the context vector. Once encoding is complete, the context vector enters the decoder to be decoded; using a locally optimal solution algorithm, a metric is chosen and the dictionary is searched before the device produces its output, so as to obtain the best choice.
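The source does not say which metric is chosen or how the dictionary is searched. As one possible reading, the sketch below looks up the closest dictionary entry for a decoded candidate using the similarity ratio from Python's standard `difflib` module. The function name, the cutoff value and the sample dictionary are assumptions made for illustration.

```python
import difflib
from typing import List


def best_dictionary_match(candidate: str, dictionary: List[str], cutoff: float = 0.6) -> str:
    """Return the dictionary entry most similar to `candidate`.

    Similarity is difflib's SequenceMatcher ratio (a swapped-in metric, not
    one named by the source). If no entry reaches `cutoff`, the candidate is
    returned unchanged.
    """
    matches = difflib.get_close_matches(candidate, dictionary, n=1, cutoff=cutoff)
    return matches[0] if matches else candidate


# Hypothetical dictionary of place names.
place_names = ["Shenzhen", "Shanghai", "Guangzhou"]
print(best_dictionary_match("Shenzhem", place_names))  # Shenzhen
```

Returning the candidate unchanged when nothing is close enough avoids forcing a correction onto text the dictionary does not cover.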
Specifically, for a given input first sequence set X, the Encoder-Decoder framework generates the desired target second sequence set Y. X and Y each consist of their own sequence.
$X = \{x_0, x_1, \ldots, x_m\}$, whose order is the order of the character string itself;
$Y = (y_0, y_1, \ldots, y_n)$.
Here m and n are positive integers; m is the length of the input sequence minus 1 and n is the length of the output sequence minus 1. m and n are not necessarily equal; output stops when the decoder emits the end symbol. First, as shown in the formulas below, the input sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ are passed through the encoder built from the bidirectional LSTM, which recursively obtains the hidden nodes $h_t$ one by one; the weighted sum of the hidden nodes $h_t$ is the context vector c. On the concept of a hidden node: in a neural network, every node other than the input and output nodes can be called a hidden node; here it is more accurately described as the "context vector produced at each time step". Fig. 3 shows the process by which the bidirectional LSTM encodes to obtain c1 and c2, where c1 and c2 are two context vectors representing the forward order and the reverse order respectively.
$h_t = f(x_t, h_{t-1})$
$c = \Phi(\{h_1, h_2, \ldots, h_{TS}\})$
Here h is the context vector output by the encoder at each time step, and TS is the last time step. $\Phi$ is the process that stacks and fuses the h of all time steps on the encoder. f is the function (process) by which the encoder, at a given time step, produces the current context vector from the context vector of the previous time step and the current input.
After encoding is complete, the context vectors c1 and c2 generated by forward and reverse encoding are merged (usually by direct concatenation) and input to the decoder as the encoder's final context vector, yielding the final sequence set Y, which is the desired output sequence.
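The encoding and merging steps above can be illustrated with a toy sketch. It makes several assumptions that are not in the source: an element-wise tanh recurrence stands in for the LSTM cell, inputs are scalar floats, $\Phi$ is taken to be a uniform average of the hidden states, and both directions share one set of weights for brevity (a real bidirectional encoder would normally learn separate weights per direction).

```python
import math
from typing import List

Vector = List[float]


def rnn_step(x_t: float, h_prev: Vector, w_in: Vector, w_rec: Vector) -> Vector:
    """One step h_t = f(x_t, h_{t-1}); a tanh cell used as a stand-in for an LSTM cell."""
    return [math.tanh(wi * x_t + wr * hp) for wi, wr, hp in zip(w_in, w_rec, h_prev)]


def encode(xs: List[float], w_in: Vector, w_rec: Vector) -> Vector:
    """Run the recurrence over a non-empty xs and fuse the states (Phi = uniform average, an assumption)."""
    h = [0.0] * len(w_in)
    states = []
    for x_t in xs:
        h = rnn_step(x_t, h, w_in, w_rec)
        states.append(h)
    return [sum(column) / len(states) for column in zip(*states)]


def bidirectional_context(xs: List[float], w_in: Vector, w_rec: Vector) -> Vector:
    """Encode the forward and reversed sequences and merge c1, c2 by direct concatenation."""
    c1 = encode(xs, w_in, w_rec)        # forward order x_0 ... x_m
    c2 = encode(xs[::-1], w_in, w_rec)  # reverse order x_m ... x_0
    return c1 + c2


# Hypothetical inputs and weights, chosen only for illustration.
context = bidirectional_context([0.1, 0.5, -0.3], w_in=[0.5, -0.2], w_rec=[0.3, 0.8])
print(len(context))  # 4
```

The concatenated vector has twice the hidden size, which is why the decoder in such a design must accept a context of that width.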
$s_t = f(y_{t-1}, s_{t-1}, c)$;
$p(y_t \mid y_{<t}, X) = g(y_{t-1}, s_t, c)$.
Here s is the context vector produced by the decoder at each time step. f is the function (process) by which the decoder, at the current time step, builds the current context vector from the decoder context vector of the previous time step, the previous output, and the final context vector output by the encoder. g is the process by which the decoder produces the current output from the current decoder context vector, the previous decoder output, and the final context vector output by the encoder. p is the probability of producing the next output given all previous inputs; X is the input dictionary vector received by the encoder at each time step. The parameter t above is the time step: $0 \le t \le m$ in the encoder and $0 \le t \le n$ in the decoder.
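The decoder recurrence above can be sketched as a greedy loop that picks the most probable symbol at each step and stops at the end symbol. Everything concrete here is hypothetical: the names `greedy_decode`, `toy_f` and `toy_g`, the start and end symbols, the softmax over raw scores, and the toy functions that stand in for a trained decoder.

```python
import math
from typing import Any, Callable, List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Turn raw scores into probabilities (numerically stabilized)."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def greedy_decode(c: Any, vocab: List[str], f: Callable, g: Callable,
                  start_symbol: str, end_symbol: str,
                  initial_state: Any, max_len: int = 50) -> List[str]:
    """Apply s_t = f(y_{t-1}, s_{t-1}, c) and p(y_t | ...) = softmax(g(y_{t-1}, s_t, c)),
    emitting the locally most probable symbol until the end symbol appears."""
    s, y_prev, output = initial_state, start_symbol, []
    for _ in range(max_len):
        s = f(y_prev, s, c)
        probs = softmax(g(y_prev, s, c))
        y_t = vocab[max(range(len(vocab)), key=probs.__getitem__)]
        if y_t == end_symbol:
            break
        output.append(y_t)
        y_prev = y_t
    return output


# Toy stand-ins for a trained decoder: the state is a step counter and the
# scores favor the vocabulary entry at position (state - 1).
VOCAB = ["a", "b", "<end>"]


def toy_f(y_prev: str, s_prev: int, c: Any) -> int:
    return s_prev + 1


def toy_g(y_prev: str, s_t: int, c: Any) -> List[float]:
    target = min(s_t - 1, len(VOCAB) - 1)
    return [1.0 if i == target else 0.0 for i in range(len(VOCAB))]


print(greedy_decode(None, VOCAB, toy_f, toy_g, "<s>", "<end>", initial_state=0))  # ['a', 'b']
```

The `max_len` guard is a design choice for the sketch: it bounds the loop in case the end symbol is never produced.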
A word sequence error correction system based on OCR and bidirectional LSTM comprises:
an image acquisition module for acquiring a text image;
an OCR processing module for performing OCR preprocessing on the text image to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
an encoder built from a bidirectional LSTM for encoding the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
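The four modules listed above can be wired together as in the following sketch. It only shows the data flow between modules; the class name `CorrectionSystem`, the callable signatures and the stub modules in the usage example are hypothetical and stand in for real image capture, OCR and trained networks.

```python
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class CorrectionSystem:
    """Hypothetical container for the four modules; each field is a pluggable callable."""
    acquire_image: Callable[[], Any]                   # image acquisition module
    run_ocr: Callable[[Any], List[str]]                # OCR processing module -> X
    encode: Callable[[List[str], List[str]], Any]      # encoder(forward, reversed) -> c
    decode: Callable[[Any], List[str]]                 # decoder(c) -> Y

    def run(self) -> List[str]:
        image = self.acquire_image()
        x = self.run_ocr(image)
        c = self.encode(x, x[::-1])  # forward and reversed sequences
        return self.decode(c)


# Stub modules for illustration only: the "encoder" just pairs the two
# sequences and the "decoder" returns the forward one unchanged.
system = CorrectionSystem(
    acquire_image=lambda: "waybill.png",
    run_ocr=lambda image: list("abc"),
    encode=lambda forward, backward: (forward, backward),
    decode=lambda c: c[0],
)
print(system.run())  # ['a', 'b', 'c']
```

Passing the modules in as callables keeps the sketch independent of any particular OCR engine or neural network library.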
A word sequence error correction device based on OCR and bidirectional LSTM comprises a computer-readable medium storing a computer program, the program being run to perform:
S1, acquiring a text image;
S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
The above description is only a preferred embodiment of the present application and an explanation of the technical principles applied. Those skilled in the art should understand that the scope of invention involved in the present application is not limited to technical solutions formed by the specific combination of the above technical features, and should also cover other technical solutions formed by any combination of the above technical features or their equivalents without departing from the inventive concept, for example, technical solutions formed by replacing the above features with (but not limited to) technical features with similar functions disclosed in the present application.

Claims (7)

1. A word sequence error correction algorithm based on OCR and bidirectional LSTM, applicable to the recognition of text in images, characterized in that it comprises the steps:
S1, acquiring a text image;
S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
2. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the context vector c in step S3 is:
$c = \Phi(\{h_1, h_2, \ldots, h_{TS}\})$;
$h_t = f(x_t, h_{t-1})$.
3. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 1, characterized in that the second sequence set Y in step S4 is:
$Y = (y_0, y_1, \ldots, y_n)$;
$s_t = f(y_{t-1}, s_{t-1}, c)$;
$p(y_t \mid y_{<t}, X) = g(y_{t-1}, s_t, c)$.
4. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the text image in step S1 is an express waybill image.
5. The word sequence error correction algorithm based on OCR and bidirectional LSTM according to claim 2 or 3, characterized in that the threshold used in the OCR preprocessing of step S2 is the minimum confidence threshold allowed by the system.
6. A word sequence error correction system based on OCR and bidirectional LSTM, characterized in that it comprises:
an image acquisition module for acquiring a text image;
an OCR processing module for performing OCR preprocessing on the text image to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
an encoder built from a bidirectional LSTM for encoding the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ to obtain a context vector c;
a decoder built from a bidirectional LSTM for decoding the context vector c to obtain a second sequence set Y.
7. A word sequence error correction device based on OCR and bidirectional LSTM, comprising a computer-readable medium storing a computer program, characterized in that the program is run to perform:
S1, acquiring a text image;
S2, preprocessing the text image by OCR to obtain a first sequence set $X = \{x_0, x_1, \ldots, x_m\}$;
S3, inputting the forward sequence $\{x_0, x_1, \ldots, x_m\}$ and the reversed sequence $\{x_m, x_{m-1}, \ldots, x_0\}$ into an encoder built from a bidirectional LSTM to obtain a context vector c;
S4, decoding the context vector c with a decoder built from a bidirectional LSTM to obtain a second sequence set Y.
CN201710630581.0A 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM Pending CN107463928A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710630581.0A CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710630581.0A CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM

Publications (1)

Publication Number Publication Date
CN107463928A true CN107463928A (en) 2017-12-12

Family

ID=60547822

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710630581.0A Pending CN107463928A (en) 2017-07-28 2017-07-28 Word sequence error correction algorithm, system and its equipment based on OCR and two-way LSTM

Country Status (1)

Country Link
CN (1) CN107463928A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108416349A (en) * 2018-01-30 2018-08-17 顺丰科技有限公司 Identify deviation-rectifying system and method
CN109711412A (en) * 2018-12-27 2019-05-03 信雅达系统工程股份有限公司 A kind of optical character identification error correction method based on dictionary
CN110377591A (en) * 2019-06-12 2019-10-25 北京百度网讯科技有限公司 Training data cleaning method, device, computer equipment and storage medium
CN112507080A (en) * 2020-12-16 2021-03-16 北京信息科技大学 Character recognition and correction method
WO2021164310A1 (en) * 2020-02-21 2021-08-26 华为技术有限公司 Text error correction method and apparatus, and terminal device and computer storage medium
US11842524B2 (en) 2021-04-30 2023-12-12 International Business Machines Corporation Multi-modal learning based intelligent enhancement of post optical character recognition error correction

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150161991A1 (en) * 2013-12-10 2015-06-11 Google Inc. Generating representations of acoustic sequences using projection layers
CN105046289A (en) * 2015-08-07 2015-11-11 北京旷视科技有限公司 Text field type identification method and text field type identification system
CN105512692A (en) * 2015-11-30 2016-04-20 华南理工大学 BLSTM-based online handwritten mathematical expression symbol recognition method
CN105740226A (en) * 2016-01-15 2016-07-06 南京大学 Method for implementing Chinese segmentation by using tree neural network and bilateral neural network
CN105930314A (en) * 2016-04-14 2016-09-07 清华大学 Text summarization generation system and method based on coding-decoding deep neural networks
CN106604125A (en) * 2016-12-29 2017-04-26 北京奇艺世纪科技有限公司 Video subtitle determining method and video subtitle determining device
CN106960206A (en) * 2017-02-08 2017-07-18 北京捷通华声科技股份有限公司 Character identifying method and character recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
商俊蓓: "基于双向长短时记忆递归神经网络的联机手写数字公式字符识别", 《中国优秀硕士学位论文全文数据库信息科技辑》 *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20171212