CN109858037A

CN109858037A - Method and system for producing structured output from OCR recognition results

Info

Publication number: CN109858037A
Application number: CN201910145824.0A
Authority: CN
Inventors: 闫铮; 杨恒杰
Original assignee: Huaqiao University
Current assignee: Huaqiao University
Priority date: 2019-02-27
Filing date: 2019-02-27
Publication date: 2019-06-07

Abstract

The invention discloses a method and system for producing structured output from OCR recognition results. The method comprises: recognizing a captured ID card image using OCR and processing the recognition result to obtain a text sequence; performing named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and outputting the corresponding structured text. By generating a large volume of labeled text and training a named entity recognition model on it, the method and system can quickly and efficiently extract each entity from the OCR recognition result and produce structured output, which greatly facilitates the entry of identity information.

Description

Method and system for producing structured output from OCR recognition results

Technical field

The present invention relates to the field of image text recognition, and in particular to a method and system for producing structured output from OCR recognition results.

Background art

OCR (Optical Character Recognition) technology converts the text in an image into editable character strings. Early OCR mainly handled simple document images; with the development of deep learning, current OCR technology is widely used for text recognition in images of all kinds of complex scenes.

In recent years, as regulations have matured, more and more situations require real-name registration. Thanks to OCR, a user usually only needs to photograph or upload an image containing an ID card; the system recognizes the image using OCR and then enters the recognized result. However, the result of OCR is merely an editable character string and contains no structural information. Entering it generally requires either building a series of rules to pick each item out of the result, or entering it manually. The former is not robust, and no complete set of rules can be established to pick out every item. The latter is inefficient and wastes considerable labor. In addition, items whose text lies on the same horizontal line in the picture may appear out of order in the recognition result.

Summary of the invention

The main object of the present invention is to propose a method and system for producing structured output from OCR recognition results, which can quickly and efficiently extract each named entity from the OCR recognition result and obtain structured output, greatly facilitating the entry of identity information.

The present invention adopts the following technical scheme:

In a first aspect, the present invention provides a method for producing structured output from OCR recognition results, comprising:

recognizing a captured ID card image using OCR, and processing the recognition result to obtain a text sequence;

performing named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and outputting the corresponding structured text.

Preferably, the training method of the named entity model comprises:

A) generating a number of labeled sample text sequences;

Each sample text sequence includes named entities for name, gender, ethnicity, birth, address and citizen ID number, namely the name entity, gender entity, ethnicity entity, birth entity, address entity and citizen ID number entity. Each sample text sequence further includes the text items 'Name', 'Gender', 'Ethnicity', 'Birth', 'Address' and 'Citizen ID number', where 'Name' corresponds to the name entity, 'Gender' to the gender entity, 'Ethnicity' to the ethnicity entity, 'Birth' to the birth entity, 'Address' to the address entity, and 'Citizen ID number' to the citizen ID number entity;

The beginning of each text item and of each named entity is labeled B-entityName, and the remainder of each text item and of each named entity is labeled I-entityName, where entityName is a user-defined character string;

B) training the named entity model using the labeled sample text sequences.

Preferably, the entityName of each text item and the entityName of its corresponding named entity share a common substring.

Preferably, outputting the corresponding structured text specifically comprises:

Each row outputs one text item and its label, or one named entity and its label; each text item is adjacent to its corresponding named entity, with the text item output on the row immediately before its corresponding named entity.

Preferably, the named entity model is a model combining a bidirectional long short-term memory recurrent neural network (Bi-LSTM) with a conditional random field (CRF).

In a second aspect, the present invention provides a system for producing structured output from OCR recognition results, comprising:

an OCR output acquisition module, configured to recognize a captured ID card image using OCR and process the recognition result to obtain a text sequence;

a structured output module, configured to perform named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and to output the corresponding structured text.

Preferably, the training method of the named entity model comprises:

A) generating a number of labeled sample text sequences;

Preferably, outputting the corresponding structured text specifically comprises:

Compared with the prior art, the beneficial effects of the present invention are as follows:

(1) By analyzing ID card information, the method and system of the present invention provide an algorithm for generating labeled text data, which can produce a large volume of labeled text;

(2) By training a named entity recognition model on the large volume of generated labeled text, the method and system of the present invention can quickly and efficiently extract each entity from the OCR recognition result and obtain structured output, greatly facilitating the entry of identity information.

The above is only an overview of the technical scheme of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features and advantages of the invention are easier to understand, specific embodiments of the invention are set out below.

From the following detailed description of specific embodiments of the present invention, taken together with the accompanying drawings, those skilled in the art will better understand the above and other objects, advantages and features of the invention.

Brief description of the drawings

Fig. 1 is a flow chart of the method for producing structured output from OCR recognition results according to an embodiment of the present invention;

Fig. 2 shows a misaligned recognition result output by OCR according to an embodiment of the present invention;

Fig. 3 shows the output of the named entity model according to an embodiment of the present invention;

Fig. 4 shows the structured output according to an embodiment of the present invention;

Fig. 5 is a structural block diagram of the system for producing structured output from OCR recognition results according to an embodiment of the present invention.

Detailed description of the embodiments

To make the objects, technical solutions and advantages of the present invention clearer, embodiments of the invention are described in further detail below with reference to the accompanying drawings.

Referring to Fig. 1, the method of the present invention for producing structured output from OCR recognition results comprises:

In the present embodiment, the training method of the named entity model comprises:

A) generating a number of labeled sample text sequences;

The beginning of each text item and of each named entity is labeled B-entityName, and the remainder of each text item and of each named entity is labeled I-entityName. The labeling unit is the character, i.e. each character has one label. Here entityName is a user-defined character string;

The entityName of each text item and the entityName of its corresponding named entity share a common substring.

For example, the character '姓' that begins the text item '姓名' (Name) is labeled B-name and the character '名' is labeled I-name; the character '李' that begins the corresponding name entity '李明' (Li Ming) is labeled B-e_name and '明' is labeled I-e_name.
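The character-level B/I labeling described above can be sketched as follows. This is a minimal illustration only; the helper name `bio_tags` is hypothetical and does not come from the patent.

```python
def bio_tags(text, entity_name):
    """Label each character of `text`: B-<entity_name> for the first
    character, I-<entity_name> for every following character."""
    return [
        (ch, ("B-" if i == 0 else "I-") + entity_name)
        for i, ch in enumerate(text)
    ]


# The text item and its corresponding entity share the substring "name".
field_tags = bio_tags("姓名", "name")
entity_tags = bio_tags("李明", "e_name")

for ch, tag in field_tags + entity_tags:
    print(ch, tag)
```

Because the text item label (`name`) and the entity label (`e_name`) share a substring, a later step can pair each text item with its entity by label alone.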

Specifically, when training the named entity model, the generation rules for the name entity, gender entity, ethnicity entity, birth entity, address entity and citizen ID number entity are as follows.

Name entity: generated by randomly combining a surname with other characters, labeled B-e_name/I-e_name. In addition, the name length is limited and the lengths follow a set ratio; for example, the name length does not exceed 4, and names of length 2/3/4 occur in the ratio 7:2:1.

Gender entity: the ratio of male to female is 1:1, labeled B-e_gender/I-e_gender.

Ethnicity entity: drawn from the 56 ethnic groups, labeled B-e_nation/I-e_nation. Han and the other ethnic groups occur in a set ratio, such as 8:2.

Birth entity: the range of birth years extends back a number of years from the most recent year in which a person would have reached adulthood, taking for example the current year as 2018; labeled B-e_birth/I-e_birth.

Address entity: generated by randomly combining national province, city and district information, labeled B-e_address/I-e_address.

Citizen ID number entity: the first 14 digits of the ID number are determined by the address and the date of birth, and the last four are a random combination of digits; labeled B-e_idnum/I-e_idnum.

The remaining text items 'Name', 'Gender', 'Ethnicity', 'Birth', 'Address' and 'Citizen ID number' are labeled 'B-name/I-name', 'B-gender/I-gender', 'B-nation/I-nation', 'B-birth/I-birth', 'B-address/I-address' and 'B-idnum/I-idnum' respectively.
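The generation rules above might be sketched as below. This is an illustration under several assumptions that the patent does not specify: the tiny lookup lists, the birth-year range 1950 to 2000, the date text format, and all function names are hypothetical placeholders.

```python
import random

# Hypothetical, deliberately tiny lookup tables (assumptions for illustration).
SURNAMES = ["李", "王", "张", "刘"]
GIVEN_CHARS = ["明", "华", "强", "芳", "伟"]
OTHER_ETHNIC_GROUPS = ["回", "壮", "满", "苗"]  # small subset of the other 55
ADDRESSES = [("110101", "北京市东城区")]  # (6-digit region code, address text)


def bio(text, name):
    """Character-level B-/I- labels for one text item or entity."""
    return [(c, ("B-" if i == 0 else "I-") + name) for i, c in enumerate(text)]


def make_sample(rng):
    """Generate one labeled sample text sequence as (character, label) pairs."""
    # Name: length 2/3/4 in ratio 7:2:1, surname plus random characters.
    length = rng.choices([2, 3, 4], weights=[7, 2, 1])[0]
    name = rng.choice(SURNAMES) + "".join(
        rng.choice(GIVEN_CHARS) for _ in range(length - 1))
    gender = rng.choice(["男", "女"])  # male/female 1:1
    # Han versus other ethnic groups in ratio 8:2.
    nation = "汉" if rng.random() < 0.8 else rng.choice(OTHER_ETHNIC_GROUPS)
    # Assumed birth range; the day is capped at 28 to stay valid in any month.
    year, month, day = rng.randint(1950, 2000), rng.randint(1, 12), rng.randint(1, 28)
    birth = f"{year}年{month}月{day}日"
    code, address = rng.choice(ADDRESSES)
    # First 14 digits from address and birth date, last four random digits.
    idnum = f"{code}{year:04d}{month:02d}{day:02d}" + "".join(
        rng.choice("0123456789") for _ in range(4))
    fields = [("姓名", "name", name), ("性别", "gender", gender),
              ("民族", "nation", nation), ("出生", "birth", birth),
              ("住址", "address", address), ("公民身份号码", "idnum", idnum)]
    tagged = []
    for item_text, tag, value in fields:
        tagged += bio(item_text, tag) + bio(value, "e_" + tag)
    return tagged


sample = make_sample(random.Random(0))
```

Passing a seeded `random.Random` keeps generation reproducible, which is convenient when regenerating a training set.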

In addition, when training the named entity model, the order of the named entities also needs to be exchanged with a certain probability, to increase the diversity of the generated sample data.

Referring to Fig. 2, during ID card recognition, items whose text lies on the same horizontal line may come out misaligned. The text order obtained is in fact derived directly from the positions of the text boxes, so determining each entity from the text boxes is not robust. Samples with exchanged named entity (text) order are therefore added when training the named entity model, which strengthens the model's robustness and solves a problem that rule-based approaches cannot.
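One way to exchange the order with a given probability is sketched below. The patent does not say which elements are exchanged or how the probability is chosen, so swapping two randomly selected groups and the names `swap_groups` and `swap_prob` are assumptions.

```python
import random


def swap_groups(groups, swap_prob, rng):
    """Return a copy of `groups` (one element per text box, for example a
    text item or an entity) in which, with probability `swap_prob`, two
    randomly chosen elements have exchanged positions."""
    groups = list(groups)
    if len(groups) >= 2 and rng.random() < swap_prob:
        i, j = rng.sample(range(len(groups)), 2)
        groups[i], groups[j] = groups[j], groups[i]
    return groups


boxes = ["性别", "男", "民族", "汉"]
augmented = swap_groups(boxes, 0.3, random.Random(42))
```

Since the labels travel with each group, a swapped sample still carries correct per-character labels, only in an order resembling the misaligned OCR output.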

The labeled text data generated according to the above named entity generation rules and order adjustment is shown in Fig. 3.

Further, a number of samples are generated according to the above named entity generation rules and/or named entity order adjustment, and the generated data are then fed into the named entity model for training. The named entity model is specifically a model combining a bidirectional long short-term memory recurrent neural network (Bi-directional Long Short-Term Memory, Bi-LSTM) with a conditional random field (CRF).
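The patent gives no implementation details for the Bi-LSTM and CRF combination. In such models the Bi-LSTM typically produces a per-character score for each label, and the CRF layer adds label-to-label transition scores and selects the best-scoring label sequence. The sketch below shows only that decoding step (Viterbi) in plain Python, with made-up scores; it is not the authors' implementation.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring label index path.

    emissions[t][k]   : score of label k at position t (e.g. from a Bi-LSTM)
    transitions[i][j] : score of moving from label i to label j
    """
    n_tags = len(transitions)
    score = list(emissions[0])
    backptrs = []
    for emit in emissions[1:]:
        new_score, ptrs = [], []
        for j in range(n_tags):
            # Best previous label for arriving at label j.
            best_i = max(range(n_tags), key=lambda i: score[i] + transitions[i][j])
            ptrs.append(best_i)
            new_score.append(score[best_i] + transitions[best_i][j] + emit[j])
        score = new_score
        backptrs.append(ptrs)
    # Trace the best path backwards from the best final label.
    path = [max(range(n_tags), key=lambda j: score[j])]
    for ptrs in reversed(backptrs):
        path.append(ptrs[path[-1]])
    return path[::-1]


TAGS = ["B-e_name", "I-e_name"]
emissions = [[2.0, 1.0], [1.5, 1.0]]     # made-up scores for '李', '明'
transitions = [[-5.0, 0.0], [0.0, 0.0]]  # penalize B-e_name -> B-e_name
best = [TAGS[k] for k in viterbi(emissions, transitions)]
```

In this toy case the emission scores alone would label both characters B-e_name; the transition penalty is what yields B-e_name followed by I-e_name.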

After the named entity model has been trained, the ID card image is first recognized by OCR, and the recognition results are then merged into a single text sequence, as shown in Fig. 2. The text sequence is then fed directly into the trained named entity model for named entity recognition; the output of the named entity model is shown in Fig. 3.

In Fig. 2, the recognition result orders the text according to the positions of the text boxes detected during OCR. In this result the gender entity and the ethnicity entity are misaligned because of the text box detection: the text box for '男' (male) sits lower than the text boxes for 'Gender' and 'Ethnicity 汉 (Han)' on the same row, so the recognized character order is misaligned as well.

In the output of Fig. 3, however, because labels are output, '汉' (Han) is identified as B-e_nation (ethnicity entity) and '男' (male) is identified as B-e_gender (gender entity).

Referring to Fig. 4, structured text can further be output according to the corresponding labels.

Outputting the corresponding structured text specifically comprises:

In Fig. 4, the entity labels are replaced with display text for ease of presentation. The labels are name (Name), e_name (name entity), gender (Gender), e_gender (gender entity), nation (Ethnicity), e_nation (ethnicity entity), birth (Birth), e_birth (birth entity), address (Address), e_address (address entity), idnum (Citizen ID number) and e_idnum (citizen ID number entity).
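A sketch of turning labeled characters into structured rows, with each text item placed on the row before its entity, is given below. The function names and the fixed `FIELD_ORDER` are assumptions; the patent only specifies the row layout and the label names.

```python
FIELD_ORDER = ["name", "gender", "nation", "birth", "address", "idnum"]


def group_spans(tagged):
    """Merge (character, label) pairs into (entityName, text) spans."""
    spans = []
    for ch, tag in tagged:
        prefix, name = tag.split("-", 1)
        if prefix == "B" or not spans or spans[-1][0] != name:
            spans.append([name, ch])   # a new span starts here
        else:
            spans[-1][1] += ch         # continue the current span
    return [(name, text) for name, text in spans]


def structured_rows(tagged):
    """One row per text item or entity; each item precedes its entity."""
    found = dict(group_spans(tagged))  # a later duplicate span overwrites an earlier one
    rows = []
    for field in FIELD_ORDER:
        for key in (field, "e_" + field):
            if key in found:
                rows.append((key, found[key]))
    return rows


# Misaligned OCR order as in Fig. 2: '男' arrives after '民族汉'.
tagged = [("性", "B-gender"), ("别", "I-gender"),
          ("民", "B-nation"), ("族", "I-nation"),
          ("汉", "B-e_nation"), ("男", "B-e_gender")]
rows = structured_rows(tagged)
```

Pairing relies only on the shared substring between a text item label and its entity label (`gender` and `e_gender`), so the misaligned input order does not affect the output.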

Referring to Fig. 5, in a second aspect, the present invention provides a system for producing structured output from OCR recognition results, comprising:

an OCR output acquisition module 501, configured to recognize a captured ID card image using OCR and process the recognition result to obtain a text sequence;

a structured output module 502, configured to perform named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and to output the corresponding structured text.

A) generating a number of labeled sample text sequences;

Outputting the corresponding structured text specifically comprises:

The above is only a specific embodiment of the present invention, but the design concept of the present invention is not limited to it; any insubstantial change made to the present invention using this concept shall be regarded as an act infringing the scope of protection of the present invention.

Claims

1. A method for producing structured output from OCR recognition results, characterized by comprising:

performing named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and outputting the corresponding structured text.

2. The method for producing structured output from OCR recognition results according to claim 1, characterized in that the training method of the named entity model comprises:

A) generating a number of labeled sample text sequences;

Each sample text sequence includes named entities for name, gender, ethnicity, birth, address and citizen ID number, namely the name entity, gender entity, ethnicity entity, birth entity, address entity and citizen ID number entity; each sample text sequence further includes the text items 'Name', 'Gender', 'Ethnicity', 'Birth', 'Address' and 'Citizen ID number', where 'Name' corresponds to the name entity, 'Gender' to the gender entity, 'Ethnicity' to the ethnicity entity, 'Birth' to the birth entity, 'Address' to the address entity, and 'Citizen ID number' to the citizen ID number entity;

The beginning of each text item and of each named entity is labeled B-entityName, and the remainder of each text item and of each named entity is labeled I-entityName, where entityName is a user-defined character string;

3. The method for producing structured output from OCR recognition results according to claim 2, characterized in that the entityName of each text item and the entityName of its corresponding named entity share a common substring.

4. The method for producing structured output from OCR recognition results according to claim 2, characterized in that outputting the corresponding structured text specifically comprises:

Each row outputs one text item and its label, or one named entity and its label; each text item and its corresponding named entity occupy adjacent rows, with the text item output on the row immediately before its corresponding named entity.

5. The method for producing structured output from OCR recognition results according to claim 1, characterized in that the named entity model is a model combining a bidirectional long short-term memory recurrent neural network (Bi-LSTM) with a conditional random field (CRF).

6. A system for producing structured output from OCR recognition results, characterized by comprising:

an OCR output acquisition module, configured to recognize a captured ID card image using OCR and process the recognition result to obtain a text sequence;

a structured output module, configured to perform named entity recognition on the text sequence with a trained named entity model to obtain labeled named entity data, and to output the corresponding structured text.

7. The system for producing structured output from OCR recognition results according to claim 6, characterized in that the training method of the named entity model comprises:

A) generating a number of labeled sample text sequences;

8. The system for producing structured output from OCR recognition results according to claim 7, characterized in that the entityName of each text item and the entityName of its corresponding named entity share a common substring.

9. The system for producing structured output from OCR recognition results according to claim 7, characterized in that outputting the corresponding structured text specifically comprises:

Each row outputs one text item and its label, or one named entity and its label; each text item is adjacent to its corresponding named entity, with the text item output on the row immediately before its corresponding named entity.

10. The system for producing structured output from OCR recognition results according to claim 6, characterized in that the named entity model is a model combining a bidirectional long short-term memory recurrent neural network (Bi-LSTM) with a conditional random field (CRF).