CN116229494A - License key information extraction method based on small sample data
- Publication number: CN116229494A
- Application number: CN202310122860.1A
- Authority: CN (China)
- Legal status: Pending
Classifications
- G06V30/414: Extracting the geometrical structure, e.g. layout tree; block segmentation, e.g. bounding boxes for graphics or text
- G06N3/084: Backpropagation, e.g. using gradient descent
- G06V10/82: Arrangements for image or video recognition or understanding using pattern recognition or machine learning using neural networks
- G06V30/1444: Selective acquisition, locating or processing of specific regions, e.g. highlighted text, fiducial marks or predetermined fields
- G06V30/19173: Classification techniques
- G06V30/412: Layout analysis of documents structured with printed lines or input boxes, e.g. business forms or tables
Abstract
The invention provides a license key information extraction method based on small-sample data. An OCR model first obtains the text and its positions in a license picture. The text, position, picture, and other features are then fused and input into an improved BERT model, which learns the interactions among the features and outputs a label for each character. Adjacent characters with the same label are merged into a labeled field. The field positions and field labels are then input into a DenseCRF model, which corrects the field labels against one another according to the relative positions, angles, and other relations between fields; because the types and positions of the fields in a license are relatively fixed, the DenseCRF achieves a good correction effect. Through language-model pre-training, multi-task training, artificial generation of training samples, and position-based correction, accurate extraction of license key information is realized under small-sample conditions: each license type needs only a dozen or so manually annotated images to train the model.
Description
Technical Field
The invention relates to a method for extracting key information from text recognized by optical character recognition (OCR) of license pictures, in particular to a key-information-extraction model training and prediction method based on small-sample data, and belongs to the field of multi-modal information processing.
Background
In recent years, informatization and intelligence in government affairs and other fields have become a development trend. The goal of intelligence in the government affairs field is to reduce the manpower that users and government service personnel spend on processing information and to realize automatic processing by the system. Extracting key information from license text recognized with Optical Character Recognition (OCR) converts license pictures into structured text data that is easy to process, which plays an important role in raising the intelligence level of government affairs systems. Obtaining text information from a picture is generally divided into three steps: text detection, text recognition, and key information extraction. Text detection and text recognition are generally regarded as OCR tasks. The key-information-extraction task extracts the text information the user cares about from the OCR output of a license; it has been studied comparatively little and faces problems such as picture rotation, scaling, missing key characters in recognition, printing misalignment, and background interference.
Methods related to key information extraction include rule-based, template-based, and learning-based methods. Rule-based methods extract key information from the recognized text and text positions according to preset rules. Template-based methods map pictures of licenses, forms, and the like onto a standard template and extract information according to positions on the template. Both kinds of methods are elaborate to build and require much manpower to construct the mapping rules; when the picture is rotated or scaled, anchor points must also be selected to align the license picture with the template picture. Neither method is very robust, and the extraction result is poor when recognition or mapping is not accurate enough. Learning-based methods use machine learning and deep learning to label each text box or each character. This approach works well but requires a large amount of training data, whereas in the government affairs field license data are scarce and involve sensitive personal and enterprise information.
Disclosure of Invention
The technical problem the invention aims to solve is as follows: the process of recognizing characters from license pictures with OCR technology and extracting key information from them currently faces many challenges. Traditional rule- and template-based key-information-extraction methods require manually designed matching rules; the process is complex and not robust, and it is difficult to handle cases where a field name is not recognized or a printed field is offset in the picture. Learning-based methods can autonomously learn discriminative features from samples, but they require a large number of samples.
In order to solve the above technical problems, the invention provides a license key information extraction method based on small-sample data, characterized by comprising the following steps:

Step 1, acquiring a license picture;

Step 2, recognizing each text box and all text content in the license picture with an OCR recognition algorithm to obtain the OCR recognition result $\{(v_j, l_j)\}$, where $v_j$ is the text content of the j-th text box, expressed as a character sequence, and $l_j = [(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)]$ gives the four corner coordinates of the j-th text box;
Step 3, labeling the text information of the OCR recognition result with an improved BERT model, the improved BERT model comprising an embedding layer, a representation learning layer, and a task layer, comprising the following steps:

inputting the character sequence $\{(t_i, l_i)\}$ and the text fragment sequence $\{(v_i, l_i)\}$ into the embedding layer to obtain the embedded representation $e_j^{(i)}$ of each character and the embedded representation $e_i$ of each text fragment, then summing all embedded representations as the input of the Transformer layer;

for the j-th character $t_j^{(i)}$ of text fragment $v_i$, the final embedded representation $e_j^{(i)}$ is:

$$e_j^{(i)} = T\big(t_j^{(i)}\big) + L\big(l_j^{(i)}\big) + S(s_i) + P(j)$$

where $T(t_j^{(i)})$ is the Token information, i.e. the embedded representation of character $t_j^{(i)}$; $L(l_j^{(i)})$ is the embedded representation of the layout position information; $S(s_i)$ is the embedded representation of the fragment information, i.e. of the number of the text fragment in which each Token lies; and $P(j)$ is the embedded representation of the position information, i.e. of the absolute position in the input sequence;

the final embedded representation $e_i$ of each text fragment is:

$$e_i = V(v_i) + P(i) + S(s_i) + L(l_i)$$

where $V(v_i)$ is the Token information, an embedded representation of the image features extracted from the text-line picture corresponding to character sequence $v_i$;

the Transformer layer learns the interaction information among the input features and generates a feature representation for each input;

the task layer completes various tasks according to the representation information learned by the Transformer layer, the tasks being divided into pre-training tasks and a target task, wherein the pre-training tasks comprise a text classification task, a Token prediction task, and a fragment word-count prediction task, and the target task is to output a label for each character marking which field the current character belongs to, a fully connected layer being added on top of the embedded representations output by the Transformer layer to classify each character;
Step 4, dividing the text fragments recognized by the OCR algorithm into fields based on the text labeling result of the improved BERT model: for each character, the label with the highest predicted probability is selected, and closely adjacent characters with consistent labels form a character segment, i.e. a field;

Step 5, correcting the labeling result of each field with a DenseCRF model, and after DenseCRF correction, integrating fields with the same label into key-value pairs as the final key-information-extraction result, comprising the following steps:

Step 501, constructing the DenseCRF model and defining its energy function:

$$E(x) = \sum_i \psi_u(x_i) + \sum_i \sum_{j \ne i} \psi_p(x_i, x_j)$$

where $x_i$ denotes the i-th field; $\psi_u(x_i)$ is the unary energy function, $\psi_u(x_i) = -\log(p(x_i))$, $p(x_i)$ being the probability distribution of field $x_i$ directly output by the improved BERT model; and $\psi_p(x_i, x_j)$ is the binary energy function produced by another field $x_j$ acting on field $x_i$, defined as:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j)$$

where $\mu(x_i, x_j) = 1 - p(x_i \mid x_j)$, $p(x_i \mid x_j)$ being the conditional probability that field $x_i$ has label $l$ given that field $x_j$ has label $l'$, written $p(x_i = l \mid x_j = l')$, with:

$$p(x_i = l \mid x_j = l') = R(\theta(x_i, x_j), d(x_i, x_j), l', l)$$

where $R(\theta(x_i, x_j), d(x_i, x_j), l', l)$ is a conditional probability tensor, $\theta$ and $d$ being respectively the angle and distance between the two fields, the distance between any two fields being measured with the character height as the basic distance unit; abbreviating the tensor as $R(\theta, d, l', l)$, it is updated with a data-smoothing method, $R(\theta+\delta_\theta, d+\delta_d, l', l) \leftarrow R(\theta+\delta_\theta, d+\delta_d, l', l) + \tau(\delta_\theta, \delta_d)$, where $\delta_\theta$ and $\delta_d$ are small offsets of angle and distance and $\tau(\delta_\theta, \delta_d)$, with $0 < \tau(\delta_\theta, \delta_d) < 1$, is the smoothing value added when distance and angle are shifted by the offsets; the conditional probability tensor is then normalized so that $\sum_l R(\theta, d, l', l) = 1$;

$k^{(m)}(f_i, f_j)$ is a kernel function measuring the magnitude of the interaction between field $x_j$ and field $x_i$;

$w^{(m)}$ is a weight used to fuse the results of multiple kernel functions;

$K$ is the number of kernel functions;

Step 502, solving the DenseCRF model with the mean-field approximation method, obtaining the optimal labeling result by minimizing the energy value: the mean-field approximation computes a distribution $Q(X)$ minimizing the KL divergence $D(Q \| P)$ between $Q(X)$ and the joint distribution $P(X)$ of the labeling result of all labels, $Q(X)$ being updated as:

$$Q_i(x_i = l) = \frac{1}{Z_i} \exp\Big(-\psi_u(x_i = l) - \sum_{l'} \sum_{j \ne i} \mu(l, l') \sum_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j)\, Q_j(x_j = l')\Big)$$

where $Q_i(x_i = l)$ is the probability that the i-th field is labeled $l$, and $Z_i$ is the normalization of the probability values;
and 6, carrying out information extraction post-processing on the key information extraction result according to the preset data type of each field.
Preferably, in step 1, after the license picture is acquired, the background outside the license is removed with a cropping function.

Preferably, in step 1, the acquired license picture is compressed and then uploaded to a server, which processes the uploaded picture with the subsequent steps.
Preferably, in step 2, the license picture is preprocessed to improve the accuracy of character recognition before OCR recognition is performed on it.

Preferably, in step 3, before labeling the text information of the OCR recognition result with the improved BERT model, the OCR result is preprocessed; the preprocessing includes calculating the position of each character, cropping the text-box area images, removing special characters, and removing abnormal text boxes and text.
Preferably, in step 3, the embedded representation $L(l_i)$ of the layout position information is calculated as:

$$L(l_i) = \mathrm{Concat}\big(P_x(x_1, x_3),\; P_y(y_1, y_3)\big)$$

where $P_x(x_1, x_3)$ and $P_y(y_1, y_3)$ are the embedded representations in the horizontal and vertical directions respectively, and $\mathrm{Concat}(\cdot)$ denotes the concatenation operation.
Preferably, in step 3, the Transformer layer calculates a representation vector for each input with a self-attention mechanism, where self-attention is computed as:

$$h_i^{(k+1)} = \sum_j \alpha_{ij}\, h_j^{(k)} W^V, \qquad \alpha_{ij} = \mathrm{softmax}_j\!\left(\frac{\big(h_i^{(k)} W^Q\big)\big(h_j^{(k)} W^K\big)^{\top}}{\sqrt{d_k}}\right)$$

where $h_i^{(k)}$ and $h_j^{(k)}$ are the representation vectors of the i-th and j-th inputs at layer $k$, which at the first layer are the final embedded representations output by the embedding layer; $\alpha_{ij}$ is the attention weight, the inner product of query $h_i W^Q$ and key $h_j W^K$, obtained by multiplying the representation vectors $h_i$ and $h_j$ by the transformation matrices $W^Q$ and $W^K$; $d_k$ is the dimension of the vector $h_i W^Q$; $\mathrm{softmax}$ is the normalization of the weights; and $W^V$ is the mapping matrix.
Preferably, in step 3, in the target task, characters belonging to unimportant fields in the license, or characters misrecognized by the OCR recognition algorithm, are labeled "Other" and ignored.
Preferably, the improved BERT model is trained by:
first, training a BERT model on a large-scale unlabeled corpus; after BERT pre-training is completed, copying its parameters into the improved BERT model as initial values of the improved model's parameters; then, using several real license pictures, recognizing the characters and position information in each license with an OCR recognition algorithm and annotating every recognized character manually; and finally, generating many training samples from the manually annotated samples with a generation tool, these training samples being used to train the improved BERT model.
Preferably, when generating training samples, the manually annotated samples are augmented with a data-enhancement procedure comprising the following steps:
step 301, reading a manually marked sample;
step 302, generating a new field sample for each field in the sample obtained in step 301;
Step 303, randomly replacing, deleting, and truncating characters in the field samples to add noise;

Step 304, according to the position information of the manually annotated sample obtained in step 301, filling the replacement field text generated in step 303 into a blank license background picture to form an artificially constructed license picture;

Step 305, pasting the license picture generated in step 304 onto a background picture;

Step 306, applying image warping and perspective transformation to the picture obtained in step 305 to finally obtain a new sample.
Aiming at the situations where traditional methods need manually configured rules, lack robustness, or require large numbers of training samples, the invention provides a license key-information-extraction method based on small-sample data. The invention uses an OCR model to obtain the text and positions in a picture; fuses the text, position, picture, and other features and inputs them into an improved BERT model, which learns the interactions among the features and outputs a label for each character; merges adjacent characters with the same label into a labeled field; and then inputs the field positions and field labels into the DenseCRF model, which corrects the field labels against one another according to the relative positions, angles, and other relations between fields. Because the types and positions of the fields in a license are relatively fixed, the DenseCRF achieves a good correction effect. Through language-model pre-training, multi-task training, artificial generation of training samples, and position-based correction, accurate extraction of license key information is realized under small-sample conditions: each license type needs only a dozen or so manually annotated images to train the model.
Compared with the prior art, the technical scheme disclosed by the invention has the following beneficial effects:
(1) The requirement on the number of samples is reduced. The invention exploits the language knowledge that the BERT model learns from massive text data and, combined with the DenseCRF model, the relative positional relationships of the fields in a license, so that accurate key-information extraction can be achieved by annotating only about 20 picture samples per license type. This greatly reduces the required number of training samples and the development difficulty.

(2) The workload of developers is reduced and development is accelerated. The method needs only a very small number of training samples, reducing the developer's annotation workload. It also requires no template files and little expert knowledge for writing rules or license-alignment methods, which speeds up the development of license key-information-extraction applications.

(3) User privacy is protected and the risk of data leakage is reduced. The invention needs only a few training samples and can expand the sample count with generated samples. Since few real pictures are required, personal privacy is protected and the risk of sensitive-data leakage is reduced.

(4) The accuracy of key-information extraction is improved. The improved BERT model and DenseCRF model, the pre-training technique, and the artificial training-sample generation technique give the model strong extraction capability and robustness, handling unrecognized individual fields, license background interference, license rotation, scaling, and similar problems.
Drawings
Fig. 1 is a system processing flow diagram.
Fig. 2 illustrates the improved BERT model.

FIG. 3 illustrates the improved BERT model training and prediction flow. Dashed lines mark the training process and solid lines the prediction process. The BERT labeling model involves three stages: 1) pre-training the language model on large-scale corpus data; 2) fine-tuning the model with license pictures, OCR results, and manually annotated data; 3) using the model online to label OCR recognition results.

Fig. 4 is a schematic diagram of fields and of the distance and angle calculation between fields. The photograph contains some background, and the document itself is rotated by some angle. The dashed boxes mark the text content that is detected and recognized. Some characters in the license are not recognized, e.g. the two characters of "Name"; and several fields may be detected as a single text box, e.g. "male ethnicity". The schematic also shows how the DenseCRF model computes conditional probabilities from positions: the distance and angle of "Sun Wukong" relative to "Gender" are calculated, both referenced to the upper-left corner of the text box.

Fig. 5 illustrates the DenseCRF model training and labeling-result correction flow.

Fig. 6A and 6B illustrate an example of license text key-information extraction (the picture is artificial and its text randomly generated, used only for testing). The left image in Fig. 6B is the extraction result using only the improved BERT model: most fields are extracted correctly, but the "relationship with householder" field is extracted wrongly because of interference from the name. The right image in Fig. 6B is the result after correction by the DenseCRF model: both the name and the relationship with the householder are extracted correctly. Characters misrecognized by OCR are labeled "Other" by the model, which shows good robustness.
Detailed Description
As shown in Fig. 1, the license key-information-extraction method based on small-sample data disclosed in this embodiment comprises the following steps:
and step 1, obtaining license pictures acquired from equipment such as a mobile phone, a scanner and the like through a license picture acquisition module. In this embodiment, after obtaining the license image, the license image obtaining module provides a photo clipping function for the user to remove the background outside the license in the license image as much as possible. The license picture obtaining module compresses the preprocessed license picture from two aspects of pixel number and picture definition, and uploads the image to the server after the compression is completed.
Step 2: after the server receives the license picture, an image preprocessing module preprocesses it before OCR to improve the accuracy of character recognition. In this embodiment, preprocessing includes adjusting the angle, contrast, saturation, brightness, and pixel size of the picture.

Step 3: an OCR recognition algorithm recognizes the text boxes in the license picture and the characters they contain, obtaining text-box area images and character recognition results; the character recognition results are combined with the corresponding text-box positions to obtain the text information.
The OCR recognition algorithm is not the focus of the invention; OCR has been studied extensively, so the invention does not elaborate the OCR recognition process and only introduces the format of the OCR result. The currently popular OCR algorithms are two-stage recognition methods. In the first stage, a text detection model detects the positions of text fragments in the picture and represents each text line as a text box. A text box is represented by the coordinates of four points arranged clockwise starting from the upper-left corner, i.e. $l = [(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)]$. The second stage cuts the text-line pictures out of the original image according to the detected text boxes and then recognizes the content of each text fragment, e.g. with an LSTM+CTC method; the text content of one text box can be expressed as a character sequence $v = \{t_i\}$, where $t_i$ is the i-th character of $v$. Thus the OCR result for a picture comprises two parts, text boxes and recognized text, i.e. the list $\{(v_j, l_j)\}$, where $l_j$ is the j-th text box obtained by OCR and $v_j$ is the character sequence corresponding to the j-th text box.
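To make this format concrete, here is a minimal Python sketch of the OCR output structure; the type and field names are illustrative, not from the patent:

```python
# Minimal data structures for the OCR output {(v_j, l_j)} described above.
from dataclasses import dataclass
from typing import List, Tuple

Point = Tuple[float, float]

@dataclass
class TextBox:
    text: str                 # v_j: recognized character sequence
    corners: List[Point]      # l_j: four corners, clockwise from top-left

# One hypothetical recognition result: the fragments "Name" and "Sun Wukong".
ocr_result: List[TextBox] = [
    TextBox("Name",       [(120, 80), (200, 80), (200, 110), (120, 110)]),
    TextBox("Sun Wukong", [(230, 80), (350, 80), (350, 112), (230, 112)]),
]
```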
The OCR result may still have problems: some text may not be detected, or may be recognized incorrectly. In license character recognition, missing or misrecognized field names are especially common; in identity-card recognition, for example, the field names are close to the background color and subject to background interference, so they are very easily misrecognized. This poses a serious challenge for license key-information extraction, and the invention therefore further processes the OCR result through the following steps.

Step 4: the recognized text information is preprocessed before information extraction. Preprocessing includes calculating the position of each character, cropping the text-box area images, removing special characters, and removing abnormal text boxes and text.

Step 5: the text information preprocessed in step 4 is labeled with the improved BERT model. The inputs of the improved BERT model comprise two sequences: the text sequence and the text-fragment sequence. The text sequence is the character sequence recognized by the OCR algorithm, with the characters arranged top-to-bottom and left-to-right according to their text-box area images; the text-fragment sequence is the sequence of recognized text boxes. The two sequences are separated by [SEP]. Each sequence carries text information, picture position information, the OCR fragment it belongs to, and sequence position information. The output of the key-information-extraction model is a label for each character, and the text key information is extracted according to this labeling result.

Humans extract the required text based on both its content and its location. The invention therefore designs an improved BERT model that fuses text content, position, and other information. First, BERT is currently among the best language models; its parameters are trained on massive corpora, so it generalizes well, and the invention can exploit the learned language-model features, i.e. common-sense knowledge of human natural language, to extract information from the angle of text content. However, the input of the original BERT model contains only text features and no positional information such as OCR produces, so the BERT model must be improved to fit the OCR key-information-extraction scenario.
The improved BERT model architecture proposed by the invention is shown in Fig. 2. The model comprises the OCR-result input, an embedding layer, a representation learning layer, and a task layer.

Neural network models are very flexible: arbitrary inputs can be fused together and fed into the model. At the embedding layer of the key-information-extraction model, the invention fuses text content information, layout position information, fragment information, relative position information, and text-line image information.

Given a license picture, it is input into the OCR recognition algorithm to obtain the recognition result $\{(v_j, l_j)\}$, comprising the recognized text content and the text-box positions. The approximate position of each character is calculated from the position of its text box and the number of characters in the box. The text content, the text-fragment positions, and the per-character positions are input into the key-information-extraction model as two sequences: the character sequence $\{(t_i, l_i)\}$ and the text-fragment sequence $\{(v_i, l_i)\}$, where $l_i = (x_1, y_1, x_3, y_3)$ gives the upper-left and lower-right coordinates of the text box.

In Fig. 2, the text information appears on the left at character granularity and the text-fragment information on the right at fragment granularity. For example, if the OCR recognition results are the two text fragments "Name" (姓名) and "Sun Wukong" (孙悟空) (Fig. 4), the left input is the five characters 姓, 名, 孙, 悟, 空 and the right input is the two fragments "Name" and "Sun Wukong". The two inputs are separated by the separator [SEP], and the input begins with a [CLS] symbol whose representation learns the entire input. All inputs are padded to a fixed maximum length with [PAD] symbols. The input information then comprises several types; from bottom to top in the figure, these are Token information, layout position information, fragment information, and position information.
For the character sequence on the left, the Token information is the embedded representation of the characters, one per character (for a text fragment, the average of the embedded representations of its characters may be used). The layout information contains the horizontal and vertical positions: embedded representations of the horizontal and vertical pixel coordinates are obtained and concatenated:

$$L(l_i) = \mathrm{Concat}\big(P_x(x_1, x_3),\; P_y(y_1, y_3)\big)$$

The fragment information $S(s_i)$ is the embedded representation of the number of the fragment containing each Token; the character fragments recognized by OCR are numbered in top-to-bottom, left-to-right order. The last input is the absolute position information in the input sequence, numbered from 0 to N, where N is the maximum length of the input sequence. All these embedded representations are then added to obtain the final embedded representation $e_i$ of each character:

$$e_i = T(t_i) + L(l_i) + S(s_i) + P(i)$$
for the right text segment information part, token information is based on the corresponding picture of the segment, and the image feature V (V) is extracted by using VGG and other models i ) As Token information. To reduce the complexity of the model, an average of the embedded representations of all the words contained in the segment may also be used. The layout information, the clip information, and the position information are the same as the text sequence. It is noted that [ CLS ] ]、[SEP]、[PAD]The position information of the special characters is set to 0. For the text segment part, the final embedded representation e i The following formula is shown:
e i =V(v i )+P(i)+S(s i )+L(l i )
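A minimal PyTorch sketch of this embedding fusion follows. The vocabulary size, hidden dimension, and coordinate-bucket sizes are assumptions, and since the text does not specify how $P_x$ combines $x_1$ and $x_3$, the sketch simply sums the two coordinate embeddings before concatenation:

```python
# Sketch of e_i = T(t_i) + L(l_i) + S(s_i) + P(i); shapes are assumptions.
import torch
import torch.nn as nn

class FusedEmbedding(nn.Module):
    def __init__(self, vocab=21128, dim=768, max_len=512, max_seg=64, max_coord=1024):
        super().__init__()
        self.tok = nn.Embedding(vocab, dim)             # T: character token
        self.pos = nn.Embedding(max_len, dim)           # P: absolute position
        self.seg = nn.Embedding(max_seg, dim)           # S: OCR fragment number
        self.x_emb = nn.Embedding(max_coord, dim // 2)  # P_x over x1, x3
        self.y_emb = nn.Embedding(max_coord, dim // 2)  # P_y over y1, y3

    def forward(self, tokens, positions, segments, boxes):
        # boxes: (batch, seq, 4) long tensor of (x1, y1, x3, y3) per token,
        # coordinates assumed pre-bucketed into [0, max_coord).
        x1, y1, x3, y3 = boxes.unbind(-1)
        layout = torch.cat([self.x_emb(x1) + self.x_emb(x3),
                            self.y_emb(y1) + self.y_emb(y3)],
                           dim=-1)                      # L = Concat(P_x, P_y)
        return self.tok(tokens) + self.pos(positions) + self.seg(segments) + layout
```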
all embedded representations add up as input to the transfomer layer of the modified BERT model. The transducer layer learns the interaction information between the various input features, generating a feature representation for each input. The transducer layer uses a self-attention mechanism to calculate a representation vector for each input. The self-attention calculation formula is as follows:
wherein:representing the representation vector corresponding to the ith input at the kth layer, wherein the representation vector is the embedded representation e at the first layer i The method comprises the steps of carrying out a first treatment on the surface of the Weight alpha of attention ij Is obtained by calculating the inner product of Query and key, and the Query and key are respectively represented by a vector h i 、h j Multiplying by a conversion matrix W Q And W is K Obtaining; d, d k Represents h i W Q The dimension of the multiplied vector;Is the normalization operation of the weights; w (W) V Representing the mapping matrix. The self-attention mechanism may also use multi-headed attention. The calculation method has a plurality of multi-head calculated expression vectors in each layer, and the multi-head calculated expression vectors are spliced and used as the output of the layer so as to enhance the learning capacity of the model. Finally, for each input, the key information extraction model gets a representation output +. >
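A single-head version of this self-attention computation might look as follows (a sketch; multi-head attention, residual connections, and masking are omitted):

```python
# Single-head self-attention matching the formula above.
import math
import torch
import torch.nn as nn

class SelfAttention(nn.Module):
    def __init__(self, dim=768, d_k=64):
        super().__init__()
        self.W_Q = nn.Linear(dim, d_k, bias=False)   # query projection
        self.W_K = nn.Linear(dim, d_k, bias=False)   # key projection
        self.W_V = nn.Linear(dim, dim, bias=False)   # value / mapping matrix
        self.d_k = d_k

    def forward(self, h):                            # h: (batch, seq, dim)
        q, k, v = self.W_Q(h), self.W_K(h), self.W_V(h)
        alpha = torch.softmax(q @ k.transpose(-2, -1) / math.sqrt(self.d_k), dim=-1)
        return alpha @ v                             # h^{(k+1)}
```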
The task layer completes the various tasks according to the representation information learned by the Transformer layer. The tasks are divided into pre-training tasks and the target task.

The pre-training tasks are auxiliary tasks that use semi-supervised/unsupervised information to improve training; they include a text classification task, a Token prediction task, and a fragment word-count prediction task. 1) The text classification task: given an input sample, judge whether the current OCR text belongs to the current license type. 2) The Token prediction task: mask part of the input characters and predict what they are. 3) The fragment word-count prediction task: predict the number of characters contained in each fragment.

The target task outputs a label for each character, marking which field the character belongs to. On top of the representations output by the Transformer layer, a fully connected layer classifies each Token. For example, for the input characters 姓, 名, 孙, 悟, 空, the labels {Name_0, Name_0, Name_1, Name_1, Name_1} indicate the name field, with the suffix 0 marking a field name and 1 a field value. Even when OCR fails to recognize the key of a field (here "Name"), the BERT model can still tell from the language model and the position information that "Sun Wukong" is a person's name, which greatly improves the accuracy of text key-information extraction. Unimportant fields in the license and characters misrecognized by OCR are labeled "Other" and ignored, which also improves the robustness of the model. During training, the model parameters are updated with a cross-entropy loss function and gradient descent.
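As an illustration of this target-task head, the following sketch adds one fully connected layer and a cross-entropy loss over the Transformer outputs; the label count and hidden size are assumptions:

```python
# Sketch of the per-character classification head and its loss.
import torch.nn as nn

NUM_LABELS = 21                                    # e.g. Name_0, Name_1, ..., Other
head = nn.Linear(768, NUM_LABELS)
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)   # -100 masks [PAD]/[CLS]/[SEP]

def token_classification_loss(transformer_out, labels):
    logits = head(transformer_out)                 # (batch, seq, NUM_LABELS)
    return loss_fn(logits.reshape(-1, NUM_LABELS), labels.reshape(-1))
```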
The improved BERT model training and prediction flow is shown in Fig. 3, where dashed arrows mark the training flow and solid arrows the prediction flow. To extract key information from small-sample data, the invention first trains a BERT model on a large-scale unlabeled corpus. After pre-training, its parameters are copied into the improved BERT model disclosed by the invention as initial parameter values. Then several real license pictures are used: the characters and position information in each license are recognized by OCR, and each recognized character is annotated manually. A generation tool then produces many samples from the manually annotated ones, and these training samples train the improved BERT model. In the prediction stage, the text content and positions of the license picture are recognized by OCR; preprocessing performs text error correction, special-character removal, and so on; the preprocessed result is labeled by the BERT model; and the labeled result is input into the DenseCRF model, which corrects the labels of individual fields to improve accuracy.
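The parameter-copying step of this flow could be sketched as follows; the helper is hypothetical and assumes the improved model reuses BERT's parameter names for the shared layers:

```python
# Sketch of initializing the improved model from a pre-trained BERT checkpoint.
import torch

def init_from_pretrained(modified_model, pretrained_state_dict):
    """Copy every pre-trained parameter whose name and shape match into the
    modified model; newly added layers keep their random initialization."""
    own_state = modified_model.state_dict()
    copied = []
    for name, tensor in pretrained_state_dict.items():
        if name in own_state and own_state[name].shape == tensor.shape:
            own_state[name].copy_(tensor)
            copied.append(name)
    modified_model.load_state_dict(own_state)
    return copied   # useful for checking which layers were initialized
```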
Because license photos are sensitive, the number that can be obtained is very limited. In addition, after OCR recognition each character of every license must be annotated, which consumes much manpower and material. A training-sample data-enhancement method is therefore needed that expands the training data on the basis of the limited annotated samples. The invention designs a manual-annotation data-enhancement procedure with the following flow:
Step 501: read a manually annotated sample, mainly comprising the characters recognized by OCR and their position information.

Step 502: field generation. For each field, generate a new field sample using candidate libraries, rules, and the like. For example, for a name field, collect libraries of family names and given names and randomly combine one of each into a new name; randomly select a company name from a company-name library as a unit-name field; randomly combine province, city, and neighborhood information to construct address information; a license-code field can be generated by a rule, e.g. an n-digit code containing only digits.

Step 503: randomly replace, delete, and truncate characters in the fields. Real OCR text may contain recognition errors, so the generated field characters are perturbed by random insertion, deletion, replacement, and truncation to add noise, which helps improve the robustness of the model.

Step 504: add the license background. According to the position information of the annotated sample, fill the generated field text by replacement into a blank license background picture, forming an artificially constructed license picture.

Step 505: add a photo background. The generated artificial license pictures are fairly regular, but users usually photograph licenses with a mobile phone, so the pictures contain some background; the generated artificial license picture is therefore pasted onto background pictures.

Step 506: image warping and transformation. The user may photograph at a skew, and the license itself may have fold marks, rubbing marks, and the like; the picture generated in step 505 is therefore warped and perspective-transformed.
It should be noted that if picture features are not used in the BERT model, steps 504 and 505 may use a solid-color background instead of a real license background or photo background when generating the artificial samples.
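As an illustration of the noise injection in step 503, a sketch follows; the character set and perturbation rates are assumptions:

```python
# Sketch of random insert/delete/replace/truncate noise on a generated field.
import random

def perturb_field(text, charset="0123456789X", rate=0.1):
    out = []
    for c in text:
        r = random.random()
        if r < rate / 3:
            continue                                 # random deletion
        if r < 2 * rate / 3:
            out.append(random.choice(charset))       # random replacement
            continue
        out.append(c)
        if r < rate:
            out.append(random.choice(charset))       # random insertion
    if random.random() < rate and len(out) > 1:      # random truncation
        out = out[: len(out) - random.randint(1, 2)] or out[:1]
    return "".join(out)
```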
Step 6: based on the labeling result of step 5, divide the character fragments recognized by OCR into fields, and correct the labeling result of each field with a DenseCRF model. The DenseCRF model calculates the probability distribution of the field labels mainly from features such as the distance and angle between fields, and corrects wrongly labeled fields. After DenseCRF correction, fields with the same label are integrated into key-value pairs as the final key-information-extraction result.

The improved BERT model structures the characters recognized by license OCR. In practice, however, there may be several similar fields; a household register, for example, contains several pieces of address information, all addresses and not far apart in the photo, and the improved BERT model may label them all wrongly as birth addresses. Moreover, because the samples are very limited, situations absent from training may occur online. In a household register, one field is the relationship with the householder; the training samples may contain husband/wife, father/son, father/daughter, and so on, but in practice a value such as "Zhang San's son" may occur, where Zhang San is the householder. Having never seen a similar sample, the improved BERT model might label "Zhang San" as a member name and "son" as the relationship with the householder. A human, however, would not make this error, because the field "Zhang San's son" sits right after the label "relationship with householder".

Inspired by image-segmentation research, the invention introduces a conditional random field model that adds positional constraints on top of the BERT labeling to improve labeling accuracy. The basic idea is that most of the BERT labels are correct, so the results can correct one another according to the relative positional relationships (angle and distance) between fields. The idea is mainly inspired by the DenseCRF model in semantic segmentation, where segmentation results are corrected according to the labels of adjacent and similar pixels.
Using a CRF to correct the text-structuring results still requires solving several problems:

1) Scaling, translation, and rotation of the license in the picture. Every license photo has a different shooting angle and distance, so the pixel size of the whole picture, the text size, and the text angle differ. If the CRF model were built directly on the pixel positions of the text, the required sample size would be very large, contradicting the small-sample goal of the invention's structuring model.

2) It is unknown which labels in the BERT result are correct and which are wrong. BERT outputs a label for each character but cannot confirm which are mislabeled; the CRF model must correct them.

3) The granularity of the CRF model: characters, OCR fragments, or BERT-labeled fields. Building the CRF model requires computing relative positions, and any of the three can serve as the granularity; they differ in computational complexity and correction accuracy, so the trade-off must be analyzed.
For this purpose, the invention adopts the following solutions:

1) Measure distance in units of character height. The character height is computed from the OCR result, and the distance between any two fields is measured with the character height as the basic distance unit.

2) Iterative optimization following DenseCRF. Since most of the BERT labels are correct, the labeling result of each field can be updated iteratively, raising the overall labeling accuracy.

3) Choose the BERT-labeled field as the unit, computing distances from the upper-left corner of the field box: each field in a license is usually left-aligned, so the position of the upper-left corner is relatively fixed, while the number of characters contained in a field is indeterminate.
The method specifically comprises the following steps:
constructing a DensecRF model in literal units has two drawbacks: 1) Not accurate enough; 2) Too much training data is required. Therefore, the invention takes the field marked by BERT as a basic unit to construct a DenseRF model. The field is to select the label with the highest probability according to the predicted label probability, and the text fragments composed of the characters with consistent labels and close adjacent characters are defined as the field. In the OCR recognition result of the identification card recognition, a plurality of text fragments are detected. For example, in fig. 4, the "male ethnicity" detected model in the text segment is put in the same text segment, but according to the labeling result of the improved BERT model, "ethnicity" is labeled "Nation_0" and "male" is labeled "Gender_1", the text segment of this OCR needs to be split into two fields as input of the DenseRF model.
Based on the positions between fields, the conditional probability distribution of one field's label can be calculated under the assumption that another field's label is correct. Taking Fig. 4 as an example, given two fields $x_i$ and $x_j$, we calculate the probability that $x_i$ has label $l$ (e.g. Name_1) given that $x_j$'s label $l'$ (e.g. Gender_0) is known, i.e. $p(x_i = l \mid x_j = l')$. (The symbols here are not tied to the BERT model.) As discussed above, the dependency between labels is related to the distance and angle between the fields, so the conditional probability is defined as:

$$p(x_i = l \mid x_j = l') = R(\theta(x_i, x_j), d(x_i, x_j), l', l)$$

where $R$ is a probability tensor; $\theta$ and $d$ are respectively the angle and distance between the two fields; $R(\theta(x_i, x_j), d(x_i, x_j), l', l)$ is abbreviated $R(\theta, d, l', l)$; and $\sum_l R(\theta, d, l', l) = 1$. The set of license field types contains an extra label "Other".
If the number of samples were sufficiently large, the statistically estimated probabilities could approach the true values arbitrarily closely. Considering, however, that the invention's scenario is small-sample data, a data-smoothing method is adopted. For a sample $(x_i = l, x_j = l')$ with computed angle $\theta$ and distance $d$, the tensor is updated as $R(\theta, d, l', l) \leftarrow R(\theta, d, l', l) + 1$; and, because the distance computed from character height may not be particularly accurate and the picture may be rotated by a small angle, neighboring cells are updated by data smoothing:

$$R(\theta+\delta_\theta,\; d+\delta_d,\; l', l) \leftarrow R(\theta+\delta_\theta,\; d+\delta_d,\; l', l) + \tau(\delta_\theta, \delta_d), \qquad 0 < \tau(\delta_\theta, \delta_d) < 1$$

where $\delta_\theta$ and $\delta_d$ are small offsets of angle and distance, and $\tau(\delta_\theta, \delta_d)$ is the smoothing value added when the distance and angle are shifted by the offsets; the larger the offset, the smaller the added smoothing value. Finally, the tensor is normalized so that $\sum_l R(\theta, d, l', l) = 1$.
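A sketch of accumulating and smoothing the tensor $R$ follows; the bin counts and the exact form of $\tau$ are assumptions consistent with $0 < \tau < 1$ and with $\tau$ shrinking as the offset grows:

```python
# Sketch: build the conditional probability tensor R(theta, d, l', l).
import numpy as np

N_THETA, N_DIST, N_LABELS = 36, 30, 21
R = np.zeros((N_THETA, N_DIST, N_LABELS, N_LABELS))

def add_observation(theta_bin, d_bin, l_prime, l, radius=1):
    """Count the sample, plus smoothed counts for neighboring (theta, d) bins."""
    for dt in range(-radius, radius + 1):
        for dd in range(-radius, radius + 1):
            t, d = theta_bin + dt, d_bin + dd
            if 0 <= t < N_THETA and 0 <= d < N_DIST:
                tau = 1.0 if (dt, dd) == (0, 0) else 0.5 / (abs(dt) + abs(dd))
                R[t, d, l_prime, l] += tau       # 0 < tau < 1 off-center

def normalize():
    s = R.sum(axis=-1, keepdims=True)            # enforce sum_l R(.., l', l) = 1
    np.divide(R, s, out=R, where=s > 0)
```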
The DenseCRF model is constructed with the energy function:

$$E(x) = \sum_i \psi_u(x_i) + \sum_i \sum_{j \ne i} \psi_p(x_i, x_j)$$

where $x$ denotes a sample, i.e. the corrected labeling of all text fields after a license picture has been structured by the improved BERT model; $x_i$ is the i-th field; and $\psi_u(x_i)$ is the unary energy function. For each field $x_i$, the binary energy $\psi_p(x_i, x_j)$ produced by every other field $x_j$ acting on $x_i$ is computed. Minimizing the energy value yields the optimal labeling result.

The unary energy function takes the value $\psi_u(x_i) = -\log(p(x_i))$, where $p(x_i)$ is the probability distribution of field $i$ output by the improved BERT model.
The binary energy function is defined as:

$$\psi_p(x_i, x_j) = \mu(x_i, x_j) \sum_{m=1}^{K} w^{(m)} k^{(m)}(f_i, f_j)$$

where $\mu(x_i, x_j)$ is a penalty function measuring the incompatibility of the labels of $x_i$ and $x_j$: the larger the penalty value, the larger the energy function value and the less stable the whole. The invention uses the penalty function $\mu(x_i, x_j) = 1 - p(x_i \mid x_j)$. $k^{(m)}(f_i, f_j)$ is a kernel function measuring the magnitude of the interaction between $x_j$ and $x_i$; $w^{(m)}$ is a weight used to fuse the results of multiple kernel functions; and $K$ is the number of kernel functions. The invention uses a single Gaussian kernel, so that the closer two fields are, the larger the influence on the target field's label:

$$k(f_i, f_j) = \exp\!\left(-\frac{\lVert p_i - p_j \rVert^2}{2\theta^2}\right)$$

where $p_i$ and $p_j$ are the positions of fields $x_i$ and $x_j$ respectively, $\theta$ is the standard-deviation hyperparameter, and $\exp(\cdot)$ is the exponential function.
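The kernel itself is a one-liner; in this sketch $p_i$ and $p_j$ are assumed to be field positions already expressed in character-height units:

```python
# Sketch of the single Gaussian kernel used for the pairwise term.
import numpy as np

def gaussian_kernel(p_i, p_j, theta=3.0):
    diff = np.asarray(p_i, dtype=float) - np.asarray(p_j, dtype=float)
    return float(np.exp(-np.dot(diff, diff) / (2.0 * theta ** 2)))
```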
Solving the DenseCRF model: because the field labels cannot all be correct, directly computing the joint distribution probability of the labeling result $P(X)$ of all labels is very difficult. Instead, a distribution $Q(X)$ is computed with the mean-field approximation method, minimizing the KL divergence $D(Q \| P)$ between the two distributions. Intuitively, the mean-field method iteratively updates the labeling result of each field: in each update, the labels of the other fields are assumed correct, and the label probability distribution of the current target field is updated with the conditional probabilities; this operation is applied to every field, iteratively updating the overall labeling.

Following the DenseCRF derivation, $Q(X)$ is updated as:

$$Q_i(x_i = l) = \frac{1}{Z_i} \exp\Big(-\psi_u(x_i = l) - \sum_{l'} \sum_{j \ne i} \mu(l, l')\, w\, k(f_i, f_j)\, Q_j(x_j = l')\Big)$$

where $Q_i(x_i = l)$ is the probability that the i-th field has label $l$, and $Z_i$, the sum of the probabilities over all possible values of $x_i$, normalizes the probability values.
From the above update formula, the corresponding inference algorithm is obtained.
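The patent text does not reproduce the algorithm here, but a mean-field iteration consistent with the update formula could look like the following sketch; the matrix shapes and the dense per-pair $\mu$ tensor are assumptions:

```python
# Sketch of mean-field inference over the field labels.
import numpy as np

def mean_field(unary, kernel, mu, w=1.0, iters=10):
    """unary:  (n, L) label probabilities p(x_i) from the improved BERT model
       kernel: (n, n) Gaussian kernel values k(f_i, f_j), zero on the diagonal
       mu:     (n, n, L, L) penalties mu[i, j, l, l'] = 1 - R(theta_ij, d_ij, l', l)"""
    Q = unary.copy()
    psi_u = -np.log(np.clip(unary, 1e-9, None))
    for _ in range(iters):
        # pairwise[i, l] = sum_j sum_l' k(f_i, f_j) * mu[i, j, l, l'] * Q[j, l']
        pairwise = w * np.einsum("ij,ijlm,jm->il", kernel, mu, Q)
        Q = np.exp(-psi_u - pairwise)
        Q /= Q.sum(axis=1, keepdims=True)        # the Z_i normalization
    return Q.argmax(axis=1)                      # corrected label index per field
```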
the DenseRF model is learned by a number of parameters, including weights w and kernel functions θ. The solution process by gradient descent is cumbersome due to the presence of the normalized computation Z. The optimal weight and kernel function parameters are obtained by adopting a grid search method.
Step 7: information-extraction post-processing is applied to the key-information-extraction result according to the preset data type of each field. Post-processing includes error correction, date normalization, abnormal-character deletion, word-order adjustment, enumeration-field correction, and the like.
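A small sketch of such type-driven post-processing; the rules shown are illustrative, not the patent's:

```python
# Sketch of per-field post-processing by preset data type.
import re

def postprocess(label, value):
    if label == "Date":                              # date normalization
        m = re.search(r"(\d{4})\D*(\d{1,2})\D*(\d{1,2})", value)
        if m:
            return f"{m.group(1)}-{int(m.group(2)):02d}-{int(m.group(3)):02d}"
        return value
    if label == "IDNumber":                          # abnormal-character deletion
        return re.sub(r"[^0-9Xx]", "", value).upper()
    return value.strip()
```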
Claims (10)
1. A license key information extraction method based on small-sample data, characterized by comprising the following steps:
step 1, acquiring a license picture;
step 2, recognizing each text box and all text content in the license picture with an OCR recognition algorithm to obtain the OCR recognition result $\{(v_j, l_j)\}$, where $v_j$ is the text content of the j-th text box, expressed as a character sequence, and $l_j = [(x_1, y_1), (x_2, y_2), (x_3, y_3), (x_4, y_4)]$, with $(x_1, y_1)$, $(x_2, y_2)$, $(x_3, y_3)$, and $(x_4, y_4)$ the four corner coordinates of the j-th text box;
and 3, marking text information of the OCR recognition result by using an improved BERT model, wherein the improved BERT model comprises an embedding layer, a representation learning layer and a task layer, and the method comprises the following steps of:
will be a sequence of characters { (t) i ,l i ) Sequence of text fragments { (v) i ,l i ) Input into the embedding layer to obtain an embedded representation of each characterEmbedded representation e of each text segment i Then, adding all embedded representations to be input of a TransFormer layer;
representing a text sequence v i The j-th character of (a) character +.>Final embedded representation->The following formula is shown:
wherein:is Token information, is character +.>Is embedded in the representation; l (L) i ) An embedded representation of layout position information; s (S) i ) The embedded representation of the fragment information is the embedded representation of the serial number of the text fragment where each Token information is located; p (i) is an embedded representation of position information, which is an embedded representation of absolute position information in the input sequence;
the final embedded representation e of each text segment i The following formula is shown:
e i =V(v i )+P(i)+S(s i )+L(l i )
wherein: v (V) i ) Is Token information, is used for extracting word sequence v i An embedded representation of the image features obtained from the corresponding text line picture;
the Transformer layer learns the interaction information among the input features and generates a feature representation for each input;
the task layer completes various tasks according to the representation information learned by the Transformer layer, the tasks being divided into pre-training tasks and a target task, wherein: the pre-training tasks comprise a text classification task, a Token prediction task and a text-fragment word-count prediction task; the target task is to output a label for each character, marking which field the current character belongs to, by adding a fully connected layer on top of the embedded representations output by the Transformer layer to classify each character;
step 4, dividing the text fragments recognized by the OCR algorithm into fields based on the text information labeling result obtained by the improved BERT model: selecting the label with the highest probability from the predicted label probabilities output by the improved BERT model, and forming closely adjacent characters recognized by the OCR algorithm that carry the same label into a character segment, namely a field;
step 5, correcting the labeling result of each field by adopting a DenseRF model, and after the DenseRF correction, integrating fields having the same label into key-value pairs, which are taken as the final key information extraction result, comprising the following steps:
step 501, constructing the DenseRF model and defining the DenseRF model energy function, the energy function being given by the following formula:

E(X) = Σ_i ψ_u(x_i) + Σ_{i<j} ψ_p(x_i, x_j)

wherein: x_i represents the i-th field; ψ_u(x_i) represents the unary energy function, ψ_u(x_i) = -log(p(x_i)), where p(x_i) is the probability distribution of field x_i directly output by the improved BERT model; ψ_p(x_i, x_j) represents the binary energy function produced by another field x_j acting on field x_i, defined as:

ψ_p(x_i, x_j) = μ(x_i, x_j) Σ_{m=1}^{K} w^(m) k^(m)(f_i, f_j)
wherein: μ(x_i, x_j) = 1 - p(x_i | x_j), where p(x_i | x_j) is the conditional probability that field x_i carries label l given that field x_j carries label l′, written p(x_i = l | x_j = l′), and:

p(x_i = l | x_j = l′) = R(θ(x_i, x_j), d(x_i, x_j), l′, l)
wherein: r (θ (x) i ,x j ),d(x i ,x j ) L', l) represents a conditional probability tensor, θ and d are the angle and distance between two fields, respectively, wherein the distance between any two fields is measured in terms of word height as a basic distance unit; the conditional probability tensor R (θ (x) i ,x j ),d(x i ,x j ) And l ', l) is simply expressed as R (theta, d, l ', l), then R (theta, d, l ', l) is updated by adopting a data smoothing method, and R (theta+delta) θ ,d+δ d ,l′,l)←R(θ+δ θ ,d+δ d ,l′,l)+τ(δ θ ,δ d ),δ θ 、δ d Small deviations in angle and distance, respectively, τ (δ θ ,δ d ) Is a smoothed value with distance and angle increased by offset, and has 0 < τ (delta) θ ,δ d ) < 1, normalize the conditional probability tensor to Σ l R(θ,d,l′,l)=1;
k^(m)(f_i, f_j) is a kernel function measuring the magnitude of the interaction between field x_j and field x_i;
w^(m) is the weight used for fusing the results of the multiple kernel functions;
K denotes the number of kernel functions;
step 502, solving the DenseRF model by a mean field approximation method, obtaining the optimal labeling result by minimizing the energy value, wherein the mean field approximation computes a distribution Q(X) that minimizes the KL divergence D(Q||P) from the joint distribution probability P(X) of the labeling results of all labels, Q(X) being updated as follows:

Q_i(x_i = l) = (1/Z_i) · exp{ -ψ_u(x_i = l) - Σ_{l′} μ(l, l′) Σ_{m=1}^{K} w^(m) Σ_{j≠i} k^(m)(f_i, f_j) Q_j(x_j = l′) }

wherein Q_i(x_i = l) represents the probability that the i-th field is labeled with label l, and Z_i is the normalization operation over the probability values;
step 6, performing information-extraction post-processing on the key information extraction result according to the preset data type of each field.
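As referenced in step 501 above, the sketch below illustrates one way the conditional probability tensor R(θ, d, l′, l) could be estimated from labeled field pairs, including the neighborhood smoothing update and the normalization over l; the discretization sizes and the constant smoothing value τ are assumptions made for the example.

```python
import numpy as np

N_ANGLE, N_DIST, N_LABELS = 8, 16, 10  # assumed discretization and label count

def build_r_tensor(pairs, tau=0.3):
    """Estimate R[theta_bin, dist_bin, l_prime, l] from observed field pairs.

    pairs: iterable of (theta_bin, dist_bin, l_prime, l) tuples drawn from the
           labeled training documents, with angle and distance already
           bucketed (distance measured in multiples of the word height).
    """
    r = np.zeros((N_ANGLE, N_DIST, N_LABELS, N_LABELS))
    for t, d, lp, l in pairs:
        r[t, d, lp, l] += 1.0
        # Data smoothing: credit neighbouring (theta, d) cells with a
        # discounted count 0 < tau < 1 so sparse statistics generalize.
        for dt, dd in ((-1, 0), (1, 0), (0, -1), (0, 1)):
            tt, td = t + dt, d + dd
            if 0 <= tt < N_ANGLE and 0 <= td < N_DIST:
                r[tt, td, lp, l] += tau
    # Normalize so that the sum over l of R(theta, d, l_prime, l) equals 1.
    totals = r.sum(axis=-1, keepdims=True)
    np.divide(r, totals, out=r, where=totals > 0)
    return r
```

Spreading a discounted count onto neighbouring angle/distance cells is what lets the tensor generalize from only a handful of labeled licenses, which is the point of the small-sample setting.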
2. The license key information extraction method based on small sample data according to claim 1, wherein in step 1, after the license picture is acquired, the background other than the license in the license picture is removed by a cropping function.
3. The license key information extraction method based on small sample data according to claim 1, wherein in step 1, the acquired license picture is compressed and then uploaded to a server, and the server processes the uploaded license picture through the subsequent steps.
4. The license key information extraction method based on small sample data according to claim 1, wherein in step 2, the license picture is preprocessed to improve the accuracy of character recognition before OCR recognition is performed on the license picture.
5. The license key information extraction method based on small sample data according to claim 1, wherein in step 3, before text information labeling is performed on the OCR recognition result using the improved BERT model, the OCR recognition result is preprocessed, the preprocessing operations comprising: calculating the position of each text, cropping the text-box area image, removing special characters, and removing abnormal text boxes and texts.
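A minimal sketch of such preprocessing is shown below; the character whitelist and the in-image bounds check are illustrative choices, and cropping of the text-box area image (which would typically use an image library) is omitted.

```python
import re

def preprocess_ocr(ocr_results, img_w, img_h):
    """Filter and clean raw OCR output before feeding the labeling model.

    ocr_results: list of (text, box) pairs, where box holds the four corner
                 points [(x1, y1), (x2, y2), (x3, y3), (x4, y4)].
    """
    cleaned = []
    for text, box in ocr_results:
        # Remove special characters (keep word chars and a few separators).
        text = re.sub(r"[^\w./:-]", "", text)
        xs, ys = [p[0] for p in box], [p[1] for p in box]
        in_image = (0 <= min(xs) and max(xs) <= img_w
                    and 0 <= min(ys) and max(ys) <= img_h)
        # Drop empty texts and abnormal (out-of-image) boxes.
        if text and in_image:
            cleaned.append((text, box))
    return cleaned
```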
6. The license key information extraction method based on small sample data according to claim 1, wherein in step 3, the embedded representation L(l_i) of the layout position information is calculated by the following formula:

L(l_i) = Concat(P_x(x1, x3), P_y(y1, y3))

wherein P_x(x1, x3) and P_y(y1, y3) are the embedded representations in the horizontal and vertical directions respectively, and Concat(·) denotes the concatenation operation.
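Purely to make the formula concrete, the sketch below realizes P_x and P_y as lookup tables indexed by bucketed coordinate pairs; this table-based parameterization, the bucket count, and the embedding size are assumptions, since the claim does not fix them.

```python
import numpy as np

EMB_DIM, N_BUCKETS = 64, 128  # assumed embedding size and coordinate buckets
rng = np.random.default_rng(0)
P_X = rng.standard_normal((N_BUCKETS, N_BUCKETS, EMB_DIM))  # horizontal table
P_Y = rng.standard_normal((N_BUCKETS, N_BUCKETS, EMB_DIM))  # vertical table

def layout_embedding(box, img_w, img_h):
    """L(l_i) = Concat(P_x(x1, x3), P_y(y1, y3)) for one text box."""
    (x1, y1), _, (x3, y3), _ = box  # corners 1 and 3 span the box diagonally
    # Bucket the absolute coordinates so the lookup tables stay bounded.
    bx1 = int(x1 / img_w * (N_BUCKETS - 1))
    bx3 = int(x3 / img_w * (N_BUCKETS - 1))
    by1 = int(y1 / img_h * (N_BUCKETS - 1))
    by3 = int(y3 / img_h * (N_BUCKETS - 1))
    return np.concatenate([P_X[bx1, bx3], P_Y[by1, by3]])
```

In a trained model these tables would be learned jointly with the rest of the improved BERT parameters rather than drawn at random.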
7. The license key information extraction method based on small sample data according to claim 1, wherein in step 3, the Transformer layer calculates the representation vector of each input using a self-attention mechanism, the self-attention calculation being given by the following formula:

h_i^(k) = Σ_j α_ij (h_j W^V), where α_ij = softmax((h_i W^Q)(h_j W^K)^T / √d_k)

wherein: h_i^(k) and h_j^(k) denote the representation vectors corresponding to the i-th and j-th inputs at layer k, which at the first layer are the final embedded representations output by the embedding layer; α_ij is the attention weight; d_k denotes the dimension of the vector obtained after multiplying h_i by W^Q; W^Q and W^K are transformation matrices, the representation vectors h_i and h_j being multiplied by W^Q and W^K to obtain the query and the key; softmax(·) is the normalization operation over the weights; W^V denotes the mapping matrix.
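The formula in this claim is the standard scaled dot-product self-attention; a compact NumPy rendering is given below (single head and no masking, both simplifications assumed for brevity).

```python
import numpy as np

def self_attention(h, w_q, w_k, w_v):
    """One self-attention layer over n inputs.

    h: (n, d) representation vectors; w_q, w_k, w_v: (d, d_k) matrices.
    """
    q, k, v = h @ w_q, h @ w_k, h @ w_v           # queries, keys, values
    scores = q @ k.T / np.sqrt(q.shape[-1])       # (h_i W_Q)(h_j W_K)^T / sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    alpha = np.exp(scores)
    alpha /= alpha.sum(axis=-1, keepdims=True)    # softmax over j
    return alpha @ v                              # sum_j alpha_ij (h_j W_V)
```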
8. The license key information extraction method based on small sample data according to claim 7, wherein in step 3, characters marked as "Other" in the target task, namely characters that are unimportant in the license or misrecognized by the OCR recognition algorithm, are ignored.
9. The license key information extraction method based on small sample data according to claim 7, wherein the improved BERT model is trained as follows:
first, a BERT model is pre-trained on a large-scale unlabeled corpus; after the BERT model pre-training is completed, its parameters are copied into the improved BERT model as the initial values of the parameters of the improved BERT model; then, using a number of real license pictures, the characters and position information in the licenses are recognized by the OCR recognition algorithm, and each recognized character is labeled manually; finally, a plurality of training samples are generated from the manually labeled samples by a generation tool and used to train the improved BERT model.
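A sketch of the parameter hand-off described in this claim, assuming model parameters are held as name-to-NumPy-array dictionaries; layers that exist only in the improved model (layout, fragment and image embeddings plus the task head) simply keep their fresh initialization.

```python
import numpy as np

def init_from_pretrained(improved_params, bert_params):
    """Copy every pretrained BERT weight whose name and shape both match.

    improved_params, bert_params: dict[str, np.ndarray]. Returns the improved
    model's parameter dict with the matching weights overwritten in place.
    """
    for name, weight in bert_params.items():
        target = improved_params.get(name)
        if target is not None and target.shape == weight.shape:
            improved_params[name] = weight.copy()
    return improved_params
```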
10. The license key information extraction method based on small sample data according to claim 9, wherein when the training samples are generated, manual labeling and data enhancement means are adopted to augment the samples, comprising the following steps:
step 301, reading a manually labeled sample;
step 302, generating a new field sample for each field in the sample obtained in step 301;
step 303, randomly replacing, deleting and truncating characters in the field samples to add noise;
step 304, filling the replacement text-field information generated in step 303 into a blank license background picture according to the position information of the manually labeled sample obtained in step 301, to form an artificially constructed license picture;
step 305, pasting the license picture generated in step 304 onto a background picture;
step 306, applying image distortion and perspective transformation to the picture obtained in step 305 to finally obtain a new sample.
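Steps 301 to 303 amount to character-level noising of each labeled field; one possible rendering is sketched below, with the license character set and the noise rate as assumed values. The rendering, pasting and distortion of steps 304 to 306 would typically be done with an image library such as OpenCV and are only noted in the comments.

```python
import random

ALPHABET = "0123456789ABCDEFGHJKLMNPQRSTUVWXYZ"  # assumed license charset

def perturb_field(text, p=0.15):
    """Step 303: randomly replace, delete and truncate characters."""
    out = []
    for c in text:
        r = random.random()
        if r < p / 3:
            continue                       # deletion
        if r < 2 * p / 3:
            c = random.choice(ALPHABET)    # replacement
        out.append(c)
    if out and random.random() < p:
        out = out[: max(1, len(out) - random.randint(1, 3))]  # truncation
    return "".join(out)

def augment_sample(fields):
    """Steps 301-303: derive a synthetic sample from one labeled sample.

    fields: dict mapping field label -> ground-truth text. The perturbed
    texts would then be drawn onto a blank license background at the
    original positions (step 304), pasted onto a scene background (step
    305), and warped with perspective distortion (step 306).
    """
    return {label: perturb_field(text) for label, text in fields.items()}
```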
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310122860.1A CN116229494A (en) | 2023-02-15 | 2023-02-15 | License key information extraction method based on small sample data |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116229494A true CN116229494A (en) | 2023-06-06 |
Family
ID=86570892
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310122860.1A Pending CN116229494A (en) | 2023-02-15 | 2023-02-15 | License key information extraction method based on small sample data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229494A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116628128A (en) * | 2023-07-13 | 2023-08-22 | 湖南九立供应链有限公司 | Method, device and equipment for standardization of supply chain data and storage medium thereof |
CN116628128B (en) * | 2023-07-13 | 2023-10-03 | 湖南九立供应链有限公司 | Method, device and equipment for standardization of supply chain data and storage medium thereof |
CN117237971A (en) * | 2023-11-10 | 2023-12-15 | 长威信息科技发展股份有限公司 | Food quality inspection report data extraction method based on multi-mode information extraction |
CN117237971B (en) * | 2023-11-10 | 2024-01-30 | 长威信息科技发展股份有限公司 | Food quality inspection report data extraction method based on multi-mode information extraction |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021212749A1 (en) | Method and apparatus for labelling named entity, computer device, and storage medium | |
WO2021212658A1 (en) | Ocr image sample generation method and apparatus, print font verification method and apparatus, and device and medium | |
CN116229494A (en) | License key information extraction method based on small sample data | |
CN110490081B (en) | Remote sensing object interpretation method based on focusing weight matrix and variable-scale semantic segmentation neural network | |
CN108960073A (en) | Cross-module state image steganalysis method towards Biomedical literature | |
CN116861014B (en) | Image information extraction method and device based on pre-training language model | |
WO2022035942A1 (en) | Systems and methods for machine learning-based document classification | |
CA3186697A1 (en) | Classifying pharmacovigilance documents using image analysis | |
CN114416159B (en) | API recommendation method and device based on information enhancement calling sequence | |
CN113642569A (en) | Unstructured data document processing method and related equipment | |
CN117831698B (en) | Intelligent quality control system and method for nursing medical records | |
CN116089648A (en) | File management system and method based on artificial intelligence | |
Akanksh et al. | Automated invoice data extraction using image processing | |
CN117115565B (en) | Autonomous perception-based image classification method and device and intelligent terminal | |
WO2024159819A1 (en) | Training method, layout analysis method, quality assessment method, and apparatuses, device, and medium | |
CN113807218A (en) | Layout analysis method, layout analysis device, computer equipment and storage medium | |
CN112966676A (en) | Document key information extraction method based on zero sample learning | |
CN112307749A (en) | Text error detection method and device, computer equipment and storage medium | |
CN117079298A (en) | Information extraction method, training method of information extraction system and information extraction system | |
CN110929013A (en) | Image question-answer implementation method based on bottom-up entry and positioning information fusion | |
CN113362026B (en) | Text processing method and device | |
CN115512340A (en) | Intention detection method and device based on picture | |
CN114782958A (en) | Text error detection model training method, text error detection method and text error detection device | |
CN114155387A (en) | Similarity Logo discovery method by utilizing Logo mark graphic and text information | |
Batomalaque et al. | Image to text conversion technique for anti-plagiarism system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||