CN115497114B - Structured information extraction method for cigarette logistics receiving bill - Google Patents
Info
- Publication number
- CN115497114B CN202211442689.4A
- Authority
- CN
- China
- Prior art keywords
- picture
- value
- template
- identified
- key
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/414—Extracting the geometrical structure, e.g. layout tree; Block segmentation, e.g. bounding boxes for graphics or text
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
- G06V30/12—Detection or correction of errors, e.g. by rescanning the pattern
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/418—Document matching, e.g. of document images
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The invention discloses a structured information extraction method for cigarette logistics receiving bills, comprising a pre-labeling step and an identification step. Pre-labeling: set a template picture standard for the bill, select a standard template picture, and label keys and values on it, where a key is a fixed keyword in the bill and a value is variable content in the bill. Identification: determine the picture to be identified, match its keys against the keys of the template picture, treat all other text boxes in the picture as value candidate boxes, align the picture to be identified with the template picture using the key correspondences, apply misalignment correction to the value candidate boxes, and extract structured information from the content falling within the template picture's value text boxes. The method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy.
Description
Technical Field
The invention relates to the field of logistics, and in particular to a structured information extraction method for cigarette logistics receiving bills.
Background
In a tobacco logistics scenario, the receiver must confirm the shipper's information and check the bill's information against the records in the system. Manual auditing takes a lot of time and is error-prone; an alternative is to automatically extract the structured information on the bill (dates, numbers, and so on) with an image recognition algorithm and compare it with the structured information recorded in the system.
At present there are two main approaches to extracting structured information from bills. The first post-processes OCR (optical character recognition) output with rules such as regular-expression matching; it is flexible but not very accurate, and in particular it cannot handle printing misalignment. The second uses deep learning to detect the position of each field and then runs OCR on each field; it is accurate, but a large amount of data must be collected, labeled, and used for training for every bill type, so it is inflexible and not widely applicable.
Disclosure of Invention
Based on the above, the invention provides a structured information extraction method for cigarette logistics receiving bills that performs structured information extraction via template alignment and misalignment correction. Only one template picture needs to be labeled per bill type, so the method is flexible and widely applicable; it also handles printing misalignment and achieves high recognition accuracy.
The technical scheme of the invention is as follows:
The method for extracting structured information from a cigarette logistics receiving bill comprises the following steps:
pre-labeling: set a template picture standard for the bill, select a standard template picture, and label keys and values on it, where a key is a fixed keyword in the bill and a value is variable content in the bill;
identification: determine the picture to be identified, match its keys against the keys of the template picture, treat all other text boxes in the picture as value candidate boxes, align the picture to be identified with the template picture using the key correspondences, apply misalignment correction to the value candidate boxes, and extract structured information from the content falling within the template picture's value text boxes.
The thought of the technical scheme is as follows:
A bill consists of two parts: fixed keys (e.g., "Name") and variable values (e.g., "Zhang San"). Each bill type follows a specific layout: key content never changes and key positions are exactly aligned, while value content varies (even in length and number of lines), but its position only fluctuates around a preset location.
Based on these characteristics of bills, the invention selects a standard template picture for each bill type and labels its keys and values; keys in the picture to be identified are matched and associated with the template keys, template alignment is performed via perspective transformation, and the corresponding structured information can then be extracted near the preset value box positions.
In the pre-labeling step, the template picture standard for the bill is a flat picture with no tilt and no printing misalignment.
In the pre-labeling step, keys are labeled as follows:
each fixed keyword in the template picture is given a rectangular box annotation and a text content annotation, with a tight rectangular box covering only the keyword area.
In the pre-labeling step, values are labeled as follows:
every field that needs to be identified (other than the labeled keys) in the template picture is given a rectangular box annotation and a field name annotation.
The identifying step further comprises the following steps:
All text boxes and their text content in the picture to be identified are detected and recognized by OCR.
The identifying step further comprises the following steps:
Keyword matching is used to decide whether each text box and its text content in the picture to be identified belongs to a key text box of the template picture. If so, the key in the picture to be identified is associated with the key of the template picture, forming one group of key correspondences; if not, the text box and its content are kept as a value candidate box. If no group of key correspondences exists, the current picture cannot be identified.
The identifying step further comprises the following steps:
The picture to be identified is aligned with the template picture using the key correspondences: the 4 vertices of each text box are extracted, and each group of correspondences between a key of the picture to be identified and a template picture key yields 4 vertex correspondences. When N groups of key correspondences exist, N×4 vertex coordinate correspondences are established; a homography matrix is computed from these correspondences, and the picture to be identified is aligned with the template picture by perspective transformation.
The identifying step further comprises the following steps:
All value candidate boxes are translated at least once according to a preset rule; the alignment degree between the value candidate boxes and the template value boxes is computed for each displacement, the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement, yielding the misalignment-corrected value boxes of the picture to be identified and their content.
The identifying step further comprises the following steps:
Structured information is extracted near the template picture's value text boxes: for each value candidate box, the template picture value text box with the largest overlap area is found; if their overlap exceeds a set threshold, the candidate box is associated with that template value text box, otherwise the candidate box is ignored;
after all value candidate boxes have been associated, the candidate boxes associated with each template picture value text box constitute the content of that value field, and concatenating their text yields the extracted structured information for the field. If a template picture value text box has no associated candidate box, that field cannot be identified.
The beneficial effects of the invention are as follows:
1. Structured information extraction based on template alignment and misalignment correction requires labeling only one template picture per bill type; the method is flexible and widely applicable, handles printing misalignment, and achieves high recognition accuracy;
2. a standard template picture is selected for each bill type and labeled with keys and values; keys in the picture to be identified are matched and associated with the template keys, and template alignment is performed by perspective transformation, so the corresponding structured information can be extracted near the preset value box positions;
3. to counter interference from misaligned value printing, misalignment correction is performed after key-based template alignment according to the alignment degree between the values of the picture to be identified and the template values, which greatly improves the accuracy of structured information extraction.
Detailed Description
The following describes embodiments of the present invention in detail.
Examples:
a method for extracting structured information of a cigarette logistics receiving bill comprises the following steps:
pre-labeling: select a standard template picture and label its keys and values;
identification: perform OCR detection and recognition on the picture to be identified, match key text boxes in the OCR results against template keys, keep the remaining text boxes as value candidate boxes, align the picture to be identified with the template picture using the key correspondences, apply misalignment correction to the value candidate boxes, and extract structured information near the value boxes preset by the template.
The idea of the above embodiment is as follows:
A bill consists of two parts: fixed keys (e.g., "Name") and variable values (e.g., "Zhang San"). Each bill type follows a specific layout: key content never changes and key positions are exactly aligned, while value content varies (even in length and number of lines), but its position only fluctuates around a preset location.
Based on these characteristics of bills, the design selects a standard template picture for each bill type and labels its keys and values; keys in the picture to be identified are matched and associated with the template keys, template alignment is performed via perspective transformation, and the corresponding structured information can then be extracted near the preset value box positions.
In the pre-labeling step, the selected standard template picture is flat, with no tilt and no printing misalignment.
In the pre-labeling step, keys are the fixed keywords in the bill; each key receives a rectangular box annotation (a tight box covering only the keyword area) and a text content annotation. Values are the variable content in the bill; not all content is labeled, only the fields that need to be identified, each of which receives a rectangular box annotation and a field name annotation.
In the identification step, all text boxes and their text content in the picture are detected and recognized by OCR.
In the identification step, keyword matching decides whether each OCR text box belongs to a key text box; if so, the text box is associated with the corresponding template key, forming one group of key correspondences, otherwise the text box is kept as a value candidate box. If no group of key correspondences exists, the current picture cannot be identified.
In the identification step, the picture to be identified is aligned with the template picture using the key correspondences. The 4 vertices of each text box are extracted, and each group of correspondences between a key text box and a template key yields 4 vertex correspondences; with N groups of key correspondences, N×4 vertex coordinate correspondences can be established. A homography matrix is computed from these correspondences, the picture to be identified is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture with the same transformation matrix.
In the identification step, misalignment correction is applied to the value candidate boxes. After template alignment, if there is no printing misalignment, each value candidate box falls inside its template value box; if printing misalignment exists, the candidate boxes are offset and align poorly with the template value boxes, falling on their edges or outside them.
All value candidate boxes are translated multiple times within a fixed range: they are shifted up, down, left, and right within a radius of 50 pixels in the x and y directions around the candidate boxes' original positions, in steps of 10 pixels, giving (50/10×2+1)² = 121 translations in total. The alignment degree between the value candidate boxes and the template value boxes is computed for each displacement, the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
Denote the i-th template value box as tv_i (n template value boxes in total) and the j-th value candidate box as v_j (m candidate boxes in total); let intersect be the intersection operation, area the area operation, and bin the binarization function (1 if the condition holds, 0 otherwise). The original formula image is missing; consistent with these definitions and with alignment meaning that a candidate box falls entirely inside some template value box, alignment_ratio can be reconstructed as:

alignment_ratio = (1/m) · Σ_{j=1..m} max_{i=1..n} bin( area(intersect(tv_i, v_j)) = area(v_j) )
In the identification step, structured information is extracted near the value boxes preset by the template. For each value candidate box, the template value box with the largest overlap area is found; if the overlap exceeds a set threshold, the candidate box is associated with that template value box, otherwise it is ignored. After all candidate boxes have been associated, the candidate boxes associated with each template value box constitute the content of that value field; concatenating their text yields the extracted structured information for the field. If a template value box has no associated candidate box, the field cannot be identified. Denote the template value box as tv and the value candidate box as v; the original formula image is missing, and consistent with the definitions above the overlap degree overlap_ratio can be reconstructed as:

overlap_ratio = area(intersect(tv, v)) / area(v)
the method is flexible and high in applicability, can solve the problem of printing dislocation, and is high in recognition accuracy. And selecting a standard template picture for each bill to carry out key and value labeling, matching the key in the picture to be identified with the template key to carry out association, and carrying out template alignment through perspective transformation, so that corresponding structured information can be extracted near the preset value frame position. In order to solve the interference caused by the value printing dislocation, after the key template is aligned, dislocation correction is carried out according to the alignment degree of the to-be-identified picture value and the template value, so that the accuracy of the extraction of the structured information is greatly improved.
The steps in the pre-labeling stage are as follows:
1. a standard template picture is selected.
The template picture should be as flat as possible, with no tilt and no printing misalignment.
2. And (5) performing key and value labeling on the template picture.
The key is a fixed keyword in the bill; each key receives a rectangular box annotation (a tight box covering only the keyword area) and a text content annotation.
The value is variable content in the bill; not all content needs to be labeled, only the fields to be identified. Each value receives a rectangular box annotation (a wide box covering the whole range of positions where the field content may appear) and a field name annotation (keys and values are not in one-to-one correspondence, and some values have no key at all, so the field corresponding to each value is specified directly).
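As an illustration of the labeling just described, a template annotation might be stored as follows. The layout, field names, keyword texts, and coordinates below are all hypothetical, not specified by the patent:

```python
# Hypothetical annotation for one template picture: tight boxes with text for
# keys, wide boxes with field names for values (boxes are 4 (x, y) vertices).
TEMPLATE = {
    "keys": [
        {"box": [(100, 40), (180, 40), (180, 70), (100, 70)], "text": "Date:"},
        {"box": [(100, 90), (180, 90), (180, 120), (100, 120)], "text": "No.:"},
    ],
    "values": [
        # wide boxes cover the full range where the field content may appear;
        # values carry a field name directly, since some values have no key
        {"box": [(190, 35), (480, 35), (480, 75), (190, 75)], "field": "date"},
        {"box": [(190, 85), (480, 85), (480, 125), (190, 125)], "field": "number"},
    ],
}
```

Only one such annotation is needed per bill type, which is the source of the method's flexibility.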
Building on the summary in the above example, the steps are described in further detail below.
The steps in the identification phase are as follows:
1. Perform OCR detection and recognition on the picture to be identified.
All text boxes and their text content in the picture are detected and recognized by OCR.
2. From the OCR results, match key text boxes and associate them with template keys; keep the remaining text boxes as value candidate boxes.
Keyword matching decides whether each OCR text box belongs to a key text box; if so, the text box is associated with the corresponding template key, forming one group of key correspondences, otherwise it is kept as a value candidate box.
If no group of key correspondences exists, the current picture cannot be identified; otherwise, continue with step 3.
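Steps 1 and 2 can be sketched as follows. The (box, text) format for OCR results and the exact-text matching rule are assumptions for illustration, not the patent's prescribed interface:

```python
def split_keys_and_values(ocr_results, template_keys):
    """Separate OCR detections into matched key pairs and value candidates.

    ocr_results: list of (box, text), box being 4 (x, y) vertices.
    template_keys: dict mapping keyword text -> its labeled box on the template.
    Returns (key_pairs, value_candidates); raises if no key matches, since the
    picture then cannot be identified.
    """
    key_pairs = []          # (detected_box, template_box) per matched keyword
    value_candidates = []   # (box, text) for everything else
    for box, text in ocr_results:
        tpl_box = template_keys.get(text.strip())
        if tpl_box is not None:
            key_pairs.append((box, tpl_box))
        else:
            value_candidates.append((box, text))
    if not key_pairs:
        raise ValueError("no key matched: picture cannot be identified")
    return key_pairs, value_candidates
```

Real bills would need fuzzier matching (OCR noise, partial keywords), but the split into key correspondences and value candidates is the same.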
3. Align the picture to be identified with the template picture using the key correspondences.
The 4 vertices of each text box are extracted, and each group of correspondences between a key text box and a template key yields 4 vertex correspondences; with N groups of key correspondences, N×4 vertex coordinate correspondences can be established. A homography matrix is computed from these correspondences, the picture to be identified is aligned with the template picture by perspective transformation, and the positions of the value candidate boxes are mapped onto the aligned picture with the same transformation matrix.
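The homography estimation from stacked vertex correspondences can be sketched with NumPy as a least-squares (DLT) solve. This is an illustrative sketch, not the patent's implementation; in practice a library routine such as an OpenCV homography fit would typically be used:

```python
import numpy as np

def fit_homography(src_pts, dst_pts):
    """Estimate the 3x3 homography H mapping src -> dst.

    src_pts, dst_pts: sequences of (x, y); with N matched keys contributing
    4 vertices each, N*4 >= 4 correspondences are available.
    """
    rows = []
    for (x, y), (xp, yp) in zip(src_pts, dst_pts):
        # each correspondence yields two linear equations in h11..h33
        rows.append([x, y, 1, 0, 0, 0, -x * xp, -y * xp, -xp])
        rows.append([0, 0, 0, x, y, 1, -x * yp, -y * yp, -yp])
    A = np.asarray(rows, dtype=float)
    # the solution is the right singular vector of the smallest singular value
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H (also used to move value candidate boxes)."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))])
    mapped = homog @ H.T
    return mapped[:, :2] / mapped[:, 2:3]
```

The same `apply_homography` transforms the value candidate box vertices onto the aligned picture.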
The principles of homography matrices and perspective transformation are prior art and are not repeated here.
4. Perform misalignment correction on the value candidate boxes.
After template alignment, if there is no printing misalignment, each value candidate box should fall as fully as possible inside its template value box; if printing misalignment exists, the candidate boxes are offset to some degree and align poorly with the template value boxes (falling on their edges or outside them).
All value candidate boxes are translated multiple times within a fixed range (for example, shifted up, down, left, and right within a radius of 50 pixels in the x and y directions around the candidate boxes' original positions, in steps of 10 pixels, giving (50/10×2+1)² = 121 translations in total). The alignment degree between the value candidate boxes and the template value boxes is computed for each displacement, the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement.
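The displacement search just described can be sketched as a grid search. The axis-aligned (x1, y1, x2, y2) box format and the pluggable scoring function are assumptions for illustration:

```python
def best_displacement(value_boxes, template_boxes, alignment_fn,
                      radius=50, step=10):
    """Grid-search translations of all value candidate boxes together.

    Tries every (dx, dy) with dx, dy in [-radius, radius] at the given step
    (radius=50, step=10 -> 11*11 = 121 offsets), scores each shifted layout
    with alignment_fn(shifted_boxes, template_boxes), and returns the
    corrected boxes together with the winning displacement.
    """
    offsets = range(-radius, radius + 1, step)
    best = (0, 0)
    best_score = -1.0
    for dx in offsets:
        for dy in offsets:
            shifted = [(x1 + dx, y1 + dy, x2 + dx, y2 + dy)
                       for (x1, y1, x2, y2) in value_boxes]
            score = alignment_fn(shifted, template_boxes)
            if score > best_score:
                best_score = score
                best = (dx, dy)
    dx, dy = best
    corrected = [(x1 + dx, y1 + dy, x2 + dx, y2 + dy)
                 for (x1, y1, x2, y2) in value_boxes]
    return corrected, best
```

Because all candidate boxes are shifted by one shared (dx, dy), the search models a global printing offset rather than per-box drift.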
Denote the i-th template value box as tv_i (n template value boxes in total) and the j-th value candidate box as v_j (m candidate boxes in total); let intersect be the intersection operation, area the area operation, and bin the binarization function (1 if the condition holds, 0 otherwise). The original formula image is missing; consistent with these definitions and with alignment meaning that a candidate box falls entirely inside some template value box, alignment_ratio can be reconstructed as:

alignment_ratio = (1/m) · Σ_{j=1..m} max_{i=1..n} bin( area(intersect(tv_i, v_j)) = area(v_j) )
5. Extract the structured information near the value boxes preset by the template.
For each value candidate box, find the template value box with the largest overlap area; if the overlap exceeds a set threshold (e.g., 0.6), associate the candidate box with that template value box, otherwise ignore it. After all candidate boxes have been associated, the candidate boxes associated with each template value box constitute the content of that value field; concatenating their text yields the extracted structured information for the field. If a template value box has no associated candidate box, the field cannot be identified.
Denote the template value box as tv and the value candidate box as v. The original formula image is missing; consistent with the definitions above, the overlap degree overlap_ratio can be reconstructed as:

overlap_ratio = area(intersect(tv, v)) / area(v)
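A minimal sketch of this association step, under the assumption that the overlap degree is the intersection area divided by the candidate box's own area and that boxes are axis-aligned (x1, y1, x2, y2) tuples (both assumptions for illustration):

```python
def area(box):
    x1, y1, x2, y2 = box
    return max(0, x2 - x1) * max(0, y2 - y1)

def intersect(a, b):
    return (max(a[0], b[0]), max(a[1], b[1]), min(a[2], b[2]), min(a[3], b[3]))

def overlap_ratio(tv, v):
    # assumed form: intersection area over the candidate box's own area
    return area(intersect(tv, v)) / area(v) if area(v) else 0.0

def associate(value_candidates, template_values, threshold=0.6):
    """Attach each (box, text) candidate to its best-overlapping template
    value field; return {field: concatenated text or None if unidentified}."""
    fields = {field: [] for _, field in template_values}
    for box, text in value_candidates:
        best_field, best = None, 0.0
        for tv_box, field in template_values:
            r = overlap_ratio(tv_box, box)
            if r > best:
                best, best_field = r, field
        if best_field is not None and best > threshold:
            fields[best_field].append((box, text))
    # concatenate texts in (rough) reading order; empty fields stay unidentified
    return {f: " ".join(t for _, t in sorted(items)) or None
            for f, items in fields.items()}
```

Multi-line values work naturally here: each line's box lands in the same wide template value box, and the texts are concatenated.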
The homography matrix is computed as follows:
for the text box of each key, obtain the coordinates of the key text box labeled on the template picture and the coordinates of the matching key text box in the picture to be identified, and establish the corresponding homography matrix from these coordinate correspondences; then, using the homography matrix between the template picture's labeled keys and the matched keys of the picture to be identified, transform the coordinates of the values of the picture to be identified into aligned coordinates, obtaining the aligned values.
Specifically, for the text box of each key, obtain the four vertex coordinates of the key text box labeled on the template picture and the four vertex coordinates of the matching key text box in the picture to be identified, and establish the following homography relation from the correspondence between the four pairs of vertex coordinates:
x'_k = (h11·x_k + h12·y_k + h13) / (h31·x_k + h32·y_k + 1),
y'_k = (h21·x_k + h22·y_k + h23) / (h31·x_k + h32·y_k + 1),  k = 1, 2, 3, 4   (1)

(The original equation image is missing; this is the standard homography relation with h33 normalized to 1, consistent with the eight unknowns listed below.)
where (x1, y1), (x2, y2), (x3, y3), (x4, y4) are the four vertex coordinates of the text box of the current key detected in the picture to be identified, (x'1, y'1), (x'2, y'2), (x'3, y'3), (x'4, y'4) are the four vertex coordinates of the text box of the current key labeled on the template picture, and h11, h12, h13, h21, h22, h23, h31, h32 are the unknown parameters to be solved. Substituting the corresponding coordinate values into formula (1) yields the 8 unknown parameters of the homography matrix; feeding the solved homography matrix into an image transformation model gives the aligned keys of the picture to be identified. The image transformation model of this embodiment is a perspective transformation model. The matched values of the picture to be identified are aligned in the same way and this is not repeated here.
In a real application scenario, every set of matched coordinate points (for example, (x1, y1) and (x'1, y'1) above form one pair, hereinafter a point pair) contains noise: a coordinate may deviate by several pixels, and feature point pairs may even be mismatched. If the homography matrix were computed from only four points, large errors could occur; therefore, to make the computation more accurate, the homography matrix is usually computed from far more than four points.
In the above embodiment, the four vertices of all key text boxes are used to compute the homography matrix, using the RANSAC method, which proceeds as follows:
(1) Randomly select 4 pairs of matched feature points from the initial set S of matched point pairs as the inlier set Si, and estimate an initial homography matrix Hi;
(2) Project the remaining matched point pairs in S with Hi; if a feature point's projection error is smaller than a threshold t, add it to Si;
(3) Record the number of matched point pairs in the set Si;
(4) Repeat steps (1) to (3) until the number of iterations exceeds K;
(5) The estimated model whose iteration produced the largest number of point pairs is the homography matrix sought.
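The RANSAC loop above can be sketched generically as follows. The model-fitting and projection routines are passed in as parameters, and all names here are illustrative rather than the patent's implementation:

```python
import random

def ransac_homography(pairs, fit_fn, project_fn, t=3.0, k=100, seed=0):
    """Generic RANSAC over matched point pairs.

    pairs: list of ((x, y), (x', y')) matched point pairs.
    fit_fn(src, dst) -> model fitted to a minimal sample of 4 pairs.
    project_fn(model, pt) -> predicted (x', y') for a source point.
    """
    rng = random.Random(seed)
    best_model, best_inliers = None, []
    for _ in range(k):
        sample = rng.sample(pairs, 4)              # step (1): 4 random pairs
        model = fit_fn([s for s, _ in sample], [d for _, d in sample])
        inliers = []
        for src, dst in pairs:                     # step (2): projection error
            px, py = project_fn(model, src)
            if ((px - dst[0]) ** 2 + (py - dst[1]) ** 2) ** 0.5 < t:
                inliers.append((src, dst))
        if len(inliers) > len(best_inliers):       # steps (3)-(5): keep best
            best_model, best_inliers = model, inliers
    return best_model, best_inliers
```

For a quick check one can plug in a toy translation model (mean offset of the sample) instead of a full homography fit; outlier pairs are then rejected while the consistent majority is recovered.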
The foregoing examples merely illustrate specific embodiments of the invention, which are described in greater detail and are not to be construed as limiting the scope of the invention. It should be noted that it will be apparent to those skilled in the art that several variations and modifications can be made without departing from the spirit of the invention, which are all within the scope of the invention.
Claims (6)
1. A method for extracting structured information from a cigarette logistics receiving bill, characterized by comprising the following steps:
pre-labeling: setting a template picture standard for the bill, selecting a standard template picture, and labeling keys and values on the template picture, wherein a key is a fixed keyword in the bill and a value is variable content in the bill;
identification: determining a picture to be identified, matching the keys of the picture to be identified with the keys of the template picture, setting the text boxes other than keys in the picture to be identified as value candidate boxes, aligning the picture to be identified with the template picture according to the key correspondences, performing misalignment correction on the value candidate boxes, and extracting structured information according to the content in the value text boxes of the template picture;
in the pre-labeling step, keys are labeled as follows: rectangular box annotation and text content annotation are performed on the template picture, with a tight rectangular box set as the keyword area; values are labeled as follows: rectangular box annotation and field name annotation are performed on the fields to be identified, other than the labeled keys, in the template picture;
in the identification step, misalignment correction of the value candidate boxes is performed as follows: all value candidate boxes are translated at least once according to a preset rule, the alignment degree between the value candidate boxes and the template value boxes is computed for each displacement, the displacement with the highest alignment degree is selected as the final misalignment displacement, and all value candidate boxes are corrected by that displacement, yielding the misalignment-corrected value boxes of the picture to be identified and their content.
2. The method for extracting structured information from a cigarette logistics receiving bill according to claim 1, wherein in the pre-labeling step, the template picture standard for the bill is a flat picture with no tilt and no printing misalignment.
3. The method for extracting structured information from a cigarette logistics goods receipt according to claim 1 or 2, wherein the identification step further comprises:
detecting and recognizing all text boxes and text content in the picture to be identified by OCR.
4. The method for extracting structured information from a cigarette logistics goods receipt according to claim 3, wherein the identification step further comprises:
judging, by keyword matching, whether each text box and its text content in the picture to be identified belongs to a key text box of the template picture; if so, associating the key in the picture to be identified with the corresponding key of the template picture to form a group of key correspondences; if not, treating the text box and its text content as a value candidate box; if not a single group of key correspondences exists, recognition of the current picture fails.
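The keyword-matching step can be sketched as follows. Exact string matching on the recognized text is an assumption; the claim does not specify the matching rule, and a practical system might need fuzzy matching to tolerate OCR errors.

```python
# Sketch of the keyword-matching partition from claim 4 (assumed:
# exact-string matching of OCR text against template key text).
def match_keys(detected, template_keys):
    """detected: list of (box, text) pairs from OCR.
    template_keys: dict mapping key text -> template key box.
    Returns (key_pairs, value_candidates), where key_pairs is a list of
    (detected_box, template_box) correspondences."""
    key_pairs, value_candidates = [], []
    for box, text in detected:
        if text in template_keys:
            # matched a fixed keyword: record the key correspondence
            key_pairs.append((box, template_keys[text]))
        else:
            # everything else is a value candidate box
            value_candidates.append((box, text))
    if not key_pairs:
        # per claim 4: with no key correspondence, recognition fails
        raise ValueError("recognition failed: no key matched the template")
    return key_pairs, value_candidates
```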
5. The method for extracting structured information from a cigarette logistics goods receipt according to claim 4, wherein the identification step further comprises:
aligning the picture to be identified with the template picture according to the key correspondences: extracting the 4 vertices of each text box, and establishing 4 groups of vertex correspondences for each pair of matched keys between the picture to be identified and the template picture;
when N groups of key correspondences exist, establishing N×4 groups of vertex-coordinate correspondences, computing a homography matrix from these correspondences, and aligning the picture to be identified with the template picture by perspective transformation.
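The alignment step above fits one homography to all N×4 vertex pairs. The claim names the homography but not the solver; the sketch below uses the standard direct linear transform (DLT) least-squares fit via SVD (in practice `cv2.findHomography` would typically be used instead).

```python
# Sketch of homography fitting and point warping (assumed solver: DLT
# with SVD; the patent specifies only "compute a homography matrix").
import numpy as np

def fit_homography(src_pts, dst_pts):
    """Fit a 3x3 matrix H with dst ~ H @ src from >= 4 point pairs."""
    rows = []
    for (x, y), (u, v) in zip(src_pts, dst_pts):
        # each correspondence contributes two linear constraints on H
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    A = np.asarray(rows, dtype=float)
    _, _, vt = np.linalg.svd(A)
    H = vt[-1].reshape(3, 3)     # null-space vector = flattened H
    return H / H[2, 2]           # normalize so H[2, 2] == 1

def warp_points(H, pts):
    """Apply the homography to 2-D points (the perspective transform)."""
    p = np.hstack([np.asarray(pts, dtype=float), np.ones((len(pts), 1))])
    q = p @ H.T
    return q[:, :2] / q[:, 2:3]  # divide out the projective coordinate
```

With N matched keys, `src_pts`/`dst_pts` would hold all N×4 box vertices, so the fit is over-determined and averages out per-box noise.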
6. The method for extracting structured information from a cigarette logistics goods receipt according to claim 5, wherein the identification step further comprises:
extracting structured information from the content near the template picture value text boxes: for each value candidate box, finding the template picture value text box with the largest overlap area; if the degree of overlap between the two exceeds a set threshold, associating the value candidate box with that template value text box, otherwise ignoring the value candidate box;
after all value candidate boxes have been associated, the value candidate boxes associated with each template picture value text box constitute the content of that value field, and their text content is concatenated to obtain the extracted structured information for the field; if a template picture value text box has no associated value candidate box, the field cannot be recognized.
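The association step of claim 6 can be sketched as below. The overlap measure (intersection area over candidate-box area) and the 0.5 threshold are assumptions; the claim requires only "degree of overlap above a set threshold". Boxes are (x1, y1, x2, y2).

```python
# Sketch of value-candidate association and per-field concatenation
# (assumed overlap measure: intersection area / candidate area).
def associate_values(value_candidates, template_value_fields, threshold=0.5):
    """value_candidates: list of (box, text) pairs.
    template_value_fields: dict mapping field name -> template value box.
    Returns field name -> concatenated text of its associated candidates."""
    def inter(a, b):
        w = min(a[2], b[2]) - max(a[0], b[0])
        h = min(a[3], b[3]) - max(a[1], b[1])
        return max(w, 0) * max(h, 0)

    fields = {name: [] for name in template_value_fields}
    for box, text in value_candidates:
        # find the template value box with the largest overlap area
        name, tbox = max(template_value_fields.items(),
                         key=lambda kv: inter(box, kv[1]))
        area = (box[2] - box[0]) * (box[3] - box[1])
        if area > 0 and inter(box, tbox) / area > threshold:
            fields[name].append(text)
    # concatenate per field; a field left empty could not be recognized
    return {name: "".join(texts) for name, texts in fields.items()}
```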
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211442689.4A CN115497114B (en) | 2022-11-18 | 2022-11-18 | Structured information extraction method for cigarette logistics receiving bill |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115497114A CN115497114A (en) | 2022-12-20 |
CN115497114B true CN115497114B (en) | 2024-03-12 |
Family
ID=85116135
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211442689.4A Active CN115497114B (en) | 2022-11-18 | 2022-11-18 | Structured information extraction method for cigarette logistics receiving bill |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115497114B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116382696B (en) * | 2023-03-18 | 2024-06-07 | 宝钢工程技术集团有限公司 | Engineering attribute dynamic analysis and submission method based on factory object position number |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111709339B (en) * | 2020-06-09 | 2023-09-19 | 北京百度网讯科技有限公司 | Bill image recognition method, device, equipment and storage medium |
Patent Citations (15)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6539112B1 (en) * | 1999-02-26 | 2003-03-25 | Raf Technology, Inc. | Methods and system for identifying a reference region on an image of a dropped-out form |
US8724907B1 (en) * | 2012-03-28 | 2014-05-13 | Emc Corporation | Method and system for using OCR data for grouping and classifying documents |
CN111241974A (en) * | 2020-01-07 | 2020-06-05 | 深圳追一科技有限公司 | Bill information acquisition method and device, computer equipment and storage medium |
WO2022057471A1 (en) * | 2020-09-17 | 2022-03-24 | 深圳壹账通智能科技有限公司 | Bill identification method, system, computer device, and computer-readable storage medium |
CN111931784A (en) * | 2020-09-17 | 2020-11-13 | 深圳壹账通智能科技有限公司 | Bill recognition method, system, computer device and computer-readable storage medium |
CN112699867A (en) * | 2020-09-27 | 2021-04-23 | 民生科技有限责任公司 | Fixed format target image element information extraction method and system |
CN112613367A (en) * | 2020-12-14 | 2021-04-06 | 盈科票据服务(深圳)有限公司 | Bill information text box acquisition method, system, equipment and storage medium |
CN112861782A (en) * | 2021-03-07 | 2021-05-28 | 上海大学 | Bill photo key information extraction system and method |
CN115147855A (en) * | 2021-03-30 | 2022-10-04 | 上海聚均科技有限公司 | Method and system for carrying out batch OCR (optical character recognition) on bills |
CN113158895A (en) * | 2021-04-20 | 2021-07-23 | 北京中科江南信息技术股份有限公司 | Bill identification method and device, electronic equipment and storage medium |
CN113191348A (en) * | 2021-05-31 | 2021-07-30 | 山东新一代信息产业技术研究院有限公司 | Template-based text structured extraction method and tool |
CN113657377A (en) * | 2021-07-22 | 2021-11-16 | 西南财经大学 | Structured recognition method for airplane ticket printing data image |
CN113903024A (en) * | 2021-09-28 | 2022-01-07 | 合肥高维数据技术有限公司 | Handwritten bill numerical value information identification method, system, medium and device |
CN115063784A (en) * | 2022-06-08 | 2022-09-16 | 杭州未名信科科技有限公司 | Bill image information extraction method and device, storage medium and electronic equipment |
CN115240178A (en) * | 2022-06-24 | 2022-10-25 | 深源恒际科技有限公司 | Structured information extraction method and system for bill image |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108961236B (en) | Circuit board defect detection method and device | |
WO2020098250A1 (en) | Character recognition method, server, and computer readable storage medium | |
CN110546651B (en) | Method, system and computer readable medium for identifying objects | |
WO2014103297A1 (en) | Object identification device, method, and storage medium | |
CN104584071B (en) | Object detector, object identification method | |
CN101339566B (en) | Image processing method, image processing apparatus, image reading apparatus and image forming apparatus | |
CN104584073A (en) | Object discrimination device, object discrimination method, and program | |
CN111860527A (en) | Image correction method, image correction device, computer device, and storage medium | |
CN115497114B (en) | Structured information extraction method for cigarette logistics receiving bill | |
CN107680112B (en) | Image registration method | |
CN111160427B (en) | Method for detecting mass flow data type based on neural network | |
CN114359553B (en) | Signature positioning method and system based on Internet of things and storage medium | |
JP6188052B2 (en) | Information system and server | |
CN112396047B (en) | Training sample generation method and device, computer equipment and storage medium | |
CN111667479A (en) | Pattern verification method and device for target image, electronic device and storage medium | |
CN105678301B (en) | method, system and device for automatically identifying and segmenting text image | |
CN111414905A (en) | Text detection method, text detection device, electronic equipment and storage medium | |
CN115375917B (en) | Target edge feature extraction method, device, terminal and storage medium | |
CN107862319A (en) | A kind of heterologous high score optical image matching error elimination method based on neighborhood ballot | |
CN113095187A (en) | Examination paper correction method based on image feature matching alignment | |
JPH04306067A (en) | Method and system for identifying image region | |
CN114694161A (en) | Text recognition method and equipment for specific format certificate and storage medium | |
US7508978B1 (en) | Detection of grooves in scanned images | |
JP2003109007A (en) | Device, method and program for classifying slip form and image collating device | |
CN113362380B (en) | Image feature point detection model training method and device and electronic equipment thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||