CN110991265B

CN110991265B - Layout extraction method for train ticket image

Info

Publication number: CN110991265B
Application number: CN201911103715.9A
Authority: CN
Inventors: 王俊峰; 唐鹏; 高琳; 陈懿
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-11-13
Filing date: 2019-11-13
Publication date: 2022-03-04
Anticipated expiration: 2039-11-13
Also published as: CN110991265A

Abstract

The invention discloses a layout extraction method for train ticket images. A captured image is adaptively binarized and its contours are simplified, and the convex quadrilateral with the largest area is extracted as the outer contour of the train ticket. A projective transformation matrix is calculated from the contour vertices, and the size and gray values of the train ticket image are standardized. Character lines are then detected, and character lines that are too small or too large are deleted from the character line set. The remaining character lines are clustered with DBSCAN according to their vertical coordinates. Finally, attributes are assigned to the clustered and sorted character lines according to template rules, which completes the layout analysis. The method improves the robustness of ticket-face analysis for railway tickets, reduces the workload of financial staff, supports more intelligent information entry in financial ticket systems, helps bridge paper invoices and electronic tax records, and promotes the adoption of intelligent invoice recognition technology.

Description

Layout extraction method for train ticket image

Technical Field

The invention relates to the field of automatic processing of train ticket images, in particular to a layout extraction method of a train ticket image.

Background

A train ticket is a voucher for a purchased rail journey; it is used for ticket checks during travel and at the exit, and for subsequent financial reimbursement. Because China covers a vast territory, rail travel is both economical and efficient, and its importance is self-evident. With the construction of high-speed lines such as the Sichuan-Tibet railway, China's railway network is steadily improving, and rail travel will occupy an even more important position in the future. As China's railways expand internationally, there is an urgent need to identify and analyze the contents of tickets, which serve as travel vouchers, automatically by computational means. Railway tickets remain, now and for a long time to come, an important item of financial reimbursement and play an important role in national economic development. Train tickets make up a very large share of reimbursement work, and their specialized printing specification and layout call for automatic layout analysis and content recognition based on intelligent image processing.

The form and content of railway tickets reflect the level of modernization of railway construction in China. Chinese railway tickets have had different forms and systems in different historical periods, changing from hard-board tickets to soft-paper tickets and then to magnetic-card tickets. After the founding of the People's Republic of China, the first generation of China Railway tickets were hard-board tickets measuring 57 × 25 mm, with braille printed on the ticket face. Trains were divided into fast and slow services: the ticket face for a fast train was printed with one red line, and that for an express train with two red lines. The background shading colors were specified as follows: light blue for soft-seat tickets, light red for hard-seat tickets, light purple for suburban tickets, light green for simple tickets, orange-yellow for box tickets, and so on. In the 1980s, Shenzhen railway station became the first in China to sell tickets by computer, and tickets changed to the soft-paper type. In 1997, the Ministry of Railways defined a uniform format for computer-issued tickets. Such tickets are not printed in advance; they are printed on site at the time of sale by a thermal-transfer ticket machine using non-impact printing technology. In July 2007, hard-board train tickets, which had been in use for more than 100 years, were phased out and completely replaced by nationwide networked computer-issued tickets, with stations selling soft-paper tickets. In 2008, magnetic-card train tickets were introduced successively at railway stations in large and medium-sized cities in China.
The magnetic-card train ticket is a single-use ticket with a stiffer ticket face than the soft-paper ticket; a picture of a multiple-unit train is printed on the front, and notices to railway passengers are printed on the back. It is printed on site at the time of sale by a thermal roll ticket machine using non-impact printing technology, and magnetic and thermosensitive information is embedded in the back of the ticket. From 2009, the national railway ticketing system was upgraded: the one-dimensional anti-counterfeiting code at the bottom of the ticket was replaced by a two-dimensional code, giving stronger anti-counterfeiting protection. Both the common red soft-paper ticket and the light-blue magnetic-card ticket were upgraded. In 2011, after real-name ticketing was introduced, the ticket additionally carried a two-dimensional code, the purchaser's name, the identity card number and similar information, with 4 digits of the identity card number replaced by asterisks to protect personal information. Also in 2011, the Beijing-Tianjin intercity railway was the first to trial online ticket sales, marking the entry of mainland China's railway ticketing into the Internet era and serving as a trial run for online ticketing on the Beijing-Shanghai high-speed railway.
From June 2015, a new version of the railway ticket was trialed in some domestic cities. The railway 12306 website published the new ticket style, in which the ticket face was adjusted and the advertisement area was moved out. The new ticket was trialed from June 25 of that year; June 25 to July 31 was a transition period in which old and new tickets coexisted, and from August 1 the new ticket was used exclusively.

The appearance of train tickets has been stable since 2015, and this study targets train tickets of that period. In terms of content, train tickets comprise passenger tickets and supplementary tickets. Passenger tickets cover soft seats and hard seats; supplementary tickets cover express supplements, sleeper tickets, soft-sleeper tickets and the like. To offer concessions to children, students and disabled servicemen, China's railways also sell half-price tickets. In terms of medium, Chinese train tickets include hard-board tickets, soft-paper tickets, magnetic-card tickets, electronic tickets and so on. The ticket face carries a range of information, including the travel section, train number, departure station, seat number, seat class, fare and issuing station. Train numbers are assigned according to the rules of the Ministry of Railways: the direction toward Beijing on any line, from a branch line toward a trunk line, or otherwise designated, is the up direction, and trains in that direction are given even numbers; the direction away from Beijing, from a trunk line toward a branch line, or otherwise designated, is the down direction, and trains in that direction are given odd numbers. In terms of form, there are two main versions, red and blue: the red ticket is pink and is issued at station ticket windows, while the blue ticket is blue and is issued when collecting tickets bought online.

As rail transit increases economic mobility, the demand for automated financial recognition of train tickets has become urgent. Surveys indicate that most enterprises and institutions still prefer high-speed rail for business trips, so railway tickets make up most of transport reimbursement and need to be processed promptly. At present, however, train ticket reimbursement is still managed by traditional manual collection and data entry. Manual entry requires substantial cost and time: it raises operating costs, and its inefficiency means invoice information is not transmitted in a timely and effective way, causing unnecessary outflow of funds and affecting enterprise performance. Once a train ticket scanning and recognition interface is in use, an enterprise can automatically capture invoice data and enter it into its management system as soon as an invoice is generated or received, achieving real-time processing and saving considerable time and cost; this is an important option for the coming era of artificial intelligence.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a layout extraction method for train ticket images that automatically locates and extracts the train ticket region from an invoice image, locates the character line boxes of the ticket-face content and matches them to the layout automatically, thereby providing a way to retrieve the content of the train ticket image for a subsequent character recognition function.

In order to solve the technical problems, the invention adopts the technical scheme that:

a layout extraction method of a train ticket image comprises the following steps:

step 1: locating the train ticket region in the digital image, cropping the train ticket image, and standardizing the scale and gray level of the train ticket, specifically:

1.1) reading the photoelectrically sampled digital image of the invoice uploaded by a scanner;

1.2) preprocessing the image, including denoising and smoothing filtering;

1.3) converting the color image into a gray image;

1.4) calculating the mean gray value Mb of the pixels within a 50-pixel-wide strip at each of the left and right boundaries of the image, and the mean gray value Mc of the pixels within a rectangle at the center of the image whose length and width are both 100 pixels;

1.5) calculating a binary image;

1.6) if Mc obtained in step 1.4) is less than Mb, inverting the binary image obtained in step 1.5) so that black and white are swapped;

1.7) for the black-and-white image, extracting the white regions with a connected-component detection algorithm, and extracting the counterclockwise sequence of boundary points of each white region as the contour of that region;

1.8) simplifying each contour obtained in step 1.7) with a contour simplification algorithm;

1.9) traversing all simplified contours and deleting all non-quadrilateral contours and all concave polygon contours, so that only contours that have 4 vertices and are convex quadrilaterals remain;

1.10) selecting the contour with the largest area from the set of 4-vertex contours as the contour of the train ticket, on the premise that the scanned object is the most prominent object in the image;

1.11) calculating a projective transformation matrix from the quadrilateral vertices of the contour, and applying the projective transformation to the train ticket image to obtain a standardized train ticket image, wherein the standardized train ticket size is given by prior knowledge acquired in advance;

1.12) performing histogram equalization on the standardized train ticket image to standardize its gray levels;
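Steps 1.4) to 1.6) can be illustrated with a minimal sketch. It assumes, following the detailed description, that Mb is the mean gray level of the 50-pixel-wide strips at the left and right image boundaries and that Mc is the mean gray level of a 100 × 100 pixel rectangle at the image center. The function names and the fixed threshold of 128 are illustrative assumptions; the patent does not specify how the binary image of step 1.5) is calculated beyond calling it adaptive.

```python
def mean_gray(gray, rows, cols):
    """Mean gray level over the given row and column indices."""
    values = [gray[r][c] for r in rows for c in cols]
    return sum(values) / len(values)


def border_and_centre_means(gray, border=50, centre=100):
    """Return (Mb, Mc) for a gray image given as a list of pixel rows."""
    h, w = len(gray), len(gray[0])
    # Mb: strips of `border` columns at the left and right edges.
    strip_cols = list(range(min(border, w))) + list(range(max(w - border, 0), w))
    mb = mean_gray(gray, range(h), strip_cols)
    # Mc: a centre x centre rectangle in the middle of the image.
    r0, c0 = max((h - centre) // 2, 0), max((w - centre) // 2, 0)
    mc = mean_gray(gray, range(r0, min(r0 + centre, h)),
                   range(c0, min(c0 + centre, w)))
    return mb, mc


def binarize_with_white_ticket(gray, threshold=128):
    """Binarize, then invert if the centre is darker than the border.

    The fixed threshold is an assumption made for this sketch only.
    """
    binary = [[255 if p >= threshold else 0 for p in row] for row in gray]
    mb, mc = border_and_centre_means(gray)
    if mc < mb:  # ticket darker than background: swap black and white
        binary = [[255 - p for p in row] for row in binary]
    return binary
```

After this step the ticket region is white regardless of whether the ticket was scanned against a lighter or a darker background, which is what the white-region extraction of step 1.7) relies on.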

step 2: performing adaptive layout analysis on the standardized train ticket image, specifically:

2.1) performing character line detection on the standardized invoice image to obtain a set of bounding rectangles of the character lines; the character line detection is performed by a pre-trained YOLO object detection model, whose model file is loaded into memory in advance for fast detection;

2.2) calculating the mean height of the character lines and, using 0.5 times and 1.5 times the mean height as thresholds, deleting character lines that are too small or too large from the character line set, keeping the character lines of suitable size;

2.3) clustering the remaining character lines with the DBSCAN algorithm according to their vertical coordinates, wherein the threshold parameter of the DBSCAN clustering is set to 1 times the mean height;

2.4) sorting the character lines clustered into the same class in ascending order of horizontal coordinate;

2.5) obtaining the character line arrangement rule from the train ticket template, and assigning attributes to the clustered and sorted character lines according to the rule so that layout items correspond to character lines, wherein the train ticket template rule is acquired in advance;

2.6) outputting the layout analysis information.
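Steps 2.2) to 2.4) can be sketched as follows. The sketch assumes boxes are given as (x, y, width, height) tuples, that the "vertical coordinate" is the vertical center of each box, and that the mean height computed in step 2.2) is reused as the clustering threshold; none of these details is stated in the text. It also assumes a DBSCAN minimum cluster size of 1, in which case one-dimensional DBSCAN reduces to splitting the sorted coordinates wherever the gap exceeds the threshold, so no clustering library is needed here.

```python
from statistics import mean


def filter_by_height(boxes, low=0.5, high=1.5):
    """Keep boxes whose height is within [low, high] times the mean height."""
    mean_h = mean(b[3] for b in boxes)
    kept = [b for b in boxes if low * mean_h <= b[3] <= high * mean_h]
    return kept, mean_h


def cluster_rows(boxes, eps):
    """Group boxes into rows by vertical center, then sort each row by x.

    Equivalent to 1-D DBSCAN with min_samples = 1 (an assumption):
    a new row starts whenever the gap to the previous center exceeds eps.
    """
    ordered = sorted(boxes, key=lambda b: b[1] + b[3] / 2)
    rows, last_y = [], None
    for b in ordered:
        y = b[1] + b[3] / 2
        if last_y is None or y - last_y > eps:
            rows.append([])
        rows[-1].append(b)
        last_y = y
    # Step 2.4): left-to-right order inside each row.
    return [sorted(row, key=lambda b: b[0]) for row in rows]


def layout_rows(boxes):
    """Filter by height, then cluster with eps = 1 x mean height."""
    kept, mean_h = filter_by_height(boxes)
    return cluster_rows(kept, eps=mean_h)
```

The rows are returned from top to bottom, each as a left-to-right list of boxes, which is the form the template matching of step 2.5) needs.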

Further, step 1.8) is specifically as follows: each point on the contour is tentatively deleted in turn; if the deletion changes the perimeter of the contour by less than a set threshold, the point is actually deleted; otherwise the point is kept and processing moves to the next point. This is repeated until all points on the contour have been processed.
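The simplification rule can be sketched as below. The sketch treats the contour as a closed polygon given as a list of (x, y) points and measures the effect of a deletion as the perimeter change it causes, that is, the two edges removed minus the chord that replaces them. The function name, the threshold value in the test data and the guard that keeps at least three points are assumptions made for illustration.

```python
from math import dist


def simplify_contour(points, threshold):
    """Drop points whose removal changes the closed perimeter by < threshold."""
    pts = list(points)
    i = 0
    # Guard (assumption): never reduce the contour below a triangle.
    while i < len(pts) and len(pts) > 3:
        prev_pt = pts[i - 1]               # wraps around for i == 0
        next_pt = pts[(i + 1) % len(pts)]
        # Deleting pts[i] replaces two edges with one chord.
        change = (dist(prev_pt, pts[i]) + dist(pts[i], next_pt)
                  - dist(prev_pt, next_pt))
        if change < threshold:
            del pts[i]                     # the next point slides into index i
        else:
            i += 1
    return pts
```

For a ticket outline traced pixel by pixel, nearly collinear points along each edge are removed while the four corners, whose removal would shorten the perimeter noticeably, are kept.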

Further, step 1.10) is specifically as follows: each polygon is scan-converted, and the total number of pixels covered by each contour is counted as its area; the index of the largest area is selected, and the coordinates of the corresponding quadrilateral are read out.
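A minimal sketch of area measurement by pixel counting follows. Instead of an optimized scanline fill, it tests every pixel center in the bounding box against the polygon with the even-odd rule, which yields a pixel count of the same kind but is only practical for small examples; the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Even-odd rule: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge straddles the horizontal line at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def pixel_area(polygon):
    """Number of pixels whose centers lie inside the polygon."""
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    count = 0
    for row in range(int(min(ys)), int(max(ys)) + 1):
        for col in range(int(min(xs)), int(max(xs)) + 1):
            if point_in_polygon(col + 0.5, row + 0.5, polygon):
                count += 1
    return count


def largest_quadrilateral(quads):
    """Return (index, vertices) of the quadrilateral with the largest area."""
    areas = [pixel_area(q) for q in quads]
    index = max(range(len(quads)), key=areas.__getitem__)
    return index, quads[index]
```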

Compared with the prior art, the invention has the beneficial effects that:

1. The invention can adapt to moderate differences in shooting angle and illumination.

2. The method adapts not only to the new train tickets issued after 2015 but also to the old train tickets of 2011-2015, and it has a degree of flexibility that allows it to adapt automatically to subsequent minor revisions of the train ticket.

3. The processing of the invention narrows the image region to be recognized before recognition begins, reducing the computational load of recognition and improving overall efficiency.

4. The method can be extended to other small invoice types.

5. The invention requires no complex mechanical equipment; it makes effective use of existing scanning equipment and extends existing functions with an algorithm module.

Drawings

Fig. 1 is a schematic diagram of a wire-frame template of a train ticket face and a layout thereof.

Fig. 2 is a schematic diagram of a layout analysis process of a train ticket.

Detailed Description

The present invention will be described in further detail with reference to the accompanying drawings and specific embodiments.

The invention aims to provide a layout analysis method for train tickets, to address the low efficiency and frequent errors that arise because the large number of train tickets entered into current financial systems are captured manually. The method adapts to the different appearances of red and blue tickets and to the minor format differences between periods in recent years, and it can process images automatically even with complex scanning backgrounds and tilted scanning angles. Using intelligent digital image processing methods, it lays a technical foundation for more efficient entry of train ticket information, reduces the burden of train ticket scanning and maintenance work, gives users a better experience than the traditional approach, is easier to popularize and apply, and favors the routine, standardized adoption of automatic OCR recognition of train tickets.

The invention uses invoice layout recognition software to provide a layout extraction interface for digital images of train tickets captured by a mobile phone, a high-speed document camera or a digital scanner. The layout analysis software runs in the background and has no human-computer interaction interface. It monitors the scan output directory in the background; when a new, unprocessed image file appears, the software analyzes it and then moves the file to a directory marked as analyzed. All processing is executed automatically in the background of the system, providing a key technical basis for the subsequent automatic recognition of train ticket contents.
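The background workflow described above can be sketched as a single pass over the scan output directory. The directory arguments, the function name and the `analyse` callback are hypothetical; the sketch assumes jpg files and that a file counts as processed once it has been moved.

```python
import shutil
from pathlib import Path


def process_pending(inbox, done, analyse):
    """Analyse every jpg in `inbox`, then move it to `done`.

    `analyse` is a caller-supplied function taking the image path;
    it stands in for the layout analysis module.
    """
    inbox, done = Path(inbox), Path(done)
    done.mkdir(parents=True, exist_ok=True)
    processed = []
    for image in sorted(inbox.glob("*.jpg")):
        analyse(image)
        shutil.move(str(image), str(done / image.name))
        processed.append(image.name)
    return processed
```

A background service could call such a function periodically; files that have already been moved are not seen again, so each image is analyzed once.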

The basic idea of the invention is as follows: a railway ticket layout extraction method based on computer vision is integrated into an intelligent invoice recognition server; it takes mobile phone photos, high-speed document camera images and invoice scanner data as input, and consists mainly of a railway ticket layout analysis algorithm module. The host computer reads the digital invoice image by driving the digital scanner and passes the data to the train ticket layout analysis algorithm module, which performs the following detection steps:

Step one: locating the train ticket region in the digital image, cropping the train ticket image, and standardizing the scale and gray level of the train ticket image, specifically:

1) reading the photoelectrically sampled digital image of the invoice uploaded by a scanner;

2) preprocessing the image, including denoising and smoothing filtering;

3) converting the color image into a gray image;

4) calculating the mean gray value Mb of the pixels within a 50-pixel-wide strip at each of the left and right boundaries of the image, and the mean gray value Mc of the pixels within a rectangle at the center of the image whose length and width are both 100 pixels;

5) calculating a binary image;

6) if Mc obtained in step 4 is less than Mb, inverting the binary image obtained in step 5 so that black and white are swapped;

7) for the black-and-white image, extracting the white regions with a connected-component detection algorithm, and extracting the counterclockwise sequence of boundary points of each white region as the contour of that region;

8) simplifying each contour obtained in step 7 with a contour simplification algorithm. The simplification can be summarized as follows: each point on the contour is tentatively deleted in turn; if the deletion changes the perimeter of the contour by less than a set threshold, the point is actually deleted; otherwise the point is kept and the next point is processed. This is repeated until all points on the contour have been processed;

9) traversing all simplified contours and deleting all non-quadrilateral contours and all concave polygon contours, so that only contours that have 4 vertices and are convex quadrilaterals remain. Whether a contour is a quadrilateral is determined from its number of vertices. To test convexity, two vectors are constructed from the two edges meeting at each vertex, and the sign of their cross product is calculated. For a convex polygon, the cross products at all vertices have the same sign, i.e., all negative or all positive. Contours that clearly cannot be train tickets are thereby excluded;

10) selecting the contour with the largest area from the set of 4-vertex contours as the contour of the train ticket, on the premise that the scanned object is the most prominent object in the image. Specifically, each polygon is scan-converted, and the total number of pixels covered by each contour is counted as its area. The index of the largest area is selected, and the coordinates of the corresponding quadrilateral are read out;

11) calculating a projective transformation matrix from the quadrilateral vertices of the contour, and applying the projective transformation to the train ticket image to obtain a standardized train ticket image, wherein the standardized train ticket size is given by prior knowledge acquired in advance;

12) performing histogram equalization on the standardized train ticket image to standardize its gray levels.
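The convexity test of step 9 can be sketched as follows. The contour is assumed to be a list of (x, y) vertices in traversal order; the function name is hypothetical, and rejecting a zero cross product (three collinear vertices) is an assumption the text does not address.

```python
def is_convex_quadrilateral(contour):
    """True if the contour has 4 vertices and all turns have the same sign."""
    if len(contour) != 4:
        return False
    signs = []
    for i in range(4):
        prev_pt = contour[i - 1]           # wraps around for i == 0
        cur = contour[i]
        next_pt = contour[(i + 1) % 4]
        # Edge vectors meeting at the current vertex.
        e1 = (cur[0] - prev_pt[0], cur[1] - prev_pt[1])
        e2 = (next_pt[0] - cur[0], next_pt[1] - cur[1])
        z = e1[0] * e2[1] - e1[1] * e2[0]  # z component of the cross product
        if z == 0:
            return False                   # degenerate corner (assumption)
        signs.append(z > 0)
    return all(signs) or not any(signs)
```

Because the test only checks that the signs agree, it accepts both clockwise and counterclockwise vertex orders, and it also rejects self-intersecting four-point contours, whose turns alternate in sign.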

Step two: performing adaptive layout analysis on the standardized train ticket image, specifically:

1) performing character line detection on the standardized invoice image to obtain a set of bounding rectangles of the character lines; the character line detection is performed by a pre-trained YOLO object detection model, whose model file is loaded into memory in advance for fast detection;

2) calculating the mean height of the character lines and, using 0.5 times and 1.5 times the mean height as thresholds, deleting character lines that are too small or too large from the character line set, keeping the character lines of suitable size; the deleted character lines are likely to be false detections;

3) clustering the remaining character lines with the DBSCAN algorithm according to their vertical coordinates, wherein the threshold parameter of the DBSCAN clustering is set to 1 times the mean height; character lines clustered into one class belong to the same row of the ticket but were detected as several character lines because of the large gaps between them;

4) sorting the character lines clustered into the same class in ascending order of horizontal coordinate, so that the character lines of one row are arranged from left to right;

5) obtaining the character line arrangement rule from the train ticket template, and assigning attributes to the clustered and sorted character lines according to the rule so that layout items correspond to character lines. The train ticket template rule is acquired in advance and can be briefly described as follows:

First row: ticket number, i.e., the serial number of the train ticket

Second row: departure station, train number and destination station

Third row: Chinese pinyin of the departure station and the destination station

Fourth row: departure time and seat information

Fifth row: fare and seat class

Sixth row: validity period

Seventh row: identity card number and name

Last row: sales information code

In addition, advertisement information and a two-dimensional code may appear between the seventh row and the last row, but since the two-dimensional code is not part of the core content to be recognized by OCR, the template does not take it into account. The rule for matching character lines to the template is therefore:

the first character line of the first row corresponds to the ticket number;

the first character line of the second row corresponds to the departure station, the second character line to the train number, and the third character line to the destination station;

the first character line of the third row corresponds to the Chinese pinyin of the departure station, and the second character line to the Chinese pinyin of the destination station;

the first character line of the fourth row corresponds to the departure time, and the second character line to the seat information;

the first character line of the fifth row corresponds to the fare, and the second character line to the seat class;

the first character line of the sixth row corresponds to the validity period for the train;

the first character line of the seventh row corresponds to the identity card number and the name;

the first character line of the last row corresponds to the sales information code;

if a row contains fewer character lines than the template requires, character lines have been missed in detection and the full output cannot be produced; the feedback information should then prompt the user to capture the image again;

6) outputting layout analysis information;

7) exiting.
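The template matching of step 5 can be sketched as a lookup from row position and within-row position to a field name. The field names are hypothetical labels chosen for this sketch; the input is assumed to be a top-to-bottom list of rows, each a left-to-right list of detected boxes, and any rows between the seventh and the last (advertisement, two-dimensional code) are skipped.

```python
# Field names are illustrative; the order follows the template rule above.
TEMPLATE = [
    ["ticket_number"],
    ["departure_station", "train_number", "destination_station"],
    ["departure_pinyin", "destination_pinyin"],
    ["departure_time", "seat"],
    ["fare", "seat_class"],
    ["validity"],
    ["id_number_and_name"],
]
LAST_ROW = ["sales_code"]


def assign_fields(rows):
    """Map field names to boxes; raise ValueError if detections are missing."""
    if len(rows) < len(TEMPLATE) + 1:
        raise ValueError("too few rows detected; capture the image again")
    fields = {}
    # First seven rows match the template in order; the last row is matched
    # from the bottom so that intermediate rows are ignored.
    pairs = list(zip(TEMPLATE, rows)) + [(LAST_ROW, rows[-1])]
    for names, row in pairs:
        if len(row) < len(names):
            raise ValueError("character line missing; capture the image again")
        for name, box in zip(names, row):
            fields[name] = box
    return fields
```

Raising an error when a row is short corresponds to the rule that an incomplete detection should prompt re-capture rather than produce a partial layout.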

Because the railway ticket layout analysis operates directly on the high-resolution image, its result can be used directly by the subsequent invoice recognition module. The scope of train ticket character recognition is reduced from a search over the full image to a search within designated regions, which greatly reduces computational complexity and speeds up recognition. Although a high-quality layout analysis adds some computation, the improved hit rate still raises the recognition rate overall.

Table 1. Hardware list:

Name	Model
Digital image high-speed document camera	20 megapixels, A3/A4
Display screen	17-inch LCD
Invoice recognition service computer	I7 16G GTX2080Ti
User prompting device	FM buzzer

Description of hardware connection: the invoice image acquisition device is connected to the invoice recognition computer by a USB cable. The computer has the driver and application program of the scanner or high-speed document camera installed. Inside the scanner, the content of the train ticket is converted by analog-to-digital conversion of the optical signal and assembled into a digital image; after scanning is complete, the image is uploaded automatically through the device controller and driver to a designated directory on the invoice recognition computer and stored as a jpg file in time order. The train ticket recognition computer reads the data from storage and calls the layout analysis module to process it. The processing result is likewise stored as a file on the server hard disk for the subsequent character recognition stage.

Claims

1. A layout extraction method of a train ticket image is characterized by comprising the following steps:

1.2) preprocessing the image, including denoising and smoothing filtering;

1.3) converting the color image into a gray image;

1.5) calculating a binary image;

2.6) outputting the layout analysis information.

2. The layout extraction method of a train ticket image as claimed in claim 1, wherein step 1.8) is specifically as follows: each point on the contour is tentatively deleted in turn; if the deletion changes the perimeter of the contour by less than a set threshold, the point is actually deleted; otherwise the point is kept and processing moves to the next point; this is repeated until all points on the contour have been processed.

3. The layout extraction method of a train ticket image as claimed in claim 1, wherein step 1.10) is specifically as follows: each polygon is scan-converted, and the total number of pixels covered by each contour is counted as its area; the index of the largest area is selected, and the coordinates of the corresponding quadrilateral are read out.