CN110210455B

CN110210455B - Printing content formatting extraction method

Info

Publication number: CN110210455B
Application number: CN201910526081.1A
Authority: CN
Inventors: 夏莫戛; 张文静; 甘玉涛; 樊利红
Original assignee: Shijiazhuang Jiehong Technology Co ltd
Current assignee: Shijiazhuang Jiehong Technology Co ltd
Priority date: 2019-06-18
Filing date: 2019-06-18
Publication date: 2022-03-01
Anticipated expiration: 2039-06-18
Also published as: CN110210455A

Abstract

The invention relates to the technical field of document printing, and in particular to a method for formatted extraction of printing content, comprising the following steps: S1, intercepting the printing content of a printing document and converting it into printing elements to generate a printing element set; S2, designing extraction elements according to a sampled printing element set to generate an extraction template; and S3, inputting the printing element set and the extraction template to an extraction engine, which performs the operation and generates a formatted extraction result. The method overcomes the shortcomings of plain-text extraction and can extract content from complex forms flexibly, efficiently and accurately. It supplements and optimizes the OCR approach. It also improves on precise-coordinate extraction: nesting basic extraction elements inside container extraction elements copes with complex form layouts. The visual template design interface reduces design difficulty and improves design efficiency.

Description

Printing content formatting extraction method

Technical Field

The invention relates to the technical field of document printing, in particular to a method for formatting and extracting printing content.

Background

At present, printed output is an indispensable content output mode in various industries, but printed content is suited only to being read by people; it cannot readily be reformatted, which hinders secondary processing of the data. In the current era of big data, there is an urgent need for a way to reformat the printed output of other systems, so that publicly available data can be reused cheaply and efficiently without authorization of a data interface, and so that a basic data acquisition solution is available for applications such as big data computation and artificial intelligence.

There are three main ways of extracting content. First, the plain-text printing content is obtained, and character segmentation and search matching are performed on specific keywords. Second, the printing content is converted entirely into pictures, and the content is extracted using OCR. Third, the printing standard is parsed to obtain the exact content together with its matching coordinate information, and the content is extracted by partitioning on coordinates.

Each of the three modes has advantages and disadvantages. The first mode obtains the underlying data simply, but it cannot extract complex information accurately and is prone to parsing errors on the large amount of nonstandard table data (for example, tables with missing row or column data). The second mode lets the extraction area be defined freely and converts all types of printing content uniformly into pictures for processing, but general-purpose OCR is not very accurate, and higher accuracy and performance can be reached only by training the OCR on big data, which is technically difficult to implement. The third mode yields exact content that needs no recognition and is easy to partition by coordinates, but it is inconvenient for combining scattered data and cannot process data that is originally picture content.

Disclosure of Invention

The invention aims to provide a method for formatted extraction of printing content that solves the difficulty of extracting complex content described in the background. The problems addressed are mainly the following: the number of rows in an extracted form is uncertain and cannot be determined accurately before extraction; form rows differ in size, which affects extraction based on partitioned areas; form data is displayed across pages; interfering information must be removed from the extracted content; the extraction mode must switch flexibly between picture and text for mixed content; and the information to be extracted floats in position and must be located.

In order to achieve the purpose, the invention provides the following technical scheme:

A method for formatted extraction of printing content comprises the following steps:

S1, intercepting the printing content of a printing document and converting it into printing elements (each including the text content, the x and y coordinates of its upper-left corner on the corresponding page, and the height and width of the displayed text), and generating a printing element set (including the name of the printing document, the total number of printed pages, the index number of each page, the height and width of each page, the printing elements contained in each page, and a standalone picture of each page);

S2, designing extraction elements (mainly comprising the extraction element type, the keyword, the extraction range (x and y coordinates, height and width; extraction elements can be nested), and attribute information specific to particular types) according to a sampled printing element set, and generating an extraction template;

and S3, inputting the printing element set and the extraction template to an extraction engine, which performs the operation and generates a formatted extraction result (including all data extracted by the extraction elements, with each keyword and its extracted content forming a key-value pair).
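The data described in step S1 can be pictured as a small set of record types. The following is a minimal sketch only: the class names, field names, units and sample values are assumptions made for illustration, and the patent does not specify an in-memory or on-disk layout.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PrintElement:
    """One piece of printed text plus its bounding box on the page."""
    text: str
    x: float        # x coordinate of the upper-left corner on the page
    y: float        # y coordinate of the upper-left corner on the page
    width: float
    height: float


@dataclass
class PrintPage:
    index: int                          # index number of the page
    width: float
    height: float
    elements: List[PrintElement] = field(default_factory=list)
    picture_path: Optional[str] = None  # standalone picture of this page


@dataclass
class PrintElementSet:
    document_name: str
    pages: List[PrintPage] = field(default_factory=list)

    @property
    def total_pages(self) -> int:
        return len(self.pages)


# Hypothetical sample data: one page with two text runs.
page = PrintPage(index=1, width=210.0, height=297.0, picture_path="page-1.jpg")
page.elements.append(PrintElement("Invoice No.", x=10, y=12, width=30, height=5))
page.elements.append(PrintElement("A-1001", x=45, y=12, width=25, height=5))
element_set = PrintElementSet(document_name="invoice.doc", pages=[page])
print(element_set.total_pages)  # 1
```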

As a further scheme of the invention: in step S2, the extraction template includes an extraction template name, a plurality of extraction elements, and a set of processing scripts; the extraction elements include basic extraction elements or container extraction elements, which may be nested combinations.

As a still further scheme of the invention: the basic extraction elements comprise text extraction elements or bar code extraction elements; the text extraction element comprises an extraction key value and a group of coordinates, the group of coordinates is used for dividing an area relative to the current page and extracting printing elements in the area, and the extraction key value is used for generating a key value pair from the extracted content.
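A text extraction element of this kind amounts to a rectangle test followed by a key-value pairing. The sketch below assumes that a printing element counts as "in the area" only when its bounding box lies entirely inside the rectangle, and that hits are joined in reading order; the patent states neither rule, and all names here are illustrative.

```python
from typing import Dict, List, NamedTuple


class PrintElement(NamedTuple):
    text: str
    x: float
    y: float
    width: float
    height: float


class Region(NamedTuple):
    x: float
    y: float
    width: float
    height: float


def elements_in_region(elements: List[PrintElement], region: Region) -> List[PrintElement]:
    """Keep elements whose bounding box lies entirely inside the region (assumed rule)."""
    right, bottom = region.x + region.width, region.y + region.height
    return [
        e for e in elements
        if e.x >= region.x and e.y >= region.y
        and e.x + e.width <= right and e.y + e.height <= bottom
    ]


def extract_text(key: str, region: Region, elements: List[PrintElement]) -> Dict[str, str]:
    """Join the hits top-to-bottom, left-to-right and pair them with the key."""
    hits = sorted(elements_in_region(elements, region), key=lambda e: (e.y, e.x))
    return {key: " ".join(e.text for e in hits)}


page_elements = [
    PrintElement("Invoice No.", 10, 12, 30, 5),
    PrintElement("A-1001", 45, 12, 25, 5),
    PrintElement("Total", 10, 40, 20, 5),
]
pair = extract_text("invoice_no", Region(40, 10, 40, 10), page_elements)
print(pair)  # {'invoice_no': 'A-1001'}
```

An overlap test instead of full containment would be an equally plausible reading; the choice changes which elements near the rectangle border are picked up.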

As a still further scheme of the invention: the container extraction element comprises a form extraction element; the form extraction element is provided with a plurality of basic text extraction elements, and the coordinates of the text extraction elements are relative to the parent container form extraction element.
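Because child coordinates are relative to the parent form extraction element, a child's area on the page is obtained by offsetting it by the parent's origin. A minimal sketch of that translation follows; the class and function names are hypothetical and the numbers are made up for the example.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class TextExtractionElement:
    key: str
    x: float        # relative to the parent container
    y: float
    width: float
    height: float


@dataclass
class FormExtractionElement:
    key: str
    x: float        # relative to the page
    y: float
    width: float
    height: float
    children: List[TextExtractionElement] = field(default_factory=list)


def absolute_region(form: FormExtractionElement,
                    child: TextExtractionElement) -> Tuple[float, float, float, float]:
    """Translate a child's parent-relative rectangle into page coordinates."""
    return (form.x + child.x, form.y + child.y, child.width, child.height)


form = FormExtractionElement("items", x=20, y=100, width=170, height=60, children=[
    TextExtractionElement("name", x=0, y=0, width=80, height=10),
    TextExtractionElement("amount", x=120, y=0, width=50, height=10),
])
regions = {c.key: absolute_region(form, c) for c in form.children}
print(regions)  # {'name': (20, 100, 80, 10), 'amount': (140, 100, 50, 10)}
```

Keeping child rectangles relative means the whole form can be moved on the page by changing only the container's origin.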

As a still further scheme of the invention: the specific implementation method of step S1 is:

S1-1, converting the printed document into an EMF file by using a formatted virtual printer;

S1-2, analyzing the EMF file, extracting coordinates and contents, and generating a printing element document;

S1-3, analyzing each printed page and converting it into a page picture.

As a still further scheme of the invention: the specific implementation method of step S2 is:

S2-1, processing by using a quick slide printing formatting extraction template design client;

S2-2, importing printing element set sample data;

S2-3, dragging extraction elements into place with the mouse on a visual interface, and setting the related extraction parameters;

S2-4, running a test extraction and checking the extraction results; if they are unsatisfactory, repeating steps S2-2 to S2-4 until the extraction results for a plurality of printing samples in the same format are satisfactory;

S2-5, storing the printing extraction template, uploading the template to a printing formatting extraction server, and binding the printing type.
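The test-and-repeat loop of step S2-4 is described as a manual check in the design client. One way to picture it is a helper that runs a candidate template over several samples of the same format and lists the mismatches. This is a hypothetical sketch: the expected values, the sample shape and the `extract` callable are all assumptions, and the patent does not describe any automated check.

```python
from typing import Callable, Dict, List, Optional, Tuple

# (sample name, sample data, expected key-value pairs); shapes are assumed.
Sample = Tuple[str, dict, Dict[str, str]]
Failure = Tuple[str, str, str, Optional[str]]


def failing_samples(extract: Callable[[dict], Dict[str, str]],
                    samples: List[Sample]) -> List[Failure]:
    """Return (sample name, key, expected, actual) for every mismatch."""
    failures: List[Failure] = []
    for name, data, expected in samples:
        actual = extract(data)
        for key, want in expected.items():
            got = actual.get(key)
            if got != want:
                failures.append((name, key, want, got))
    return failures


# Toy stand-in for "extraction with the template under design".
def toy_extract(data: dict) -> Dict[str, str]:
    return {"invoice_no": data["cells"][0]}


samples = [
    ("sample-a", {"cells": ["A-1001"]}, {"invoice_no": "A-1001"}),
    ("sample-b", {"cells": ["", "A-1002"]}, {"invoice_no": "A-1002"}),
]
failures = failing_samples(toy_extract, samples)
print(failures)  # [('sample-b', 'invoice_no', 'A-1002', '')]
```

A non-empty list corresponds to the "not satisfied" branch: the designer adjusts the extraction elements and tests again.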

As a still further scheme of the invention: the specific implementation method of step S3 is:

S3-1, uploading the generated printing element document and page pictures to the printing formatting extraction server;

S3-2, the printing formatting extraction server retrieves the designed printing extraction template according to the printing type associated with the upload;

and S3-3, the extraction engine automatically performs formatted extraction by operating on these known inputs, and stores the extraction result in a database.
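On the server side, steps S3-1 to S3-3 reduce to a lookup of the template bound to a printing type, a call to the engine, and a write to storage. The sketch below stands in for that flow with an in-memory registry and a list acting as the database; `TemplateRegistry`, `handle_upload` and the stub engine are hypothetical names, not part of the patent.

```python
from typing import Any, Callable, Dict, List


class TemplateRegistry:
    """Maps a printing type to the extraction template bound to it (step S2-5)."""

    def __init__(self) -> None:
        self._by_type: Dict[str, Dict[str, Any]] = {}

    def bind(self, print_type: str, template: Dict[str, Any]) -> None:
        self._by_type[print_type] = template

    def lookup(self, print_type: str) -> Dict[str, Any]:
        if print_type not in self._by_type:
            raise KeyError(f"no extraction template bound to {print_type!r}")
        return self._by_type[print_type]


def handle_upload(print_type: str,
                  element_document: Dict[str, Any],
                  page_pictures: List[str],
                  registry: TemplateRegistry,
                  engine: Callable[[Dict[str, Any], List[str], Dict[str, Any]], Any],
                  database: List[Dict[str, Any]]) -> Any:
    template = registry.lookup(print_type)                      # S3-2
    result = engine(element_document, page_pictures, template)  # S3-3: extract
    database.append({"print_type": print_type, "result": result})  # S3-3: store
    return result


# Stub engine for the example; a real engine would perform the extraction.
def stub_engine(doc: Dict[str, Any], pictures: List[str], template: Dict[str, Any]) -> Any:
    return {"template": template["name"], "pages": len(doc["pages"])}


registry = TemplateRegistry()
registry.bind("invoice", {"name": "invoice-v1", "elements": []})
database: List[Dict[str, Any]] = []
result = handle_upload("invoice", {"pages": [{}]}, ["page-1.jpg"],
                       registry, stub_engine, database)
print(result)  # {'template': 'invoice-v1', 'pages': 1}
```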

As a still further scheme of the invention: in step S3-3, the extraction engine operates as follows:

S3-3-1, traversing all pages, and packaging the printing elements of the current page together with its page picture as the input parameters for the following steps;

S3-3-2, traversing all top-level extraction elements on the current page, and performing the extraction operation:

S3-3-2-1, if the extraction element is a basic extraction element, such as a text extraction element or a bar code extraction element, directly pairing its extraction result with its keyword to form a key-value pair and returning that pair;

S3-3-2-2, if the extraction element is a container extraction element, such as a form extraction element, traversing all of its sub-extraction elements and extracting each, placing the extraction results of the sub-extraction elements in a queue, and pairing that queue with the keyword of the container extraction element to form a key-value pair to return;

S3-3-3, converting all returned key-value pairs into a formatted extraction result in JSON format;

and S3-3-4, passing the formatted extraction result as a parameter to a processing script, which either performs secondary processing or returns the result unchanged.
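Steps S3-3-1 to S3-3-4 can be sketched as a short recursive traversal. This is an illustration under stated assumptions, not the patented engine: templates and pages are plain dictionaries with made-up field names, an element is "in the area" only when fully contained, bar code decoding from the page picture is omitted, and the processing script is modeled as an optional Python callable.

```python
import json
from typing import Any, Callable, Dict, List, Optional, Tuple


def in_region(e: Dict[str, Any], x: float, y: float, w: float, h: float) -> bool:
    """Assumed rule: the element's box must lie entirely inside the area."""
    return (e["x"] >= x and e["y"] >= y
            and e["x"] + e["w"] <= x + w and e["y"] + e["h"] <= y + h)


def extract_element(element: Dict[str, Any], page_elements: List[Dict[str, Any]],
                    origin: Tuple[float, float] = (0.0, 0.0)) -> Tuple[str, Any]:
    ox, oy = origin
    x, y = ox + element["x"], oy + element["y"]
    if element["type"] == "container":            # S3-3-2-2
        queue = []
        for child in element["children"]:         # child coordinates are parent-relative
            key, value = extract_element(child, page_elements, (x, y))
            queue.append({key: value})
        return element["key"], queue
    hits = [e for e in page_elements              # S3-3-2-1
            if in_region(e, x, y, element["w"], element["h"])]
    hits.sort(key=lambda e: (e["y"], e["x"]))
    return element["key"], " ".join(e["text"] for e in hits)


def run_engine(pages: List[Dict[str, Any]], template: Dict[str, Any],
               script: Optional[Callable[[str], str]] = None) -> str:
    results = []
    for page in pages:                            # S3-3-1 (page picture unused here)
        page_result = {}
        for element in template["elements"]:      # S3-3-2: top-level elements
            key, value = extract_element(element, page["elements"])
            page_result[key] = value
        results.append(page_result)
    formatted = json.dumps(results, ensure_ascii=False)    # S3-3-3
    return script(formatted) if script else formatted      # S3-3-4


pages = [{"picture": "page-1.jpg", "elements": [
    {"text": "A-1001", "x": 45, "y": 12, "w": 25, "h": 5},
    {"text": "Widget", "x": 20, "y": 100, "w": 30, "h": 5},
    {"text": "9.50", "x": 140, "y": 100, "w": 20, "h": 5},
]}]
template = {"name": "invoice", "elements": [
    {"type": "text", "key": "invoice_no", "x": 40, "y": 10, "w": 40, "h": 10},
    {"type": "container", "key": "items", "x": 20, "y": 95, "w": 170, "h": 20,
     "children": [
         {"type": "text", "key": "name", "x": 0, "y": 0, "w": 80, "h": 15},
         {"type": "text", "key": "amount", "x": 110, "y": 0, "w": 50, "h": 15},
     ]},
]}
print(run_engine(pages, template))
# [{"invoice_no": "A-1001", "items": [{"name": "Widget"}, {"amount": "9.50"}]}]
```

The queue returned for a container is kept as a list of single-pair objects so that the order of the sub-extraction elements is preserved in the JSON output.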

Compared with the prior art, the invention has the beneficial effects that:

The method for formatted extraction of printing content solves the difficulty of extracting complex content. The problems addressed are mainly the following: the number of rows in an extracted form is uncertain and cannot be determined accurately before extraction; form rows differ in size, which affects extraction based on partitioned areas; form data is displayed across pages; interfering information must be removed from the extracted content; the extraction mode must switch flexibly between picture and text for mixed content; and the information to be extracted floats in position and must be located.

The method overcomes the shortcomings of plain-text extraction and can extract content from complex forms flexibly, efficiently and accurately. It supplements and optimizes the OCR approach: confining OCR to an accurately defined range improves its computational efficiency. It also improves on precise-coordinate extraction: nesting basic extraction elements inside container extraction elements copes with complex form layouts and handles the various difficult cases of extracting form content. The visual template design interface reduces design difficulty and improves design efficiency.

Drawings

FIG. 1 is a block flow diagram of an embodiment of the present invention.

Detailed Description

The technical solution of the present patent will be described in further detail with reference to the following embodiments.

Referring to fig. 1, in an embodiment of the present invention, a method for formatting and extracting print content includes the following steps:

Further, in step S2, the extraction template includes an extraction template name, a plurality of extraction elements, and a set of processing scripts; the extraction elements include basic extraction elements or container extraction elements, which may be nested combinations.

Specifically, the basic extraction element comprises a text extraction element or a barcode extraction element; the text extraction element comprises an extraction key value and a group of coordinates, the group of coordinates is used for dividing an area relative to the current page and extracting printing elements in the area, and the extraction key value is used for generating a key value pair from the extracted content.

Specifically, the container extraction element comprises a form extraction element; the form extraction element is provided with a plurality of basic text extraction elements, and the coordinates of the text extraction elements are relative to the parent container form extraction element.

Specifically, the specific implementation method of step S1 is as follows:

S1-1, converting the printed document into an EMF file by using a formatted virtual printer; specifically, printing is performed with a quick transport formatted virtual printer;

S1-2, analyzing the EMF file, extracting coordinates and contents, and generating a printing element document (a jhcef format file);

S1-3, analyzing each printed page and converting it into a page picture; specifically, the page may be converted into a JPG picture.

Specifically, the specific implementation method of step S2 is as follows:

S2-2, importing printing element set sample data;

Specifically, the specific implementation method of step S3 is as follows:

and S3-3, the extraction engine automatically performs formatted extraction by operating on the known inputs, and stores the extraction result in a database; the document holding the formatted extraction result is in jhcer format.

Further, in step S3-3, the extraction engine operates as follows:

The invention combines the advantages of the existing schemes, using the appropriate combination in the appropriate environment to achieve the best formatted extraction. An extraction template is designed on the basis of printing elements that carry coordinates. The extraction template comprises a plurality of extraction elements and a set of processing scripts. Extraction elements are divided into text extraction elements, form extraction elements and bar code extraction elements. The text extraction element is the most basic extraction element; it comprises a set of coordinates that define an area relative to the current page, from which the printing elements are extracted, and an extraction key value used to generate a key-value pair from the extracted content. The form extraction element is a container extraction element in which multiple basic text extraction elements are placed, their coordinates being relative to the parent form extraction element. Using the visual interface, a user can set up the extraction template conveniently by clicking and dragging with the mouse. The printing elements and the extraction template are then passed to the extraction engine, which computes an extraction result in JSON format.

The method overcomes the shortcomings of plain-text extraction and can extract content from complex forms flexibly, efficiently and accurately. It supplements and optimizes the OCR approach: confining OCR to an accurately defined range improves its computational efficiency. It also improves on precise-coordinate extraction: nesting basic extraction elements inside container extraction elements copes with complex form layouts and handles the various difficult cases of extracting form content. The visual template design interface reduces design difficulty and improves design efficiency.

While the preferred embodiments of the present invention have been described in detail, the present invention is not limited to these embodiments, and those skilled in the art can make various changes within their knowledge without departing from the spirit of the invention.

Claims

1. A method for formatting and extracting print contents is characterized by comprising the following steps:

S1, intercepting and converting printing contents of a printing document into printing elements to generate a printing element set;

S2, designing extraction elements according to the sampled printing element set to generate an extraction template;

S3, inputting a printing element set and an extraction template, and performing operation by using an extraction engine to generate a formatted extraction result;

in step S2, the extraction template includes an extraction template name, a plurality of extraction elements, and a set of processing scripts; the extraction element comprises a container extraction element;

the container extraction element comprises a form extraction element; the form extraction element is provided with a plurality of basic text extraction elements, and the coordinates of the text extraction elements are relative to the parent container form extraction element;

the specific implementation method of step S1 is:

S1-3, analyzing each printed page and converting the page into a page picture;

the specific implementation method of step S2 is:

S2-1, processing by using a printing formatting extraction template design client;

S2-2, importing printing element set sample data;

S2-5, storing the printing extraction template, uploading the printing extraction template to a printing formatting extraction server, and binding the printing type;

the specific implementation method of step S3 is:

S3-3, the extraction engine automatically performs formatted extraction according to the known input information operation, and stores the extraction result in a database;

in step S3-3, the extraction engine operates as follows:

S3-3-2, traversing all extraction elements on the current page, and performing extraction operation;

if the extraction element is a container extraction element, traversing all the sub-extraction elements, extracting, forming a queue by the extraction results of the sub-extraction elements, and forming a key value pair to return by matching with the key words of the container extraction element;