CN110543844A

CN110543844A - metadata extraction method for government affair metadata PDF file

Info

Publication number: CN110543844A
Application number: CN201910791805.5A
Authority: CN
Inventors: 昌攀; 曹扬; 胥月; 张鹏翔
Original assignee: Division Big Data Research Institute Co Ltd
Current assignee: Division Big Data Research Institute Co Ltd
Priority date: 2019-08-26
Filing date: 2019-08-26
Publication date: 2019-12-06

Abstract

The invention provides a metadata extraction method of a government affair metadata PDF file, which is used for processing the government affair metadata PDF file so that different types of metadata PDF files can be processed by OCR; then, using an OCR character recognition engine to recognize the content in the PDF file; and finally, extracting important information such as fields, attribute values and the like of the metadata by a template matching method of government affair metadata information, and inputting the information into a system, thereby realizing automatic extraction of standard metadata PDF files and improving the efficiency.

Description

Metadata extraction method for government affair metadata PDF file

Technical Field

The invention relates to a metadata extraction method for government affair metadata PDF files, and in particular to an OCR-based metadata extraction method for such files. It belongs to the technical fields of natural language processing, artificial intelligence and the like.

Background

With the steady advance of big data and smart government strategies such as national e-government, digital government and Digital China, government departments increasingly publish policies, laws, news reports and standard specifications to the public online. This has produced a large number of official documents and announcements belonging to the government metadata standard system; according to incomplete statistics, the number of official documents released by national ministries through open government websites in the last five years exceeds 10 thousand. Against this background, extracting the relevant field names and attribute values from this large body of government affair metadata files, and entering them into a system for automatic comparison, reference and other operations, has become a major challenge.

Government affair metadata standard files must be processed in ever larger numbers, and correctly extracting the metadata fields and attribute values from them is difficult. The content of a typical standard PDF file may be of text type or of picture type, with no uniform standard, which complicates automatic extraction by machine. At present the data is generally extracted manually and typed into a computer system, but because the standard files are so numerous and contain so many metadata entries, this consumes a great deal of manpower and material resources and is inefficient. A highly accurate method that can automatically extract metadata from different types of PDF text is therefore urgently needed, and an OCR-based metadata extraction method for government affair metadata PDF files is a feasible solution.

Many OCR-based algorithms for recognizing image type text already exist in the prior art. The open problems are how to integrate OCR recognition methods effectively so that the different kinds of PDF files that make up the current government affair metadata standard can be recognized, and how to extract the main metadata attributes "definition", "English name" and "data type" from the recognized text information by template matching.

Disclosure of Invention

In order to solve the above technical problems, the invention provides a metadata extraction method for a government affair metadata PDF file that can process both text type and picture type PDF files.

The invention is realized by the following technical scheme.

The invention provides a metadata extraction method of a government affair metadata PDF file, which comprises the following steps:

(1) government affair metadata PDF text: inputting a PDF text, converting the text type data in the PDF text into image type data, and acquiring a full image type data PDF text;

(2) OCR character recognition: preprocessing the full image type data PDF text and recognizing it to obtain the character information data of the full image type data PDF text;

(3) character template recognition: inputting the character information data, extracting the field information data, performing character recognition, and completing metadata extraction.

In the first step, a PDF text is input, the text type data in the PDF text is converted into image type data, the content of the PDF text is identified by an OCR character recognition engine, and a full image type data PDF text is obtained.

The second step comprises the following sub-steps:

(2.1) text input: inputting a full image type data PDF text;

(2.2) blur handling: judging whether the full image type data PDF text is blurred; if it contains a blurred image, processing the blurred image to obtain a clear image and then executing step (2.3); if the full image type data PDF text is already a clear image, executing step (2.3) directly;

(2.3) binarization: carrying out binarization processing on the PDF text of the full-image type data;

(2.4) denoising: denoising the full image type data PDF text;

(2.5) inclination correction: carrying out inclination judgment on the full image type data PDF text, if the full image type data PDF text is inclined, adopting an image inclination correction algorithm to correct, obtaining the inclination-free full image type data PDF text, and entering the step (2.6); if the PDF text of the full image type data is not inclined, the step (2.6) is carried out;

(2.6) character cutting: segmenting text information in a full-image type data PDF text, and segmenting the text information into single characters;

(2.7) feature extraction: extracting the character features by a gridding character feature extraction method through normalization processing to obtain 13-dimensional feature vectors;

(2.8) feature matching: matching the extracted 13-dimensional character feature vector against the data in a full character set feature library, and selecting the character with the maximum recognition probability;

(2.9) error correction processing: carrying out error correction on the recognized characters; if a character was recognized incorrectly, updating the full character set feature library; if not, entering step (2.10);

(2.10) text information data: acquiring the character information data.
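Steps (2.8) and (2.9) can be illustrated with a minimal sketch. The source does not say how the "recognition probability" is computed or how the feature library is updated, so the following are assumptions made only for illustration: Euclidean distance to one reference vector per character, a softmax over negative distances as the probability, and a simple moving-average update. The names `match_character` and `correct_and_update` are hypothetical.

```python
import math


def match_character(feature, library):
    """Return (best_char, probability) for a 13-dimensional feature vector.

    `library` maps each character to one reference vector (an assumption;
    the source only speaks of a "full character set feature library").
    """
    distances = {ch: math.dist(feature, ref) for ch, ref in library.items()}
    nearest = min(distances.values())
    # Softmax over negative distances, shifted by the minimum for stability.
    weights = {ch: math.exp(-(d - nearest)) for ch, d in distances.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


def correct_and_update(feature, predicted, truth, library, rate=0.5):
    """Assumed error-correction rule: when the prediction is wrong, move the
    reference vector of the true character toward the observed feature."""
    if predicted != truth:
        reference = library.get(truth, feature)
        library[truth] = [r + rate * (f - r) for r, f in zip(reference, feature)]
    return library
```

A real feature library would probably hold several samples per character; one reference vector per character keeps the sketch short.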

The third step comprises the following sub-steps:

(3.1) converting the format: inputting character information data, and converting the character information data into a JSON data format through an OCR character recognition engine;

(3.2) filtering the text information data: filtering invalid characters in the text information data;

(3.3) substitution with &: the character information data contains four attributes, namely definition, English name, data type and value field; replacing "definition" and "value field" with "&", and segmenting the text at "&" to obtain target paragraphs;

(3.4) traversing: traversing the target paragraphs, checking by fuzzy comparison whether both the English name and the data type exist in each target paragraph, and marking the target paragraphs that meet the condition as the fields to be extracted;

(3.5) segmenting the field to be extracted: splitting the fields to be extracted at ":" to separate the labels definition, English name and data type from the other contents;

(3.6) extracting specific text data: identifying and extracting other contents of each segmented section by adopting a target identification method, storing fields meeting conditions, and discarding field groups with the length less than 3;

(3.7) JSON data format: packaging the fields meeting the conditions in a reassembling mode, and respectively adding fields of 'definition', 'name' and 'data type' to form a new JSON data format;

(3.8) metadata: and completing metadata extraction.
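Sub-steps (3.2) to (3.7) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the label strings are English stand-ins for the labels in the original documents (written without spaces because step (3.2) removes spaces), a plain substring test stands in for the fuzzy comparison of step (3.4), and the function name `extract_metadata` and the sample text are hypothetical.

```python
import re

# Assumed label spellings after spaces have been filtered out in step (3.2).
LABEL_DEFINITION = "Definition"
LABEL_NAME = "Englishname"
LABEL_TYPE = "Datatype"
LABEL_VALUE = "Valuefield"


def strip_suffix(text, suffix):
    """Remove a trailing label that was glued to the preceding value."""
    return text[: -len(suffix)] if text.endswith(suffix) else text


def extract_metadata(ocr_text):
    # (3.2) normalize full-width colons, then drop spaces and every
    # character that is not a letter, digit, underscore or ":".
    text = ocr_text.replace("：", ":")
    text = re.sub(r"[^\w:]", "", text)
    # (3.3) replace "definition" and "value field" with "&", split at "&".
    for label in (LABEL_DEFINITION, LABEL_VALUE):
        text = text.replace(label, "&")
    records = []
    for paragraph in text.split("&"):
        # (3.4) keep paragraphs that mention both remaining labels
        # (exact match here; the source describes a fuzzy comparison).
        if LABEL_NAME not in paragraph or LABEL_TYPE not in paragraph:
            continue
        # (3.5) split at ":" into definition / name / data type pieces.
        parts = [piece for piece in paragraph.split(":") if piece]
        # (3.6) discard field groups with fewer than 3 entries.
        if len(parts) < 3:
            continue
        # (3.7) reassemble into a JSON-ready record.
        records.append({
            "definition": strip_suffix(parts[0], LABEL_NAME),
            "name": strip_suffix(parts[1], LABEL_TYPE),
            "data type": parts[2],
        })
    return records
```

For example, calling `extract_metadata("Definition: uniqueCitizenCode English name: citizenId Data type: string Value field: 18chars")` yields one record whose three fields can be serialized with `json.dumps`. The sketch would mis-split a definition that itself contains one of the label words.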

In the third step, the fields and the attribute values in the government affair metadata PDF file are extracted by template matching.

In the step (3.2), the invalid characters comprise spaces and illegal characters.

The other contents are: data other than the definition, the English name and the data type.

In the step (2.2), the image is converted into the frequency domain, and the high-frequency components are extracted to judge whether the full image type data PDF text is blurred; a blurred image is processed with a high-pass filtering method from image enhancement algorithms.
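The blur judgment can be sketched as a measurement of how much spectral energy lies at high frequencies. The code below is a minimal pure-Python illustration for small grayscale patches; the cutoff frequency, the decision threshold and the function names are assumptions, and a practical implementation would use an FFT library rather than this naive transform. The high-pass sharpening of blurred images is not shown.

```python
import cmath
import math


def dft2(image):
    """Naive 2-D discrete Fourier transform of a small grayscale image
    given as a list of equally long rows."""
    rows, cols = len(image), len(image[0])
    spectrum = [[0j] * cols for _ in range(rows)]
    for u in range(rows):
        for v in range(cols):
            total = 0j
            for y in range(rows):
                for x in range(cols):
                    angle = -2 * math.pi * (u * y / rows + v * x / cols)
                    total += image[y][x] * cmath.exp(1j * angle)
            spectrum[u][v] = total
    return spectrum


def high_frequency_ratio(image, cutoff=0.25):
    """Share of the non-DC spectral energy above `cutoff`, where the
    frequency is expressed in cycles per pixel."""
    rows, cols = len(image), len(image[0])
    spectrum = dft2(image)
    high = total = 0.0
    for u in range(rows):
        for v in range(cols):
            if u == 0 and v == 0:
                continue  # the DC term only encodes average brightness
            # Fold the index so frequency is measured from zero either way.
            fu = min(u, rows - u) / rows
            fv = min(v, cols - v) / cols
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if math.hypot(fu, fv) > cutoff:
                high += energy
    return high / total if total else 0.0


def is_blurred(image, threshold=0.1):
    """Assumed decision rule: little high-frequency energy means blur."""
    return high_frequency_ratio(image) < threshold
```

A patch with sharp edges, such as a checkerboard, concentrates its energy at high frequencies and is judged clear, while a slowly varying gradient is judged blurred.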

In the step (2.5), a rectangular region in the full image type data PDF text is selected, the rectangular region is analyzed with a text line tracking algorithm to obtain the inclination angle of the full image type data PDF text, and the inclination of the full image type data PDF text is judged according to this angle.
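The source does not specify the text line tracking algorithm. As a simplified stand-in, the sketch below tracks the lowest ink pixel in each column of a rectangular region that contains one text line, fits a least-squares line through those baseline points, and reports its angle. The function names and the one-degree tolerance are assumptions.

```python
import math


def estimate_skew_degrees(region):
    """Estimate the baseline angle of one text line.

    `region` is a binary image (rows of 0/1, 1 = ink). The angle is in
    degrees; positive means the line descends to the right in image
    coordinates, where y grows downward.
    """
    points = []
    for x in range(len(region[0])):
        ink_rows = [y for y in range(len(region)) if region[y][x]]
        if ink_rows:
            points.append((x, max(ink_rows)))  # lowest ink pixel in column
    if len(points) < 2:
        return 0.0
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # The least-squares slope is sxy / sxx; atan2 turns it into an angle.
    return math.degrees(math.atan2(sxy, sxx))


def is_tilted(region, tolerance_degrees=1.0):
    """Assumed decision rule for the inclination judgment."""
    return abs(estimate_skew_degrees(region)) > tolerance_degrees
```

Rotating the page by the negative of the estimated angle would then give the non-inclined image required by step (2.6).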

In the step (3.2), invalid characters in the text information data are filtered out by an MSER Tree algorithm.

The invention has the following beneficial effects: it overcomes the defects of the prior art, in which the extraction of metadata fields and attribute values from government affair metadata PDF files could not be automated and relied on manual work; government affair PDF files with text or picture content are recognized automatically by the OCR engine, and the metadata information is extracted automatically from the recognized content by template matching and entered into the system, which improves the efficiency of metadata entry and reduces the error rate of manual identification.

Drawings

FIG. 1 is a flow chart of the present invention.

Detailed Description

The technical solution of the present invention is further described below, but the scope of the claimed invention is not limited to what is described.

As shown in FIG. 1, a metadata extraction method for a government affair metadata PDF file includes the following steps:

The second step comprises the following sub-steps:

(2.1) text input: inputting a full image type data PDF text;

(2.2) blur handling: converting the image into the frequency domain and extracting the high-frequency components to judge whether the full image type data PDF text is blurred; if it contains a blurred image, processing the blurred image with a high-pass filtering method from image enhancement algorithms to obtain a clear image, and then executing step (2.3); if the full image type data PDF text is already a clear image, executing step (2.3) directly;

(2.4) denoising: denoising the full image type data PDF text;

(2.5) inclination correction: selecting a rectangular area in the image, identifying the rectangular area of the image by using a text line tracking algorithm to obtain an inclination angle of the image, performing inclination judgment on the PDF text of the full image type data according to the inclination angle, if the PDF text of the full image type data is inclined, correcting by using an image inclination correction algorithm to obtain the PDF text of the full image type data without inclination, and entering the step (2.6); if the PDF text of the full image type data is not inclined, the step (2.6) is carried out;

(2.7) feature extraction: extracting the character features with a gridding character feature extraction method: the bounding box of the character is converted to 32 × 16 pixels through normalization, the image is divided into 3 × 3 = 9 small grids, and the number of black pixels in each grid is counted to form a 9-dimensional vector; in addition, intersection feature values are obtained by drawing horizontal and vertical lines through the character at the trisection positions in the horizontal and vertical directions and counting the number of times each line intersects the character, which gives 4 values; together these form a 13-dimensional feature vector;

(2.10) text information data: and acquiring character information data.
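The gridding feature extraction of step (2.7) can be sketched as follows. The source does not say whether 32 × 16 means 32 rows by 16 columns, how the grid borders are rounded, or how an intersection is counted; the sketch assumes 32 rows by 16 columns, integer-division borders, and one intersection per background-to-ink transition. All function names are hypothetical.

```python
def normalize(char_img, height=32, width=16):
    """Nearest-neighbour resize of a binary character image (1 = ink)."""
    src_h, src_w = len(char_img), len(char_img[0])
    return [
        [char_img[y * src_h // height][x * src_w // width] for x in range(width)]
        for y in range(height)
    ]


def count_crossings(line):
    """Number of background-to-ink transitions along one pixel line."""
    crossings, previous = 0, 0
    for pixel in line:
        if pixel and not previous:
            crossings += 1
        previous = pixel
    return crossings


def grid_features(char_img):
    """Return the 13-dimensional vector: 9 grid counts plus 4 crossings."""
    img = normalize(char_img)
    h, w = len(img), len(img[0])
    row_edges = [i * h // 3 for i in range(4)]  # 0, 10, 21, 32
    col_edges = [i * w // 3 for i in range(4)]  # 0, 5, 10, 16
    features = []
    # 9 values: black-pixel count in each cell of the 3 x 3 grid.
    for r in range(3):
        for c in range(3):
            features.append(sum(
                img[y][x]
                for y in range(row_edges[r], row_edges[r + 1])
                for x in range(col_edges[c], col_edges[c + 1])
            ))
    # 2 values: horizontal lines at the two vertical trisection positions.
    for y in row_edges[1:3]:
        features.append(count_crossings(img[y]))
    # 2 values: vertical lines at the two horizontal trisection positions.
    for x in col_edges[1:3]:
        features.append(count_crossings([row[x] for row in img]))
    return features
```

For a character consisting of a single vertical bar in the middle of the box, only the middle column of the grid has non-zero counts, both horizontal lines cross the bar once, and neither vertical line touches it.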

The third step comprises the following sub-steps:

(3.2) filtering the text information data: filtering invalid characters in the text information data through an MSER Tree algorithm;

(3.8) metadata: and completing metadata extraction.

Example 1

As described above, the method processes text type and picture type PDF files at the same time, which, compared with a conventional OCR engine, broadens the sources of text information that can be recognized from PDF files. It includes the following steps:

Step1.1, converting the text type PDF file into an image type PDF file for processing;

Step1.2, judging whether the processed PDF file is fuzzy, and if so, processing the fuzzy file;

Step1.3, if the PDF file is not a binary image, the PDF file needs to be subjected to binarization processing, the obtained image can reproduce the original Chinese characters, and the key of binarization is to select a threshold value which is usually represented by a threshold operator in the form of a ternary function.

Step1.4, denoising the PDF file to improve its definition and thereby the accuracy of OCR recognition;

Step1.5, if the image of the PDF file is inclined, correcting it with an image inclination correction algorithm (as mentioned in the prior art: https://closed.content.com/leveller/angle/1094604) to guarantee the non-inclined image text required by subsequent operations;

Step1.6, segmenting the text by utilizing a certain blank gap between characters in the image text;

Step1.7, after line segmentation is finished, each text line must be segmented into single characters: the left and right boundaries of each character in the line are searched from left to right, and single characters or punctuation marks are cut out;

Step1.8, using a gridding character feature extraction method, the bounding box of the character is converted to 32 × 16 pixels through normalization; the image is then divided into 3 × 3 = 9 small grids, and the number of black pixels in each grid is counted to form a 9-dimensional vector; in addition, intersection feature values are obtained by drawing horizontal and vertical lines through the character at the trisection positions in the horizontal and vertical directions and counting the number of times each line intersects the character, which gives 4 values; together these form a 13-dimensional feature vector;

Step1.9, performing character feature matching on the extracted 13-dimensional feature vector of the character and data in a full word set feature library, and selecting a word with the maximum recognition probability;

Step1.10, performing error correction on the recognized characters; if a character was recognized incorrectly, the full character set feature library is updated so that similar recognition errors are avoided;

Step1.11, finally outputting the text information of the PDF file;
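Step1.3, Step1.6 and Step1.7 can be illustrated with a short sketch. A global mean threshold is used here as a simple stand-in for the threshold operator mentioned in Step1.3, and lines and characters are separated by looking for blank rows and blank columns in projection profiles. The function names are hypothetical, and touching or overlapping characters are not handled.

```python
def binarize(gray, threshold=None):
    """Map a grayscale image to 1 = ink (dark) and 0 = background.

    When no threshold is given, the global mean gray level is used,
    which is an assumption made for this sketch.
    """
    if threshold is None:
        pixels = [p for row in gray for p in row]
        threshold = sum(pixels) / len(pixels)
    return [[1 if p < threshold else 0 for p in row] for row in gray]


def runs(profile):
    """Return (start, end) pairs of consecutive non-zero entries,
    with `end` exclusive."""
    spans, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(profile)))
    return spans


def segment_characters(binary):
    """Split a page into lines at blank rows, then each line into
    characters at blank columns.

    Returns one list per text line; each entry is a character box
    (top, bottom, left, right) with exclusive bottom and right.
    """
    row_profile = [sum(row) for row in binary]
    lines = []
    for top, bottom in runs(row_profile):
        col_profile = [
            sum(binary[y][x] for y in range(top, bottom))
            for x in range(len(binary[0]))
        ]
        lines.append([(top, bottom, left, right) for left, right in runs(col_profile)])
    return lines
```

Each returned box can be cropped from the binary image and passed on to feature extraction.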

Secondly, for the character output of the government affair metadata standard PDF file, unnecessary symbols and spaces in the text are removed with natural language text processing methods; the text is labeled by text substitution and the labeled text is segmented; target text segments are then identified according to the characteristics of the metadata; each target text segment is split at the ":" character to obtain short texts containing metadata information; and finally the target short texts are extracted, assembled and returned to the system for recording.

Specifically, extracting the corresponding fields and attribute values in the government affair metadata PDF file by template matching comprises the following steps:

Step2.1, the data returned by the OCR character recognition engine contains other content besides the text information, so it is converted into a JSON data format from which the text of the relevant text data field is obtained;

Step2.2, because PDF text recognition cannot currently reach 100% accuracy, the recognized text contains a good deal of unnecessary information such as spaces and illegal characters, which must be removed to obtain the processed text information;

Step2.3, according to investigation on text information of a large amount of government affairs metadata, a common metadata file contains four common attributes of definition, English name, data type and value field, and in order to cut the text, the definition and the value field need to be replaced by "&" for further processing;

Step2.4, segmenting the text at the "&" symbol, so that the three commonly used attributes, namely the definition, the English name and the data type, end up together with their values in one target paragraph;

Step2.5, traversing all the target paragraphs, simultaneously carrying out fuzzy comparison on whether the target paragraphs have words such as English names, data types and the like, and marking the target paragraphs meeting the conditions as the fields to be extracted;

Step2.6, splitting the field information to be extracted at ":" to separate the labels definition, English name and data type from their contents;

Step2.7, identifying and extracting the content of each section by target identification, storing the fields that meet the conditions, and discarding field groups with a length of less than 3;

Step2.8, packaging the fields that meet the conditions by reassembly, adding the fields "definition", "name" and "data type" respectively to form a JSON data format;

Step2.9, inputting the target data in the JSON format into the system, and integrally realizing the metadata information input of different types of PDF files.
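The fuzzy comparison of Step2.5 has to tolerate OCR misreadings inside the label words themselves. The source does not name a similarity measure, so the sketch below assumes the standard library's `difflib.SequenceMatcher` and slides a label-sized window over the paragraph. The English label strings, the 0.8 threshold and the function names are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_contains(paragraph, label, min_ratio=0.8):
    """Return True if some label-sized window of the paragraph is
    similar enough to `label` (case-insensitive)."""
    paragraph, label = paragraph.lower(), label.lower()
    size = len(label)
    if len(paragraph) < size:
        return SequenceMatcher(None, paragraph, label).ratio() >= min_ratio
    return any(
        SequenceMatcher(None, paragraph[i:i + size], label).ratio() >= min_ratio
        for i in range(len(paragraph) - size + 1)
    )


def is_target_paragraph(paragraph):
    """A paragraph qualifies when both labels are (approximately) present."""
    return fuzzy_contains(paragraph, "English name") and fuzzy_contains(paragraph, "data type")
```

With these settings a paragraph such as "Eng1ish name: citizenId data tvpe: string" still qualifies despite the two misread characters, while a paragraph that only mentions the value field does not.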

Preferably, the extracted government affair metadata fields and attribute values are largely the same as those in the original file, and only the places that differ need slight modification; compared with manual metadata extraction, this greatly improves extraction efficiency and saves manpower and material resources.

Furthermore, the metadata extraction method constructed in this way can efficiently and automatically extract the metadata fields and attribute values from government affair metadata PDF files of different content types. It removes the heavy workload of manual extraction, makes metadata extraction from government affair metadata PDF files more efficient and intelligent, relieves users of the burden of processing large numbers of metadata attribute values, and improves handling efficiency.

In summary, the method processes the government affair metadata PDF files, so that different types of metadata PDF files can be processed by OCR; then, using an OCR character recognition engine to recognize the content in the PDF file; and finally, extracting important information such as fields, attribute values and the like of the metadata by a template matching method of government affair metadata information, and inputting the information into a system, thereby realizing automatic extraction of standard metadata PDF files and improving the efficiency.

Claims

1. A metadata extraction method for a government affair metadata PDF file, characterized by comprising the following steps:

2. The metadata extraction method of a government affair metadata PDF file according to claim 1, wherein: in the first step, a PDF text is input, the text type data in the PDF text is converted into image type data, the content of the PDF text is identified by an OCR character recognition engine, and a full image type data PDF text is obtained.

3. The metadata extraction method of a government affair metadata PDF file according to claim 1, wherein: the second step comprises the following sub-steps:

(2.1) text input: inputting a full image type data PDF text;

(2.4) denoising: denoising the full image type data PDF text;

(2.10) text information data: and acquiring character information data.

4. The metadata extraction method of a government affair metadata PDF file according to claim 1, wherein: the third step comprises the following sub-steps:

(3.8) metadata: and completing metadata extraction.

5. The metadata extraction method of a government affair metadata PDF file according to claim 1, wherein: in the third step, the fields and the attribute values in the government affair metadata PDF file are extracted by template matching.

6. The metadata extraction method of a government affair metadata PDF file according to claim 4, wherein: in the step (3.2), the invalid characters comprise spaces and illegal characters.

7. The metadata extraction method of a government affairs metadata PDF file according to claim 4, wherein: the other contents are as follows: data other than definitions, english names, data types.

8. The metadata extraction method of a government affair metadata PDF file according to claim 3, wherein: in the step (2.2), the image is converted into the frequency domain, and the high-frequency components are extracted to judge whether the full image type data PDF text is blurred; a blurred image is processed with a high-pass filtering method from image enhancement algorithms.

9. The metadata extraction method of a government affair metadata PDF file according to claim 3, wherein: in the step (2.5), a rectangular region in the full image type data PDF text is selected, the rectangular region is analyzed with a text line tracking algorithm to obtain the inclination angle of the full image type data PDF text, and the inclination of the full image type data PDF text is judged according to this angle.

10. The metadata extraction method of a government affair metadata PDF file according to claim 4, wherein: in the step (3.2), invalid characters in the text information data are filtered out by an MSER Tree algorithm.