CN110543844A - metadata extraction method for government affair metadata PDF file - Google Patents

Metadata extraction method for government affair metadata PDF file

Info

Publication number
CN110543844A
Authority
CN
China
Prior art keywords
text
pdf
metadata
data
image type
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910791805.5A
Other languages
Chinese (zh)
Inventor
昌攀
曹扬
胥月
张鹏翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Division Big Data Research Institute Co Ltd
Original Assignee
Division Big Data Research Institute Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06T IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
    • G06T5/00 Image enhancement or restoration
    • G06T5/73 Deblurring; Sharpening
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/20 Image preprocessing
    • G06V10/26 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion
    • G06V10/267 Segmentation of patterns in the image field; Cutting or merging of image elements to establish the pattern region, e.g. clustering-based techniques; Detection of occlusion by performing operations on regions, e.g. growing, shrinking or watersheds
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V10/00 Arrangements for image or video recognition or understanding
    • G06V10/70 Arrangements for image or video recognition or understanding using pattern recognition or machine learning
    • G06V10/74 Image or video pattern matching; Proximity measures in feature spaces
    • G06V10/75 Organisation of the matching processes, e.g. simultaneous or sequential comparisons of image or video features; Coarse-fine approaches, e.g. multi-scale approaches; using context analysis; Selection of dictionaries
    • G06V10/751 Comparing pixel values or logical combinations thereof, or feature values having positional relevance, e.g. template matching
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/146 Aligning or centring of the image pick-up or image-field
    • G06V30/1475 Inclination or skew detection or correction of characters or of image to be recognised
    • G06V30/1478 Inclination or skew detection or correction of characters or of image to be recognised of characters or characters lines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition
    • G06V30/14 Image acquisition
    • G06V30/148 Segmentation of character regions
    • G06V30/153 Segmentation of character regions using recognition of characters or words
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/418 Document matching, e.g. of document images
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Tourism & Hospitality (AREA)
  • Medical Informatics (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Character Discrimination (AREA)

Abstract

The invention provides a metadata extraction method for government affair metadata PDF files. The method first processes the government affair metadata PDF file so that metadata PDF files of different types can all be handled by OCR; it then uses an OCR character recognition engine to recognize the content of the PDF file; finally, it extracts important information such as the fields and attribute values of the metadata by template matching against government affair metadata information and enters that information into a system. Standard metadata PDF files are thereby extracted automatically, which improves efficiency.

Description

Metadata extraction method for government affair metadata PDF file
Technical Field
The invention relates to a metadata extraction method for a government affair metadata PDF file. It belongs to technical fields such as natural language processing and artificial intelligence, and in particular relates to an OCR-based metadata extraction method for government affair metadata PDF files.
Background
With the steady advance of big data and smart government strategies such as national e-government, digital government and Digital China, government departments increasingly publish policies and regulations, news reports and standard specifications to the public over the network. This has produced a large number of official documents and announcements belonging to the government metadata standard system; according to incomplete statistics, national ministries have released more than 100,000 official documents through open government websites in the last five years. Against this background, extracting the relevant field names and attribute values from such a large number of government affair metadata files, and entering them into a system for automatic comparison, reference and other operations, has become a major challenge.
As the workload of extracting information from government affair metadata standard files grows, correctly extracting the metadata fields and attribute values in these files is very difficult. In a typical standard PDF file the content may be of text type or of picture type, with no uniform standard, which makes automatic machine extraction harder. The data is therefore usually extracted manually and typed into a computer system, but because the standard files are very numerous and contain many metadata entries, this consumes a great deal of manpower and material resources and is inefficient. A metadata extraction method is therefore urgently needed that is accurate and can automatically handle the different types of PDF text; an OCR-based metadata extraction method for government affair metadata PDF files is a feasible solution.
Many OCR-based algorithms for recognizing text in images already exist in the prior art. The open problems are how to integrate OCR recognition effectively so that the different kinds of PDF file found among government affair metadata standard files can all be recognized, and how to extract the main metadata attributes "definition", "English name" and "data type" from the recognized text by template matching.
Disclosure of Invention
To solve the above technical problems, the invention provides a metadata extraction method for a government affair metadata PDF file that can process both text-type and picture-type PDF files.
The invention is realized by the following technical scheme.
The invention provides a metadata extraction method for a government affair metadata PDF file, which comprises the following steps:
Step 1, government affair metadata PDF text: inputting a PDF text, converting the text-type data in the PDF text into image-type data, and obtaining a full-image-type PDF text;
Step 2, OCR character recognition: preprocessing the full-image-type PDF text to obtain its character information data;
Step 3, character template recognition: inputting the character information data, extracting field information data, performing character recognition, and completing metadata extraction.
In Step 1, a PDF text is input, the text-type data in the PDF text is converted into image-type data, the content of the PDF text is recognized by an OCR character recognition engine, and a full-image-type PDF text is obtained.
Step 2 comprises the following steps:
(2.1) text input: inputting the full-image-type PDF text;
(2.2) blur handling: judging whether the full-image-type PDF text is blurred; if it contains a blurred image, processing the blurred image to obtain a clear image and then executing step (2.3); if the image is already clear, executing step (2.3) directly;
(2.3) binarization: binarizing the full-image-type PDF text;
(2.4) denoising: denoising the full-image-type PDF text;
(2.5) inclination correction: judging whether the full-image-type PDF text is inclined; if it is, correcting it with an image inclination correction algorithm to obtain an inclination-free full-image-type PDF text and then proceeding to step (2.6); if it is not inclined, proceeding to step (2.6) directly;
(2.6) character cutting: segmenting the text information in the full-image-type PDF text into single characters;
(2.7) feature extraction: after normalization, extracting character features with a gridding character feature extraction method to obtain a 13-dimensional feature vector;
(2.8) feature matching: matching the extracted 13-dimensional feature vector against the data in a full character set feature library and selecting the character with the highest recognition probability;
(2.9) error correction: performing error correction on the recognized characters; if a character was recognized incorrectly, updating the full character set feature library; if not, proceeding to step (2.10);
(2.10) character information data: obtaining the character information data.
Step 3 comprises the following steps:
(3.1) format conversion: inputting the character information data and converting it into the JSON data format through the OCR character recognition engine;
(3.2) filtering the text information data: filtering out invalid characters in the text information data;
(3.3) substituting "&": treating the character information data as having the four attributes definition, English name, data type and value field, replacing "definition" and "value field" with "&", and splitting at "&" to obtain target paragraphs;
(3.4) traversal: traversing the target paragraphs, checking by fuzzy comparison whether both "English name" and "data type" occur in each target paragraph, and marking the target paragraphs that satisfy this condition as fields to be extracted;
(3.5) segmenting the fields to be extracted: splitting the fields to be extracted at ":" to separate the labels "definition", "English name" and "data type" from the other contents;
(3.6) extracting specific text data: identifying and extracting the other contents of each segment with a target identification method, storing the fields that satisfy the conditions, and discarding field groups whose length is less than 3;
(3.7) JSON data format: repackaging the fields that satisfy the conditions and adding the fields "definition", "name" and "data type" respectively to form a new JSON data format;
(3.8) metadata: completing metadata extraction.
In Step 3, the fields and attribute values in the government affair metadata PDF file are extracted by template matching.
In step (3.2), the invalid characters comprise spaces and illegal characters.
The other contents are the data other than the definition, English name and data type labels.
In step (2.2), the image is converted into the frequency domain and its high-frequency components are extracted to judge whether the full-image-type PDF text is blurred; a blurred image is processed with the high-pass filtering method from image enhancement algorithms.
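The frequency-domain blur judgment of step (2.2) can be illustrated with a minimal NumPy sketch. The patent does not say how the high-frequency components are measured or what threshold is used, so the energy-ratio measure, the function names, `radius_frac`, `threshold` and `gain` below are all illustrative assumptions rather than the claimed method.

```python
import numpy as np


def _high_mask(shape, radius_frac):
    """Boolean mask that is True outside a centred low-frequency disc."""
    h, w = shape
    yy, xx = np.ogrid[:h, :w]
    return np.hypot(yy - h // 2, xx - w // 2) > radius_frac * min(h, w)


def high_frequency_ratio(gray, radius_frac=0.25):
    """Share of spectral energy that lies in the high-frequency region."""
    spectrum = np.fft.fftshift(np.fft.fft2(np.asarray(gray, dtype=float)))
    energy = np.abs(spectrum) ** 2
    total = energy.sum()
    if total == 0:
        return 0.0
    return float(energy[_high_mask(energy.shape, radius_frac)].sum() / total)


def is_blurred(gray, threshold=0.05, radius_frac=0.25):
    """Assumed rule: a page is blurred when little energy is high-frequency."""
    return high_frequency_ratio(gray, radius_frac) < threshold


def high_pass_sharpen(gray, radius_frac=0.25, gain=1.0):
    """Add the high-pass filtered detail back onto the image."""
    gray = np.asarray(gray, dtype=float)
    spectrum = np.fft.fftshift(np.fft.fft2(gray))
    spectrum[~_high_mask(gray.shape, radius_frac)] = 0  # keep high frequencies only
    detail = np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))
    return np.clip(gray + gain * detail, 0, 255)
```

In practice the threshold would have to be tuned on sample pages, since the ratio depends on page layout as well as on sharpness.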
In step (2.5), a rectangular region of the full-image-type PDF text is selected and analysed with a text line tracking algorithm to obtain the inclination angle of the full-image-type PDF text, and the inclination judgment is made according to that angle.
In step (3.2), invalid characters in the text information data are filtered out by an MSER Tree algorithm.
The beneficial effects of the invention are as follows. It overcomes the defects of the prior art, in which extracting the metadata fields and attribute values of government affair metadata PDF files cannot be automated and depends on manual work. Government affair PDF files with text or picture content are recognized automatically by the OCR engine, and the relevant metadata information is then extracted from the recognized content by template matching and entered into the system, which improves the efficiency of metadata entry and reduces the error rate of manual recognition.
Drawings
FIG. 1 is a flow chart of the present invention.
Detailed Description
The technical solution of the present invention is further described below, but the scope of the claimed invention is not limited to what is described.
As shown in FIG. 1, a metadata extraction method for a government affair metadata PDF file includes the following steps:
Step 1, government affair metadata PDF text: inputting a PDF text, converting the text-type data in the PDF text into image-type data, and obtaining a full-image-type PDF text;
Step 2, OCR character recognition: preprocessing the full-image-type PDF text to obtain its character information data;
Step 3, character template recognition: inputting the character information data, extracting field information data, performing character recognition, and completing metadata extraction.
In Step 1, a PDF text is input, the text-type data in the PDF text is converted into image-type data, the content of the PDF text is recognized by an OCR character recognition engine, and a full-image-type PDF text is obtained.
Step 2 comprises the following steps:
(2.1) text input: inputting the full-image-type PDF text;
(2.2) blur handling: converting the image into the frequency domain and extracting its high-frequency components to judge whether the full-image-type PDF text is blurred; if it contains a blurred image, processing the blurred image with the high-pass filtering method from image enhancement algorithms to obtain a clear image and then executing step (2.3); if the image is already clear, executing step (2.3) directly;
(2.3) binarization: binarizing the full-image-type PDF text;
(2.4) denoising: denoising the full-image-type PDF text;
(2.5) inclination correction: selecting a rectangular region of the image and analysing it with a text line tracking algorithm to obtain the inclination angle of the image, then judging from that angle whether the full-image-type PDF text is inclined; if it is, correcting it with an image inclination correction algorithm to obtain an inclination-free full-image-type PDF text and then proceeding to step (2.6); if it is not inclined, proceeding to step (2.6) directly;
(2.6) character cutting: segmenting the text information in the full-image-type PDF text into single characters;
(2.7) feature extraction: extracting character features with a gridding character feature extraction method. The bounding box of the character is normalized to 32 × 16 pixels, the image is divided into 3 × 3 = 9 small grids, and the number of black pixels in each grid is counted to form a 9-dimensional vector. In addition, intersection features are obtained: horizontal and vertical lines are drawn through the character at the one-third positions in the horizontal and vertical directions, and the number of intersections between the character and each line is counted, giving 4 values. Together these form a 13-dimensional feature vector;
(2.8) feature matching: matching the extracted 13-dimensional feature vector against the data in a full character set feature library and selecting the character with the highest recognition probability;
(2.9) error correction: performing error correction on the recognized characters; if a character was recognized incorrectly, updating the full character set feature library; if not, proceeding to step (2.10);
(2.10) character information data: obtaining the character information data.
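Steps (2.7) and (2.8) can be sketched as follows, assuming a binary glyph array in which 1 marks black ink. Several details are assumptions: the nearest-neighbour resize, the reading of "intersections" as background-to-ink transitions along each line, and the use of the smallest Euclidean distance as a stand-in for "highest recognition probability". The `library` dictionary is a hypothetical feature library.

```python
import numpy as np


def normalize_glyph(glyph, height=32, width=16):
    """Nearest-neighbour resize of a binary glyph to height x width."""
    glyph = np.asarray(glyph)
    rows = np.arange(height) * glyph.shape[0] // height
    cols = np.arange(width) * glyph.shape[1] // width
    return glyph[np.ix_(rows, cols)]


def count_crossings(line):
    """Number of background-to-ink transitions along a 1-D pixel line."""
    padded = np.concatenate(([0], line))
    return int(np.sum((padded[1:] == 1) & (padded[:-1] == 0)))


def grid_features(glyph):
    """9 grid ink counts plus 4 line-crossing counts = 13 features."""
    g = normalize_glyph(glyph)
    h, w = g.shape
    row_edges = [0, h // 3, 2 * h // 3, h]
    col_edges = [0, w // 3, 2 * w // 3, w]
    cells = [
        int(g[row_edges[i]:row_edges[i + 1], col_edges[j]:col_edges[j + 1]].sum())
        for i in range(3)
        for j in range(3)
    ]
    crossings = [
        count_crossings(g[h // 3, :]),      # horizontal line at one third
        count_crossings(g[2 * h // 3, :]),  # horizontal line at two thirds
        count_crossings(g[:, w // 3]),      # vertical line at one third
        count_crossings(g[:, 2 * w // 3]),  # vertical line at two thirds
    ]
    return np.array(cells + crossings, dtype=float)


def match_character(features, library):
    """Return the label whose stored vector is closest to `features`."""
    return min(library, key=lambda label: np.linalg.norm(features - library[label]))
```

A real library would hold one or more reference vectors per character, and the 9 count features would probably need scaling relative to the 4 crossing features before distances are compared.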
Step 3 comprises the following steps:
(3.1) format conversion: inputting the character information data and converting it into the JSON data format through the OCR character recognition engine;
(3.2) filtering the text information data: filtering out invalid characters in the text information data by an MSER Tree algorithm;
(3.3) substituting "&": treating the character information data as having the four attributes definition, English name, data type and value field, replacing "definition" and "value field" with "&", and splitting at "&" to obtain target paragraphs;
(3.4) traversal: traversing the target paragraphs, checking by fuzzy comparison whether both "English name" and "data type" occur in each target paragraph, and marking the target paragraphs that satisfy this condition as fields to be extracted;
(3.5) segmenting the fields to be extracted: splitting the fields to be extracted at ":" to separate the labels "definition", "English name" and "data type" from the other contents;
(3.6) extracting specific text data: identifying and extracting the other contents of each segment with a target identification method, storing the fields that satisfy the conditions, and discarding field groups whose length is less than 3;
(3.7) JSON data format: repackaging the fields that satisfy the conditions and adding the fields "definition", "name" and "data type" respectively to form a new JSON data format;
(3.8) metadata: completing metadata extraction.
In Step 3, the fields and attribute values in the government affair metadata PDF file are extracted by template matching.
In step (3.2), the invalid characters comprise spaces and illegal characters.
The other contents are the data other than the definition, English name and data type labels.
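Steps (3.2) to (3.7) can be sketched roughly in Python. This is a simplified illustration under several assumptions: the label spellings are English stand-ins for the labels in the original documents, a regular expression replaces the MSER Tree filtering, an exact substring test replaces the fuzzy comparison, and the function names are hypothetical.

```python
import json
import re

LABELS = ("English name", "Data type")      # labels kept inside a paragraph
CUT_LABELS = ("Definition", "Value field")  # labels replaced by "&"


def extract_metadata(text):
    """Return a list of {"definition", "name", "data type"} records."""
    # (3.2) drop characters assumed to be invalid (anything outside this set)
    text = re.sub(r"[^\w\s:.,;()/\-]", "", text)
    # (3.3) replace the cut labels with "&" and split into target paragraphs
    for label in CUT_LABELS:
        text = text.replace(label, "&")
    records = []
    for paragraph in text.split("&"):
        # (3.4) keep only paragraphs that mention both remaining labels
        if not all(label in paragraph for label in LABELS):
            continue
        # (3.5) split at ":" and strip the label left at the end of each piece
        values = []
        for part in (p.strip() for p in paragraph.split(":")):
            for label in LABELS:
                if part.endswith(label):
                    part = part[: -len(label)].strip()
            if part:
                values.append(part)
        # (3.6) discard groups with fewer than 3 values
        if len(values) < 3:
            continue
        # (3.7) repackage under the three output field names
        records.append(
            {"definition": values[0], "name": values[1], "data type": values[2]}
        )
    return records


def to_json(records):
    """Serialize the records, keeping non-ASCII text readable."""
    return json.dumps(records, ensure_ascii=False, indent=2)
```

For example, the input `"Definition: Date of issue. English name: issueDate Data type: date Value field: YYYY-MM-DD"` yields one record whose `name` is `issueDate`; an entry that lacks a data type is skipped at step (3.4).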
Example 1
As described above, the method processes both text-type and picture-type PDF files, which widens the range of PDF sources from which text can be recognized compared with a conventional OCR engine. It includes the following steps:
Step1.1, converting the text-type PDF file into an image-type PDF file for processing;
Step1.2, judging whether the processed PDF file is blurred and, if so, processing the blurred file;
Step1.3, if the PDF file is not a binary image, binarizing it so that the resulting image can reproduce the original Chinese characters; the key to binarization is selecting the threshold, which is usually expressed by a threshold operator in the form of a ternary function;
Step1.4, denoising the PDF file to improve its clarity and thereby the accuracy of OCR recognition;
Step1.5, if the image of the PDF file is inclined, correcting it with an image inclination correction algorithm (as mentioned in the prior art: https:// closed.content.com/leveller/angle/1094604), so that the subsequent operations receive the inclination-free image text they require;
Step1.6, segmenting the image text into lines by using the blank gaps in it;
Step1.7, after line segmentation is finished, segmenting the text lines into single characters; the left and right boundaries of each character in a line are searched from left to right, and single characters or punctuation marks are cut out;
Step1.8, applying the gridding character feature extraction method: the bounding box is normalized to 32 × 16 pixels; the image is then divided into 3 × 3 = 9 small grids and the number of black pixels in each grid is counted to form a 9-dimensional vector; in addition, intersection features are obtained by drawing horizontal and vertical lines through the character at the one-third positions in the horizontal and vertical directions and counting the intersections between the character and each line, which gives 4 values; a 13-dimensional feature vector is thus obtained;
Step1.9, matching the extracted 13-dimensional feature vector of the character against the data in a full character set feature library and selecting the character with the highest recognition probability;
Step1.10, performing error correction on the recognized characters; if a character was recognized incorrectly, the full character set feature library is updated so that similar recognition errors are avoided;
Step1.11, finally outputting the text information of the PDF file;
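The gap-based segmentation of Step1.6 and Step1.7 is commonly done with projection profiles, and a minimal sketch under that assumption follows. The input is taken to be a binarized, deskewed page array with 1 for ink; the function names are illustrative and do not come from the patent.

```python
import numpy as np


def find_runs(profile):
    """Return (start, end) pairs, end exclusive, of consecutive non-zero entries."""
    runs, start = [], None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i                 # a run of ink begins
        elif not value and start is not None:
            runs.append((start, i))   # a blank gap ends the run
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def segment_characters(binary):
    """Split a page into lines, then each line into character boxes.

    Returns (top, bottom, left, right) boxes in reading order.
    """
    binary = np.asarray(binary)
    boxes = []
    for top, bottom in find_runs(binary.sum(axis=1)):         # blank rows separate lines
        line = binary[top:bottom]
        for left, right in find_runs(line.sum(axis=0)):       # blank columns separate characters
            boxes.append((top, bottom, left, right))
    return boxes
```

A purely gap-based cut can split Chinese characters that consist of separate left and right components, so a production system would likely merge boxes that are narrower than the typical character width.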
Secondly, for the character output of the government affair metadata standard PDF file, unnecessary symbols and spaces in the text are removed with natural language text processing methods; the text is labeled by text substitution and the labeled text is segmented; target text segments are then identified according to the characteristics of the metadata; the target text segments are split at the ":" character to obtain short texts containing the metadata information; finally, the target short texts are extracted, assembled and returned to the system for recording.
Specifically, extracting the corresponding fields and attribute values in the government affair metadata PDF file by template matching comprises the following steps:
Step2.1, the data returned by the OCR character recognition engine contains content other than the text information, so it is converted into the JSON data format to obtain the text information held in the relevant text data fields;
Step2.2, because PDF text recognition cannot currently reach 100% accuracy, the recognized text contains much unnecessary information such as spaces and illegal characters, which must be removed to obtain the processed text information;
Step2.3, a survey of the text of a large number of government affair metadata files shows that a typical metadata file contains the four common attributes definition, English name, data type and value field; to cut the text, "definition" and "value field" are replaced with "&" for further processing;
Step2.4, splitting the text at the "&" symbol, so that each target paragraph holds the definition value together with the English name and data type attributes and their values;
Step2.5, traversing all target paragraphs, checking by fuzzy comparison whether each contains words such as "English name" and "data type", and marking the target paragraphs that satisfy this condition as fields to be extracted;
Step2.6, splitting the fields to be extracted at ":" to separate the labels "definition", "English name" and "data type" from their contents;
Step2.7, identifying and extracting the content of each segment by target identification, storing the fields that satisfy the conditions, and discarding field groups whose length is less than 3;
Step2.8, repackaging the fields that satisfy the conditions and adding the fields "definition", "name" and "data type" respectively to form a JSON data format;
Step2.9, entering the target data in JSON format into the system, thereby achieving metadata entry for the different types of PDF file as a whole.
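The fuzzy comparison of Step2.5 is not specified further in the text. One possible reading, sketched here with Python's standard `difflib`, is to slide a window over the paragraph and accept a label when some window is similar enough, which tolerates single-character OCR errors. The label spellings, the `min_ratio` value and the function names are assumptions.

```python
from difflib import SequenceMatcher


def fuzzy_contains(paragraph, label, min_ratio=0.8):
    """True if some label-sized window of the paragraph resembles the label."""
    paragraph, label = paragraph.lower(), label.lower()
    n = len(label)
    if len(paragraph) < n:
        return SequenceMatcher(None, paragraph, label).ratio() >= min_ratio
    return any(
        SequenceMatcher(None, paragraph[i:i + n], label).ratio() >= min_ratio
        for i in range(len(paragraph) - n + 1)
    )


def is_candidate(paragraph, labels=("English name", "Data type")):
    """A paragraph is a field to be extracted when every label is found."""
    return all(fuzzy_contains(paragraph, label) for label in labels)
```

With these defaults a paragraph such as `"Engl1sh name: deptName Data typ: string"` is still accepted, while a paragraph that holds only a value field is rejected.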
Preferably, the extracted government affair metadata fields and attribute values match those of the original file, and any discrepancies need only slight modification. Compared with manual metadata extraction, this greatly improves extraction efficiency, saves manpower and material resources, and improves working efficiency.
Furthermore, the metadata extraction method for government affair metadata PDF files built in this way can efficiently and automatically extract the metadata fields and attribute values from government affair metadata PDF files of different text types. It removes the heavy workload of manual extraction, makes metadata extraction from government affair metadata PDF files more efficient and intelligent, relieves users of the burden of processing large volumes of government affair metadata attribute values, and improves handling efficiency.
In summary, the method first processes the government affair metadata PDF file so that metadata PDF files of different types can all be handled by OCR; it then uses an OCR character recognition engine to recognize the content of the PDF file; finally, it extracts important information such as the fields and attribute values of the metadata by template matching against government affair metadata information and enters that information into a system. Standard metadata PDF files are thereby extracted automatically, which improves efficiency.

Claims (10)

1. a metadata extraction method for a government affair metadata PDF file is characterized by comprising the following steps: the method comprises the following steps:
(ii) government affairs metadata PDF text: inputting a PDF text, converting text type data in the PDF text into image type data, and acquiring a PDF text of full image type data;
OCR character recognition: preprocessing a full-image type data PDF text to obtain character information data of the full-image type data PDF text;
Thirdly, recognizing the character template: inputting character information data, extracting field information data, performing character recognition, and completing metadata extraction.
2. the metadata extraction method of a government affairs metadata PDF file according to claim 1, wherein: in the first step, a PDF text is input, text type data in the PDF text is converted into image type data, the content of the PDF text is identified through an OCR character recognition engine, and a full image type data PDF text is obtained.
3. The metadata extraction method of a government affairs metadata PDF file according to claim 1, wherein: the second step comprises the following steps:
(2.1) text input: inputting the full image type data PDF text;
(2.2) blur processing: performing blur judgment on the full image type data PDF text; if it contains a blurred image, processing the blurred image to obtain a clear image and then executing step (2.3); if it is already a clear image, executing step (2.3) directly;
(2.3) binarization: performing binarization processing on the full image type data PDF text;
(2.4) denoising: denoising the full image type data PDF text;
(2.5) inclination correction: performing inclination judgment on the full image type data PDF text; if it is inclined, correcting it with an image inclination correction algorithm to obtain an inclination-free full image type data PDF text and then entering step (2.6); if it is not inclined, entering step (2.6) directly;
(2.6) character cutting: segmenting the text information in the full image type data PDF text into single characters;
(2.7) feature extraction: extracting the character features by a gridding character feature extraction method with normalization processing to obtain 13-dimensional feature vectors;
(2.8) feature matching: matching the extracted 13-dimensional feature vector against the data in a full character set feature library, and selecting the character with the maximum recognition probability;
(2.9) error correction processing: performing error correction on the recognized characters; if a recognized character is wrong, updating the full character set feature library; if the recognized characters are correct, entering step (2.10);
(2.10) character information data: acquiring the character information data.
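Steps (2.3), (2.7) and (2.8) can be illustrated with a minimal Python sketch. The claim does not say how the 13 dimensions are built, so the layout used here (a 4 x 3 grid of ink densities plus the overall ink density) is an assumption. The fixed binarization threshold, the nearest-neighbour matching that stands in for "maximum recognition probability", and the names binarize, grid_features, best_match, from_rows and LIBRARY are likewise hypothetical.

```python
import math


def binarize(gray, threshold=128):
    """Map a grayscale image (rows of 0-255 values) to 1 for ink and 0 for background."""
    return [[1 if pixel < threshold else 0 for pixel in row] for row in gray]


def grid_features(glyph, rows=4, cols=3):
    """Return rows * cols cell ink densities plus the overall density.

    With the default 4 x 3 grid this yields a 13-dimensional vector.
    Densities are fractions, so the vector does not depend on glyph size.
    """
    height, width = len(glyph), len(glyph[0])
    features = []
    for r in range(rows):
        for c in range(cols):
            y0, y1 = r * height // rows, (r + 1) * height // rows
            x0, x1 = c * width // cols, (c + 1) * width // cols
            cell = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            features.append(sum(cell) / len(cell) if cell else 0.0)
    ink = sum(sum(row) for row in glyph)
    features.append(ink / (height * width))
    return features


def best_match(features, library):
    """Pick the library character whose feature vector is closest (Euclidean distance)."""
    return min(library, key=lambda ch: math.dist(features, library[ch]))


def from_rows(rows):
    """Build a binary glyph from strings in which '#' marks an ink pixel."""
    return [[1 if ch == "#" else 0 for ch in row] for row in rows]


# A toy two-character feature library (hypothetical templates).
LIBRARY = {
    "I": grid_features(from_rows(["..##.."] * 8)),
    "-": grid_features(from_rows(["......"] * 3 + ["######"] * 2 + ["......"] * 3)),
}
```

A real feature library would hold one vector per character of the full character set. Distance-based matching is only one way to rank candidates.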
4. The metadata extraction method of a government affairs metadata PDF file according to claim 1, wherein: the third step comprises the following steps:
(3.1) format conversion: inputting the character information data, and converting it into a JSON data format through the OCR character recognition engine;
(3.2) filtering the text information data: filtering invalid characters out of the text information data;
(3.3) substitution with &: dividing the character information data into four attributes, namely definition, English name, data type and value field; replacing "definition" and "value field" with &, and segmenting on & to obtain target paragraphs;
(3.4) traversing: traversing the target paragraphs while checking by fuzzy comparison whether the English name and the data type exist in each target paragraph, and marking the target paragraphs that meet the condition as fields to be extracted;
(3.5) segmenting the fields to be extracted: splitting the fields to be extracted on ":" to separate the text of the definition, English name, data type and other contents;
(3.6) extracting specific text data: identifying and extracting the other contents of each split segment by a target identification method, storing the fields that meet the conditions, and discarding field groups with a length of less than 3;
(3.7) JSON data format: reassembling and packaging the fields that meet the conditions, adding the fields 'definition', 'name' and 'data type' respectively, to form a new JSON data format;
(3.8) metadata: completing the metadata extraction.
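Steps (3.3) to (3.7) can be sketched in Python as follows. This is one possible reading of the claim, not the patented implementation. It assumes English labels ("Definition", "English name", "Data type", "Value field"), uses difflib for the fuzzy comparison, and interprets "length less than 3" as fewer than three colon-separated parts. The names fuzzy_contains, strip_trailing_label, extract_fields and SAMPLE are hypothetical.

```python
import json
import re
from difflib import SequenceMatcher

REQUIRED_LABELS = ("English name", "Data type")


def fuzzy_contains(paragraph, label, threshold=0.8):
    """Return True if any label-sized window of the paragraph resembles the label."""
    text, target = paragraph.lower(), label.lower()
    n = len(target)
    return any(
        SequenceMatcher(None, text[i:i + n], target).ratio() >= threshold
        for i in range(max(1, len(text) - n + 1))
    )


def strip_trailing_label(part, label):
    """Remove the next field's label, which trails a value after splitting on ':'."""
    return re.sub(r"\s*" + re.escape(label) + r"\s*$", "", part, flags=re.IGNORECASE)


def extract_fields(text):
    # (3.3) replace the "Definition" and "Value field" labels with "&", then split on "&"
    marked = re.sub(r"Definition|Value field", "&", text)
    paragraphs = [p for p in marked.split("&") if p.strip()]
    records = []
    for paragraph in paragraphs:
        # (3.4) keep only paragraphs that mention both required labels
        if not all(fuzzy_contains(paragraph, label) for label in REQUIRED_LABELS):
            continue
        # (3.5) split on ASCII or full-width colons
        parts = [p.strip() for p in re.split(r"[:：]", paragraph) if p.strip()]
        # (3.6) discard groups with fewer than 3 parts
        if len(parts) < 3:
            continue
        # (3.7) reassemble the parts under the three output keys
        records.append({
            "definition": strip_trailing_label(parts[0], "English name"),
            "name": strip_trailing_label(parts[1], "Data type"),
            "data type": parts[2],
        })
    return records


SAMPLE = (
    "Definition: legal name of the organization English name: orgName "
    "Data type: string Value field: free text "
    "Definition: date of registration English name: regDate "
    "Data type: date Value field: YYYY-MM-DD"
)

if __name__ == "__main__":
    print(json.dumps(extract_fields(SAMPLE), indent=2))
```

The label stripping in this sketch matches labels exactly. OCR output with damaged labels would need the same fuzzy comparison applied there as well.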
5. The metadata extraction method of a government affairs metadata PDF file according to claim 1, wherein: in the third step, the fields and attribute values in the government affairs metadata PDF file are extracted by template matching.
6. The metadata extraction method of a government affairs metadata PDF file according to claim 4, wherein: in step (3.2), the invalid characters comprise spaces and illegal characters.
7. The metadata extraction method of a government affairs metadata PDF file according to claim 4, wherein: the other contents are data other than the definition, English name and data type.
8. The metadata extraction method of a government affairs metadata PDF file according to claim 3, wherein: in step (2.2), the image is converted into the frequency domain and its high-frequency components are extracted to perform the blur judgment on the full image type data PDF text; the blurred image is then processed by the high-pass filtering method of an image enhancement algorithm.
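The frequency-domain blur judgment of step (2.2) can be illustrated with a small Python sketch. The claim gives no decision rule, so the measure used here (the share of non-DC spectral energy above half the Nyquist frequency), the cutoff of 0.5 and the threshold of 0.05 are all assumptions, and the function names are hypothetical. The naive DFT is only suitable for tiny images; a real system would use an FFT library. The high-pass enhancement step is not shown.

```python
import cmath


def dft(seq):
    """Naive one-dimensional discrete Fourier transform."""
    n = len(seq)
    return [
        sum(seq[k] * cmath.exp(-2j * cmath.pi * u * k / n) for k in range(n))
        for u in range(n)
    ]


def dft2(image):
    """Two-dimensional DFT: transform every row, then every column."""
    height, width = len(image), len(image[0])
    row_spectra = [dft(row) for row in image]
    col_spectra = [
        dft([row_spectra[y][x] for y in range(height)]) for x in range(width)
    ]
    return [[col_spectra[x][y] for x in range(width)] for y in range(height)]


def high_frequency_ratio(image, cutoff=0.5):
    """Share of non-DC spectral energy above `cutoff` times the Nyquist frequency."""
    spectrum = dft2(image)
    height, width = len(image), len(image[0])
    high = total = 0.0
    for u in range(height):
        for v in range(width):
            if u == 0 and v == 0:
                continue  # skip the DC term (mean brightness)
            fu = min(u, height - u) / (height / 2)
            fv = min(v, width - v) / (width / 2)
            energy = abs(spectrum[u][v]) ** 2
            total += energy
            if max(fu, fv) > cutoff:
                high += energy
    # A flat image has no meaningful non-DC energy; report no high-frequency content.
    return high / total if total > 1e-9 else 0.0


def is_blurred(image, threshold=0.05):
    """Judge the image blurred when little of its energy sits at high frequencies."""
    return high_frequency_ratio(image) < threshold
```

A sharp edge spreads energy into the higher harmonics, while smoothing the same edge suppresses them, so the ratio drops for blurred pages.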
9. The metadata extraction method of a government affairs metadata PDF file according to claim 3, wherein: in step (2.5), a rectangular region in the full image type data PDF text is selected, the selected rectangular region is analyzed with a text line tracking algorithm to obtain the inclination angle of the full image type data PDF text, and the inclination judgment is made according to that inclination angle.
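The claim does not define the text line tracking algorithm, so the following Python sketch substitutes a simple stand-in: it assumes the selected rectangular region holds a single text line, takes the ink centroid of every column, and fits a least-squares line through those centroids. The tolerance of 0.5 degrees and the names estimate_inclination and is_inclined are hypothetical.

```python
import math


def estimate_inclination(region):
    """Return the tilt in degrees of the text line in a binary region (1 = ink).

    Image rows grow downward, so a positive angle means the line
    descends from left to right.
    """
    points = []
    for x in range(len(region[0])):
        ys = [y for y in range(len(region)) if region[y][x]]
        if ys:
            points.append((x, sum(ys) / len(ys)))  # column centroid
    if len(points) < 2:
        return 0.0  # not enough ink to fit a line
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    # atan2(sxy, sxx) equals atan(slope) because sxx is positive here.
    return math.degrees(math.atan2(sxy, sxx))


def is_inclined(region, tolerance_degrees=0.5):
    """Judge the page inclined when the fitted angle exceeds the tolerance."""
    return abs(estimate_inclination(region)) > tolerance_degrees
```

The estimated angle can then be passed to whatever rotation routine performs the correction in step (2.5).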
10. The metadata extraction method of a government affairs metadata PDF file according to claim 4, wherein: in step (3.2), invalid characters in the text information data are filtered out through an MSER Tree algorithm.
CN201910791805.5A 2019-08-26 2019-08-26 metadata extraction method for government affair metadata PDF file Pending CN110543844A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910791805.5A CN110543844A (en) 2019-08-26 2019-08-26 metadata extraction method for government affair metadata PDF file

Publications (1)

Publication Number Publication Date
CN110543844A true CN110543844A (en) 2019-12-06

Family

ID=68712113

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910791805.5A Pending CN110543844A (en) 2019-08-26 2019-08-26 metadata extraction method for government affair metadata PDF file

Country Status (1)

Country Link
CN (1) CN110543844A (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291672A (en) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 Method and device for combined image text recognition and fuzzy judgment and storage medium
CN111401312A (en) * 2020-04-10 2020-07-10 深圳新致软件有限公司 PDF drawing character recognition method, system and equipment
CN112528984A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Image information extraction method, device, electronic equipment and storage medium
CN113065537A (en) * 2021-06-03 2021-07-02 江苏联著实业股份有限公司 OCR file format conversion method and system based on model optimization
CN113535943A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Medical record classification method and device and data record classification method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110046261A (en) * 2019-04-22 2019-07-23 山东建筑大学 A kind of construction method of the multi-modal bilingual teaching mode of architectural engineering

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
任亚唯著: "《视觉密码技术研究》" *
朱俊波: "纸本馆藏图书目次信息著录分析", 《图书情报工作》 *
武华维: "一种基于 OCR 技术的档案目录数据", 《档案》 *
陈博等: "基于文本挖掘和可视化技术的主题自动标引方法——以《 英雄格萨尔》 为例", 《现代情报》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291672A (en) * 2020-01-22 2020-06-16 广州图匠数据科技有限公司 Method and device for combined image text recognition and fuzzy judgment and storage medium
CN111291672B (en) * 2020-01-22 2023-05-12 广州图匠数据科技有限公司 Combined image text recognition and fuzzy judgment method, device and storage medium
CN111401312A (en) * 2020-04-10 2020-07-10 深圳新致软件有限公司 PDF drawing character recognition method, system and equipment
CN111401312B (en) * 2020-04-10 2024-04-26 深圳新致软件有限公司 PDF drawing text recognition method, system and equipment
CN113535943A (en) * 2020-04-14 2021-10-22 阿里巴巴集团控股有限公司 Medical record classification method and device and data record classification method and device
CN112528984A (en) * 2020-12-18 2021-03-19 平安银行股份有限公司 Image information extraction method, device, electronic equipment and storage medium
CN113065537A (en) * 2021-06-03 2021-07-02 江苏联著实业股份有限公司 OCR file format conversion method and system based on model optimization

Similar Documents

Publication Publication Date Title
CN110543844A (en) metadata extraction method for government affair metadata PDF file
Ray Choudhury et al. An architecture for information extraction from figures in digital libraries
US8965127B2 (en) Method for segmenting text words in document images
CN111027297A (en) Method for processing key form information of image type PDF financial data
CN108363701B (en) Named entity identification method and system
CN110188649B (en) Pdf file analysis method based on tesseract-ocr
US11010543B1 (en) Systems and methods for table extraction in documents
JP2012500428A (en) Segment print pages into articles
CN103995904A (en) Recognition system for image file electronic data
CN109583438B (en) The recognition methods of the text of electronic image and image processing apparatus
CN110866116A (en) Policy document processing method and device, storage medium and electronic equipment
CN112580308A (en) Document comparison method and device, electronic equipment and readable storage medium
Van Phan et al. A nom historical document recognition system for digital archiving
CN113901952A (en) Print form and handwritten form separated character recognition method based on deep learning
CN113326797A (en) Method for converting form information extracted from PDF document into structured knowledge
CN112508000B (en) Method and equipment for generating OCR image recognition model training data
Karanje et al. Survey on text detection, segmentation and recognition from a natural scene images
CN117076455A (en) Intelligent identification-based policy structured storage method, medium and system
CN111966640A (en) Document file identification method and system
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
Ranka et al. Automatic table detection and retention from scanned document images via analysis of structural information
CN112348022A (en) Free-form document identification method based on deep learning
CN112364790A (en) Airport work order information identification method and system based on convolutional neural network
CN110807449A (en) Science and technology project application on-line service terminal
Kurhekar et al. Automated text and tabular data extraction from scanned document images

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination