CN114417820A - Content filtering method for target object - Google Patents


Info

Publication number
CN114417820A
Authority
CN
China
Prior art keywords
target
target document
document
content
elements
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210093056.0A
Other languages
Chinese (zh)
Inventor
金虎杰
陈德全
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Menglang Sustainable Digital Technology Shenzhen Co ltd
Original Assignee
Menglang Sustainable Digital Technology Shenzhen Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Menglang Sustainable Digital Technology Shenzhen Co ltd filed Critical Menglang Sustainable Digital Technology Shenzhen Co ltd
Priority to CN202210093056.0A priority Critical patent/CN114417820A/en
Publication of CN114417820A publication Critical patent/CN114417820A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a content filtering method for a target object and relates to the technical field of document content extraction. It addresses the technical problem that the prior art cannot analyze a PDF document in a targeted manner and therefore cannot quickly extract effective content from it. The method sets target elements matched to the target document, then divides and filters the target document according to those target elements to obtain the target content. After the target document is read, the target elements are set in combination with the type tag of the target document, either manually or automatically through an association relation, so that extraction and filtering requirements in different scenarios can be met and the extracted content better matches user needs. The target elements include paragraphs, chapters, headers, footers and the like; a different combination of target elements is set for target documents with different type tags, and each element is matched with a corresponding processing mode, so that the accuracy of extraction from the target document can be ensured.

Description

Content filtering method for target object
Technical Field
The invention belongs to the field of document content extraction, and relates to a content filtering technology for a target object, in particular to a content filtering method for the target object.
Background
PDF (Portable Document Format) can reproduce every character, image and color of an original document, and is widely used in various fields owing to advantages such as a mature standard and high output quality. However, because a PDF document does not record the logical structure of the document and has no logical elements such as paragraphs and tables, extracting target content from a PDF is difficult.
In the prior art (invention patent with document number CN 111259623A), paragraph structure markers are added to a PDF document, a PDF parsing tool is used to determine the paragraph attribute features of the document, and a final extraction model is built on the degree to which those features influence the paragraph structure, so as to extract the paragraphs of the PDF document. That prior art can only complete paragraph extraction of a PDF document through a neural network model; it cannot analyze the PDF document further and cannot extract effective content from it. Therefore, a content filtering method for a target object is needed.
Disclosure of Invention
The present invention aims to solve at least one of the problems of the prior art. To this end, it provides a content filtering method for a target object, which addresses the technical problem that the prior art cannot analyze a PDF document in a targeted manner and therefore cannot quickly extract effective content from it. The method first reads the target document and sets target elements in combination with the type tag of the target document, then divides and filters the target document according to the target elements to obtain the target content, which can effectively ensure the validity and extraction efficiency of the target content.
To achieve the above object, a first aspect of the present invention provides a content filtering method for a target object, including:
reading a target document, and setting a target element by combining a type label of the target document; the target elements are manually or automatically set and comprise one or more of paragraphs, chapters, headers and footers;
dividing and filtering the target document according to the target elements to obtain target content; the type tag is set through the content attribute of the target document, and the target document is a PDF document.
Preferably, the type tag is set according to a content attribute of the target document, and includes:
when the target document is an enterprise annual report, setting the corresponding type tag as 1;
when the target document is a disclosure report, setting the corresponding type tag to 2;
when the target document is a responsibility report, the corresponding type tag is set to 3.
Preferably, automatically setting the target element according to the type tag of the target document includes:
acquiring the type tag of the target document;
extracting the associated elements according to the type tag and the association relation, and marking the associated elements as target elements; wherein the association relation is set manually.
Preferably, when the target element includes a paragraph, paragraph segmentation is performed on the target document, including:
reading a target document, and dividing the target document into a plurality of storage units in sequence; each storage unit corresponds to a line of text in the target document;
obtaining the divided paragraphs from the plurality of storage units in combination with a paragraph clustering model; wherein the paragraph clustering model is established according to font characteristic parameters.
Preferably, when the target element includes a chapter, performing chapter division on the target document includes:
reading a target document, and splitting the target document into a plurality of pages;
dividing each page into a plurality of grids; wherein the side length of each grid is half of the minimum font size, and the grids are rectangular;
sequentially acquiring the average gray scale of the grids according to the typesetting sequence, deleting the corresponding grids when the average gray scale is 255, and otherwise marking the grids;
re-reading the target document to obtain characters and simultaneously obtaining the positions of the characters; and associating the character position with the marked grid position, and rearranging the characters according to the grid position and the typesetting sequence.
Preferably, when the target element includes a header, determining a header row of the target document includes:
reading a target document, recognizing individual characters, and merging and converting the individual characters to generate pattern strings;
performing similarity calculation on the pattern strings of the first N pages of the target document to determine the header row; wherein N is a positive integer greater than or equal to 10.
Preferably, when the target element includes a footer, determining a footer row of the target document includes:
performing similarity calculation on the pattern strings of the last N pages of the target document to determine the footer row.
Preferably, when the header line and/or the footer line are determined, corresponding content in the target document is automatically filtered.
Compared with the prior art, the invention has the beneficial effects that:
1. after the target document is read, the target elements are set for the target document by combining the type label of the target document, the target elements can be set manually or automatically through the association relationship, the extraction and filtering requirements of the target document under different scenes can be met, and the extracted content can better meet the requirements of users.
2. The target elements in the invention comprise paragraphs, chapters, headers, footers and the like, aiming at the target documents with different types of labels, the combination of different target elements is set, and the corresponding processing mode is matched for each element, thus the accuracy of extracting the target document can be ensured.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a schematic diagram of the working steps of the present invention.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
PDF documents are applied to various fields because of their various outstanding advantages, but because they do not have a logical structure for recording documents and do not have logical elements such as paragraphs and tables, it is difficult to extract target content from PDF.
In the prior art (invention patent with document number CN 111259623A), paragraph structure markers are added to a PDF document, a PDF parsing tool is used to determine the paragraph attribute features of the document, and a final extraction model is built on the degree to which those features influence the paragraph structure, so as to extract the paragraphs of the PDF document. That prior art can only complete paragraph extraction of a PDF document through a neural network model, and non-substantive content such as headers and footers cannot be effectively extracted.
The method of this application first reads the target document and sets the target elements in combination with the type tag of the target document, then divides and filters the target document according to the target elements to obtain the target content, which can effectively ensure the validity and extraction efficiency of the target content.
Referring to fig. 1, a first aspect of the present application provides a content filtering method for a target object, including:
reading a target document, and setting a target element by combining a type label of the target document;
and dividing and filtering the target document according to the target elements to obtain target content.
A chapter in this application does not refer to a particular chapter or section of the running text; it refers to one of several blocks of content that are freely laid out within a page of the target document, such as the company profile and the related figures displayed in a social responsibility report.
In the application, the type tag is set according to the content attribute of the target document, that is, according to the type of the target document: when the target document is an enterprise annual report, the corresponding type tag is set to 1; when the target document is a disclosure report, the corresponding type tag is set to 2; when the target document is a responsibility report, the corresponding type tag is set to 3.
It is understood that the target document in the present application is a PDF document, and may be other documents consistent with the PDF document generation principle.
The target elements in the application are manually or automatically set and comprise one or more of paragraphs, chapters, headers and footers; when the target element is a paragraph, the paragraph division and extraction of the target document are required, when the target element is a chapter, the chapter extraction of the target document is required, and when the target element is a header or a footer, the header or the footer is required to be filtered; it is understood that the target elements may also include other elements such as text, pictures, etc.
In an alternative embodiment, the target elements are manually set for the target document, including:
when the type tag of the target document is 1, the corresponding target element can be set as a paragraph and a chapter;
when the type tag of the target document is 2, the corresponding target elements may be set as a header and a footer;
when the type tag of the target document is 3, the corresponding target element may be set as text, i.e. the content of the text is extracted.
In the embodiment, target elements are manually set for different types of tags, and on the basis of identifying the type tag of the target document, the target elements are set according to the intention of a user, so that the target document is extracted.
In an alternative embodiment, automatically setting the target element according to the type tag of the target document includes:
acquiring a type label of a target document;
and extracting the associated elements according to the type labels and the association relation, and marking the associated elements as target elements.
In this embodiment, once the type tag of the target document is identified, the associated elements (target elements) are looked up directly, and the target document is extracted according to them. The associated elements are preset and stored; their setting is illustrated as follows:
when the type label of the target document is 1, the associated elements corresponding to the type label are paragraphs and chapters;
when the type label of the target document is 2, the associated elements corresponding to the type label are a header and a footer;
and when the type label of the target document is 3, the associated element corresponding to the type label is a text.
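The association relation above can be sketched as a small lookup table. This is only an illustration: the names `DocType`, `ASSOCIATION` and `target_elements` are hypothetical and do not appear in the application, and the optional `manual` argument is an assumed way of modelling the manual setting described earlier.

```python
from enum import IntEnum


class DocType(IntEnum):
    """Type tags as listed in the description (names are illustrative)."""
    ANNUAL_REPORT = 1
    DISCLOSURE_REPORT = 2
    RESPONSIBILITY_REPORT = 3


# Association relation: preset by hand and stored, as the text describes.
ASSOCIATION = {
    DocType.ANNUAL_REPORT: ("paragraph", "chapter"),
    DocType.DISCLOSURE_REPORT: ("header", "footer"),
    DocType.RESPONSIBILITY_REPORT: ("text",),
}


def target_elements(type_tag, manual=None):
    """Return the target elements for a type tag.

    If `manual` is given, it models the manual setting and overrides the
    stored association; otherwise the associated elements are used.
    """
    if manual is not None:
        return tuple(manual)
    return ASSOCIATION[DocType(type_tag)]
```

For example, `target_elements(2)` yields the header and footer elements, while `target_elements(2, manual=["paragraph"])` follows the user's choice instead.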
In one embodiment, when the target element comprises a paragraph, paragraph splitting the target document comprises:
reading a target document, and dividing the target document into a plurality of storage units in sequence;
combining a plurality of storage units with the paragraph clustering model to obtain the divided paragraphs.
One storage unit in this embodiment corresponds to one line of text in the target document. The paragraph clustering model is built from font characteristic parameters (font size, character spacing, font style, line alignment, etc.) combined with a clustering model; for the specific construction process, refer to the blog article (https://blog.csdn.net/weixin_45615071/article/details/108124735) and other prior art.
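The idea of merging line-level storage units into paragraphs by font characteristics can be illustrated with a much simpler stand-in than a trained clustering model. The sketch below uses a greedy rule, which is an assumption and not the model referred to in the application: a line joins the current paragraph when its font size matches, the vertical gap is small, and it is not indented. The `Line` fields and the `gap_factor` and `indent_tol` parameters are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One storage unit: a single line of text with its font features."""
    text: str
    font_size: float
    x0: float   # left edge of the line
    top: float  # distance from the top of the page


def group_paragraphs(lines, gap_factor=1.5, indent_tol=2.0):
    """Greedily merge consecutive lines into paragraphs."""
    paragraphs = []
    current = []
    for line in lines:
        if current:
            prev = current[-1]
            same_font = abs(line.font_size - prev.font_size) < 0.1
            small_gap = (line.top - prev.top) <= gap_factor * prev.font_size
            not_indented = line.x0 <= prev.x0 + indent_tol
            if same_font and small_gap and not_indented:
                current.append(line)
                continue
            # Any feature change closes the current paragraph.
            paragraphs.append(current)
            current = []
        current.append(line)
    if current:
        paragraphs.append(current)
    return [" ".join(l.text for l in p) for p in paragraphs]
```

A real clustering model would weigh these features jointly rather than apply fixed thresholds; the rule here only shows which features drive the decision.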
In one embodiment, when the target element includes a chapter, performing chapter division on the target document includes:
reading a target document, and splitting the target document into a plurality of pages;
dividing each page into a plurality of grids;
sequentially acquiring the average gray scale of the grids according to the typesetting sequence, deleting the corresponding grids when the average gray scale is 255, and otherwise marking the grids;
re-reading the target document to obtain characters and simultaneously obtaining the positions of the characters; and associating the character position with the marked grid position, and rearranging the characters according to the grid position and the typesetting sequence.
The main purpose of this embodiment is to extract the different parts of the same target document. The grids are rectangular, and the side length of each grid is not less than half of the minimum font size in the page; in this embodiment it is set to half of the minimum font size.
The typesetting order in this embodiment is from left to right and from top to bottom.
This embodiment is illustrated by the following example:
step 11: reading the text content of each page of the target document, and regenerating the read content into a picture according to the text position and the page size;
step 12: dividing the picture into a plurality of grids;
step 13: obtaining the average gray scale according to the typesetting sequence, deleting the grid when the color corresponding to the average gray scale is white, otherwise marking the grid;
step 14: and associating the character position with the marked grid position, and rearranging the characters to realize the extraction of chapter content.
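Steps 12 to 14 can be sketched as follows, assuming the rendered page is available as a list of rows of gray values (0 to 255) and each character comes with pixel coordinates. Both representations, and the function names `mark_grids` and `rearrange`, are assumptions made for illustration; the application does not specify them.

```python
def mark_grids(page, cell):
    """Return the (row, col) grid cells whose average gray is not 255 (white)."""
    height, width = len(page), len(page[0])
    marked = set()
    for top in range(0, height, cell):        # top to bottom
        for left in range(0, width, cell):    # left to right
            pixels = [page[y][x]
                      for y in range(top, min(top + cell, height))
                      for x in range(left, min(left + cell, width))]
            if sum(pixels) / len(pixels) != 255:
                marked.add((top // cell, left // cell))
    return marked


def rearrange(chars, marked, cell):
    """Keep characters that fall in marked cells and order them by grid position.

    `chars` is a list of (character, x, y) tuples.
    """
    placed = []
    for ch, x, y in chars:
        pos = (y // cell, x // cell)
        if pos in marked:
            placed.append((pos, x, ch))
    placed.sort()  # row first, then column, then x within the cell
    return "".join(ch for _, _, ch in placed)
```

Here `cell` would be derived from the minimum font size on the page, as described above; grouping adjacent marked cells into separate chapter regions is left out of this sketch.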
In one embodiment, when the target element comprises a header, determining a header row of the target document comprises:
reading a target document, recognizing individual characters, and merging and converting the individual characters to generate pattern strings;
and performing similarity calculation on the pattern strings of the first N pages of the target document to determine a header line.
The similarity of pattern strings (character strings) can be calculated by various methods, such as cosine similarity, matrix similarity and string edit distance. It can be understood that when a certain line is the same across several pages of the target document, that line can be taken as a header or a footer; even if the line contains no characters on a particular page, the filtering of the headers and footers of the target document is not affected.
This embodiment is illustrated by the following example:
step 21: reading a target document, identifying and obtaining a single character;
step 22: merging the single characters according to the rows through the character positions;
step 23: converting the merged characters of each line into a pattern string, for example converting "12 Maotai shares Limited group" into [number] + [character];
step 24: performing line-by-line similarity calculation on the first 10 pages of the target document, and determining the header row with a PageRank-like mechanism;
step 25: after the header row is determined, the header is automatically matched and filtered out when the target document is read.
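Steps 23 and 24 can be sketched as below. The pattern conversion follows the example in step 23; the header decision, however, uses a plain majority vote over the first line of each page instead of the PageRank-like mechanism, which the application does not detail. The function names and the `threshold` parameter are hypothetical.

```python
import re
from collections import Counter


def to_pattern(line):
    """Collapse digit runs to [number] and other non-space runs to [character]."""
    tokens = []
    for run in re.findall(r"\d+|[^\d\s]+", line):
        token = "[number]" if run.isdigit() else "[character]"
        if not tokens or tokens[-1] != token:  # merge adjacent identical tokens
            tokens.append(token)
    return "+".join(tokens)


def find_header_pattern(pages, n=10, threshold=0.8):
    """Return the dominant first-line pattern of the first n pages, or None.

    `pages` is a list of pages, each a list of text lines.
    """
    first_lines = [to_pattern(page[0]) for page in pages[:n] if page]
    if not first_lines:
        return None
    pattern, count = Counter(first_lines).most_common(1)[0]
    return pattern if count / len(first_lines) >= threshold else None
```

Because page numbers collapse to the same `[number]` token, a header whose number changes from page to page still maps to one pattern, which is what makes the comparison across pages workable.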
In one embodiment, when the target element includes a footer, determining a footer row of the target document comprises:
performing similarity calculation on the pattern strings of the last 10 pages of the target document to determine the footer row.
It should be noted that when the similarity of a certain line in 10 pages exceeds the set similarity threshold, a header line or a footer line may be determined; when the target element in the present application includes a plurality of elements, the target document is processed according to the element order.
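The threshold test for footers can be sketched with the standard library. In this illustration, digit runs are masked as a simplified stand-in for the pattern string, and `difflib.SequenceMatcher.ratio()` is swapped in for the similarity measures named above; the function names and the default threshold of 0.8 are assumptions.

```python
import re
from difflib import SequenceMatcher


def mean_similarity(lines):
    """Average pairwise similarity ratio between lines (1.0 means identical)."""
    pairs = [(a, b) for i, a in enumerate(lines) for b in lines[i + 1:]]
    if not pairs:
        return 0.0
    return sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)


def strip_footers(pages, n=10, threshold=0.8):
    """Drop the last line of every page if the last lines of the final n pages
    are similar enough; otherwise return the pages unchanged."""
    # Mask digit runs so that changing page numbers do not lower the score.
    last_lines = [re.sub(r"\d+", "0", page[-1]) for page in pages[-n:] if page]
    if mean_similarity(last_lines) < threshold:
        return pages
    return [page[:-1] for page in pages]
```

The same routine applied to the first line of the first pages would handle headers; only the footer case is shown to keep the sketch short.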
The first core point of this application: after the target document is read, the target elements are set for the target document in combination with its type tag, either manually or automatically through the association relation, so that extraction and filtering requirements for the target document in different scenarios can be met and the extracted content better matches user needs.
The second core point of the application is: the target elements in the application comprise paragraphs, chapters, headers, footers and the like, combinations of different target elements are set for target documents of different types of labels, corresponding processing modes are matched for the elements, and the accuracy of target document extraction can be guaranteed.
Some of the data in the formulas are dimensionless values used for calculation; the formulas are obtained by software simulation over a large amount of collected data so as to approximate the real situation as closely as possible. The preset parameters and preset thresholds in the formulas are set by those skilled in the art according to actual conditions or obtained by simulation over a large amount of data.
The working principle of the invention is as follows:
reading a target document and acquiring a corresponding type label; and setting target elements for the type labels in a manual setting or automatic setting mode.
And calling a processing method corresponding to the target element, and analyzing the target document page by page to acquire target content.
Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims (7)

1. A content filtering method for a target object, comprising:
reading a target document, and setting a target element by combining a type label of the target document; the target elements are manually or automatically set and comprise one or more of paragraphs, chapters, headers and footers;
dividing and filtering the target document according to the target elements to obtain target content; wherein the type tag is set by a content attribute of the target document.
2. The content filtering method for a target object according to claim 1, wherein automatically setting a target element according to a type tag of the target document comprises:
acquiring a type label of a target document;
extracting the associated elements according to the type tag and the association relation, and marking the associated elements as target elements; wherein the association relation is set manually.
3. The method for filtering contents of a target object according to claim 1, wherein when said target element includes a paragraph, paragraph division is performed on a target document, including:
reading a target document, and dividing the target document into a plurality of storage units in sequence; each storage unit corresponds to a line of text in the target document;
obtaining the divided paragraphs from the plurality of storage units in combination with a paragraph clustering model; wherein the paragraph clustering model is established according to font characteristic parameters.
4. The content filtering method for a target object according to claim 1, wherein when the target element includes a chapter, performing chapter division on a target document includes:
reading a target document, and splitting the target document into a plurality of pages;
dividing each page into a plurality of grids; wherein the side length of each grid is half of the minimum font size, and the grids are rectangular;
sequentially acquiring the average gray scale of the grids according to the typesetting sequence, deleting the corresponding grids when the average gray scale is 255, and otherwise marking the grids;
re-reading the target document to obtain characters and simultaneously obtaining the positions of the characters; and associating the character position with the marked grid position, and rearranging the characters according to the grid position and the typesetting sequence.
5. The method of claim 1, wherein when the target element comprises a header, determining a header line of the target document comprises:
reading a target document, recognizing individual characters, and merging and converting the individual characters to generate pattern strings;
performing similarity calculation on the pattern strings of the first N pages of the target document to determine the header line; wherein N is a positive integer greater than or equal to 10.
6. The content filtering method for the target object according to claim 5, wherein when the target element includes a footer, determining a footer line of the target document comprises:
performing similarity calculation on the pattern strings of the last N pages of the target document to determine the footer line.
7. The method of claim 6, wherein corresponding content in the target document is automatically filtered when determining the header line and/or the footer line.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210093056.0A CN114417820A (en) 2022-01-26 2022-01-26 Content filtering method for target object

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210093056.0A CN114417820A (en) 2022-01-26 2022-01-26 Content filtering method for target object

Publications (1)

Publication Number Publication Date
CN114417820A true CN114417820A (en) 2022-04-29

Family

ID=81278008

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210093056.0A Pending CN114417820A (en) 2022-01-26 2022-01-26 Content filtering method for target object

Country Status (1)

Country Link
CN (1) CN114417820A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272970A (en) * 2023-11-22 2023-12-22 太平金融科技服务(上海)有限公司深圳分公司 Document generation method, device, equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117272970A (en) * 2023-11-22 2023-12-22 太平金融科技服务(上海)有限公司深圳分公司 Document generation method, device, equipment and storage medium
CN117272970B (en) * 2023-11-22 2024-03-01 太平金融科技服务(上海)有限公司深圳分公司 Document generation method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN110705223A (en) Footnote recognition and extraction method for multi-page layout document
CN110704570A (en) Continuous page layout document structured information extraction method
CN111627088A (en) Sample automatic generation method for mathematical test paper image recognition
WO2019041442A1 (en) Method and system for structural extraction of figure data, electronic device, and computer readable storage medium
CN110287784B (en) Annual report text structure identification method
CN112199937B (en) Short text similarity analysis method and system, computer equipment and medium thereof
Suryani et al. The handwritten sundanese palm leaf manuscript dataset from 15th century
CN113326797A (en) Method for converting form information extracted from PDF document into structured knowledge
CN114881698A (en) Advertisement compliance auditing method and device, electronic equipment and storage medium
CN110909123A (en) Data extraction method and device, terminal equipment and storage medium
CN112560850A (en) Automatic identity card information extraction and authenticity verification method based on custom template
CN112149401A (en) Document comparison identification method and system based on ocr
CN114417820A (en) Content filtering method for target object
CN112347742B (en) Method for generating document image set based on deep learning
CN112464957A (en) Method and device for acquiring structured data based on unstructured bid document content
CN110489514B (en) System and method for improving event extraction labeling efficiency, event extraction method and system
CN114579796B (en) Machine reading understanding method and device
CN111966640A (en) Document file identification method and system
CN106709437A (en) Improved intelligent processing method for image-text information of scanning copy of early patent documents
CN116403233A (en) Image positioning and identifying method based on digitized archives
CN113127595B (en) Method, device, equipment and storage medium for extracting viewpoint details of research and report abstract
CN115203474A (en) Automatic database classification and extraction technology
CN113987292A (en) Construction method of Chinese wolfberry insect pest cross-modal retrieval data set
CN112990091A (en) Research and report analysis method, device, equipment and storage medium based on target detection
CN117275022A (en) PDF file complex form recognition and structured data-based method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination