CN114417820A - Content filtering method for target object

Info

Publication number: CN114417820A
Application number: CN202210093056.0A
Authority: CN
Inventors: 金虎杰; 陈德全
Original assignee: Menglang Sustainable Digital Technology Shenzhen Co ltd
Current assignee: Menglang Sustainable Digital Technology Shenzhen Co ltd
Priority date: 2022-01-26
Filing date: 2022-01-26
Publication date: 2022-04-29

Abstract

The invention discloses a content filtering method for a target object and relates to the technical field of document content extraction. It addresses the technical problem that the prior art cannot analyze a PDF document in a targeted manner and therefore cannot quickly extract effective content from it. The method sets target elements for a target document, then divides and filters the target document according to those target elements to obtain target content. After the target document is read, the target elements are set in combination with the type tag of the target document; they can be set manually or automatically through an association relationship, so that extraction and filtering requirements for target documents in different scenarios can be met and the extracted content better matches user needs. The target elements include paragraphs, chapters, headers, footers and the like; for target documents with different type tags, different combinations of target elements are set and a corresponding processing mode is matched to each element, so that the accuracy of extraction from the target document can be ensured.

Description

Content filtering method for target object

Technical Field

The invention belongs to the field of document content extraction, and relates to a content filtering technology for a target object, in particular to a content filtering method for the target object.

Background

PDF (Portable Document Format) can reproduce each character, image, and corresponding color of an original document, and is widely used in various fields owing to advantages such as a mature standard and high output quality. However, because a PDF document does not record the logical structure of the document and has no logical elements such as paragraphs and tables, extracting target content from a PDF is difficult.

In the prior art (invention patent with document number CN 111259623A), paragraph structure markers are added to a PDF document, a PDF parsing tool is used to determine the paragraph attribute features of the PDF document, and a final extraction model is constructed based on the degree to which the paragraph attribute features influence the paragraph structure, so as to extract paragraphs from the PDF document. The prior art can only complete paragraph extraction from a PDF document through a neural network model; it cannot analyze the PDF document further and cannot extract effective content from it. Therefore, a content filtering method for a target object is needed.

Disclosure of Invention

The present invention aims to solve at least one of the problems of the prior art. To this end, the invention provides a content filtering method for a target object, which addresses the technical problem that the prior art cannot analyze a PDF document in a targeted manner and therefore cannot quickly extract effective content from it. The method first reads the target document, sets target elements in combination with the type tag of the target document, and then divides and filters the target document according to the target elements to obtain the target content, which effectively guarantees the validity of the target content and the efficiency of its extraction.

To achieve the above object, a first aspect of the present invention provides a content filtering method for a target object, including:

reading a target document, and setting a target element by combining a type label of the target document; the target elements are manually or automatically set and comprise one or more of paragraphs, chapters, headers and footers;

dividing and filtering the target document according to the target elements to obtain target content; the type tag is set through the content attribute of the target document, and the target document is a PDF document.

Preferably, the type tag is set according to a content attribute of the target document, and includes:

when the target document is an enterprise annual report, setting the corresponding type tag as 1;

when the target document is a disclosure report, setting the corresponding type tag to 2;

when the target document is a responsibility report, the corresponding type tag is set to 3.

Preferably, the automatically setting the target element according to the type tag of the target document includes:

acquiring a type label of a target document;

extracting the associated elements according to the type tag and the association relationship, and marking the associated elements as target elements; wherein the association relationship is set manually.

Preferably, when the target element includes a paragraph, paragraph segmentation is performed on the target document, including:

reading a target document, and dividing the target document into a plurality of storage units in sequence; each storage unit corresponds to a line of text in the target document;

merging the plurality of storage units by means of a paragraph clustering model to obtain the divided paragraphs; wherein the paragraph clustering model is established according to font characteristic parameters.

Preferably, when the target element includes a chapter, performing chapter division on the target document includes:

reading a target document, and splitting the target document into a plurality of pages;

dividing each page into a plurality of grids; wherein the side length of each grid is half of the minimum font size, and the grids are rectangular;

sequentially acquiring the average gray scale of the grids according to the typesetting sequence, deleting the corresponding grids when the average gray scale is 255, and otherwise marking the grids;

re-reading the target document to obtain characters and simultaneously obtaining the positions of the characters; and associating the character position with the marked grid position, and rearranging the characters according to the grid position and the typesetting sequence.

Preferably, when the target element includes a header, determining a header row of the target document includes:

reading a target document, recognizing individual characters, and merging and converting the individual characters to generate pattern strings;

performing similarity calculation on the pattern strings of the first N pages of the target document to determine header rows; wherein N is a positive integer greater than or equal to 10.

Preferably, when the target element includes a footer, determining a footer row of the target document includes:

performing similarity calculation on the pattern strings of the last N pages of the target document to determine a footer row.

Preferably, when the header line and/or the footer line are determined, corresponding content in the target document is automatically filtered.

Compared with the prior art, the invention has the beneficial effects that:

1. After the target document is read, target elements are set for it in combination with its type tag; the target elements can be set manually or automatically through the association relationship, so that extraction and filtering requirements for target documents in different scenarios can be met and the extracted content better matches user needs.

2. The target elements in the invention comprise paragraphs, chapters, headers, footers and the like, aiming at the target documents with different types of labels, the combination of different target elements is set, and the corresponding processing mode is matched for each element, thus the accuracy of extracting the target document can be ensured.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.

FIG. 1 is a schematic diagram of the working steps of the present invention.

Detailed Description

The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

PDF documents are applied to various fields because of their various outstanding advantages, but because they do not have a logical structure for recording documents and do not have logical elements such as paragraphs and tables, it is difficult to extract target content from PDF.

In the prior art (invention patent with document number CN 111259623A), paragraph structure markers are added to a PDF document, a PDF parsing tool is used to determine the paragraph attribute features of the PDF document, and a final extraction model is constructed based on the degree to which the paragraph attribute features influence the paragraph structure, so as to extract paragraphs from the PDF document. The prior art can only complete paragraph extraction from a PDF document through a neural network model, and cannot effectively extract headers, footers and other non-substantive content.

According to the method, the target document is read firstly, the target elements are set by combining the type tags of the target document, then the target document is divided and filtered according to the target elements, the target content is obtained, and the effectiveness and the extraction efficiency of the target content can be effectively guaranteed.

Referring to FIG. 1, a first aspect of the present application provides a content filtering method for a target object, including:

reading a target document, and setting a target element by combining a type label of the target document;

and dividing and filtering the target document according to the target elements to obtain target content.

A chapter in this application does not refer to a particular chapter or section of the body text, but to multiple pieces of content that are freely laid out within a page of the target document, such as the company profile and related figures displayed in a social responsibility report.

In this application, the type tag is set according to the content attribute of the target document, that is, according to the type of the target document: when the target document is an enterprise annual report, the corresponding type tag is set to 1; when the target document is a disclosure report, the corresponding type tag is set to 2; when the target document is a responsibility report, the corresponding type tag is set to 3.

It is understood that the target document in the present application is a PDF document, and may also be another kind of document generated on the same principle as a PDF document.

The target elements in the application are manually or automatically set and comprise one or more of paragraphs, chapters, headers and footers; when the target element is a paragraph, the paragraph division and extraction of the target document are required, when the target element is a chapter, the chapter extraction of the target document is required, and when the target element is a header or a footer, the header or the footer is required to be filtered; it is understood that the target elements may also include other elements such as text, pictures, etc.

In an alternative embodiment, the target elements are manually set for the target document, including:

when the type tag of the target document is 1, the corresponding target element can be set as a paragraph and a chapter;

when the type tag of the target document is 2, the corresponding target elements may be set as a header and a footer;

when the type tag of the target document is 3, the corresponding target element may be set as text, i.e. the content of the text is extracted.

In the embodiment, target elements are manually set for different types of tags, and on the basis of identifying the type tag of the target document, the target elements are set according to the intention of a user, so that the target document is extracted.

In an alternative embodiment, automatically setting the target element according to the type tag of the target document includes:

acquiring a type label of a target document;

and extracting the associated elements according to the type labels and the association relation, and marking the associated elements as target elements.

In this embodiment, once the type tag of the target document is identified, the associated elements (that is, the target elements) are looked up directly and the target document is extracted according to them. The associated elements are preset and stored; their settings are as follows:

when the type label of the target document is 1, the associated elements corresponding to the type label are paragraphs and chapters;

when the type label of the target document is 2, the associated elements corresponding to the type label are a header and a footer;

and when the type label of the target document is 3, the associated element corresponding to the type label is a text.
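
The association between type tags and target elements listed above can be sketched as a small lookup table. This is a minimal illustration in Python, not the applicant's implementation; the names `TypeTag`, `ASSOCIATION`, and `target_elements` are hypothetical, and the element names are plain strings chosen for the example.

```python
from enum import IntEnum


class TypeTag(IntEnum):
    """Type tags as numbered in the description."""
    ANNUAL_REPORT = 1
    DISCLOSURE_REPORT = 2
    RESPONSIBILITY_REPORT = 3


# Preset, manually maintained association relationship (hypothetical name).
ASSOCIATION = {
    TypeTag.ANNUAL_REPORT: ("paragraph", "chapter"),
    TypeTag.DISCLOSURE_REPORT: ("header", "footer"),
    TypeTag.RESPONSIBILITY_REPORT: ("text",),
}


def target_elements(type_tag, manual=None):
    """Return manually chosen elements if given, else the associated ones."""
    if manual is not None:
        return tuple(manual)               # manual setting takes precedence
    return ASSOCIATION[TypeTag(type_tag)]  # automatic setting via the table
```

Calling `target_elements(1)` follows the automatic path, while `target_elements(2, manual=["footer"])` models a user overriding the preset association.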

In one embodiment, when the target element comprises a paragraph, paragraph splitting the target document comprises:

reading a target document, and dividing the target document into a plurality of storage units in sequence;

combining a plurality of storage units with the paragraph clustering model to obtain the divided paragraphs.

One storage unit in this embodiment corresponds to one line of text in the target document. The paragraph clustering model is built from font characteristic parameters: specifically, the font characteristic parameters (font size, character spacing, font style, line alignment, etc.) are combined with a clustering model. For the specific building process, refer to the blog article (https://blog.csdn.net/weixin_45615071/article/details/108124735) and other prior art.
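
The idea of merging line-level storage units into paragraphs from font features can be illustrated with a simple rule-based grouping. This sketch is a stand-in for the paragraph clustering model, which the description does not specify; the `Line` fields, the thresholds `gap_ratio` and `indent_tol`, and the function name `group_paragraphs` are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Line:
    """One storage unit: a single line of text with layout features."""
    text: str
    font_size: float
    top: float   # y coordinate of the line, increasing down the page
    left: float  # x coordinate of the first character


def group_paragraphs(lines, gap_ratio=1.5, indent_tol=1.0):
    """Merge consecutive lines that share a font size, sit close together
    vertically, and are not indented relative to the previous line."""
    paragraphs = []
    for line in lines:
        if paragraphs:
            prev = paragraphs[-1][-1]
            same_font = abs(line.font_size - prev.font_size) < 0.5
            close = (line.top - prev.top) <= gap_ratio * prev.font_size
            not_indented = line.left <= prev.left + indent_tol
            if same_font and close and not_indented:
                paragraphs[-1].append(line)
                continue
        paragraphs.append([line])  # start a new paragraph
    return [" ".join(l.text for l in p) for p in paragraphs]
```

A real clustering model would learn these boundaries from the font characteristic parameters rather than use fixed thresholds; the fixed rules here only make the data flow from lines to paragraphs concrete.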

In one embodiment, when the target element includes a chapter, performing chapter division on the target document includes:

dividing each page into a plurality of grids;

The main purpose of this embodiment is to extract the different parts of a single target document. The grids are rectangular, and the side length of each grid is not less than half of the minimum font size in the page; in this embodiment it is set to half of the minimum font size.

The typesetting order in this embodiment is from left to right and from top to bottom.

This embodiment is illustrated by the following example:

step 11: reading the text content of each page of the target document, and regenerating the read content into a picture according to the text position and the page size;

step 12: dividing the picture into a plurality of grids;

step 13: obtaining the average gray scale according to the typesetting sequence, deleting the grid when the color corresponding to the average gray scale is white, otherwise marking the grid;

step 14: and associating the character position with the marked grid position, and rearranging the characters to realize the extraction of chapter content.
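
Steps 12 to 14 can be sketched as follows, assuming the page has already been rendered (step 11) into a 2D list of gray values from 0 to 255 and that character positions are integer pixel coordinates. The function names `mark_grids` and `chars_in_marked_grids` and the `(char, x, y)` tuple layout are hypothetical choices for this example.

```python
def mark_grids(gray, cell):
    """Return (row, col) indices of grid cells that are not pure white.

    gray: 2D list of gray values (255 = white); cell: side length in pixels.
    Cells are visited left to right, top to bottom (the typesetting order).
    """
    height, width = len(gray), len(gray[0])
    marked = []
    for top in range(0, height, cell):
        for left in range(0, width, cell):
            block = [gray[y][x]
                     for y in range(top, min(top + cell, height))
                     for x in range(left, min(left + cell, width))]
            # An average gray level of 255 means every pixel is white.
            if sum(block) != 255 * len(block):
                marked.append((top // cell, left // cell))
    return marked


def chars_in_marked_grids(chars, marked, cell):
    """Keep characters that fall in marked cells, in typesetting order.

    chars: iterable of (char, x, y) with integer pixel coordinates.
    """
    marked_set = set(marked)
    kept = [(y // cell, x // cell, ch)
            for ch, x, y in chars
            if (y // cell, x // cell) in marked_set]
    kept.sort(key=lambda item: (item[0], item[1]))  # row first, then column
    return "".join(ch for _, _, ch in kept)
```

Using a pure-Python list keeps the sketch dependency-free; an image library would normally supply the gray values and make the per-cell averaging faster.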

In one embodiment, when the target element comprises a header, determining a header row of the target document comprises:

and performing similarity calculation on the pattern strings of the first N pages of the target document to determine a header line.

The similarity calculation of the pattern strings (character strings) can be realized by various methods, such as cosine similarity, matrix similarity, and string edit distance. It can be understood that when a certain line is the same across several pages of the target document, that line can be regarded as a header or a footer; even if that line contains no characters on some page, the filtering of the headers and footers of the target document is not affected.

This embodiment is illustrated by the following example:

step 21: reading a target document, identifying and obtaining a single character;

step 22: merging the single characters according to the rows through the character positions;

step 23: converting the merged characters into a pattern string; for example, "12 Maotai shares Limited group" is converted into [number] + [character];

step 24: performing row-wise similarity calculation on the first 10 pages of the target document, and determining header rows by a mechanism similar to PageRank;

step 25: after the header row is determined, the header is automatically matched and filtered when the target document is read.
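
Steps 23 and 24 can be sketched with the Python standard library. This example replaces the PageRank-like mechanism, which the description does not detail, with a plain mean of pairwise similarities computed by `difflib.SequenceMatcher`; the function names `to_pattern` and `is_repeated_row`, the threshold of 0.9, and the representation of a page as a list of text lines are assumptions.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations


def to_pattern(line):
    """Convert a line into a pattern string such as "[number]+[character]".

    Runs of digits become [number], runs of other non-space characters
    become [character], and adjacent identical tokens are collapsed.
    """
    tokens = []
    for match in re.finditer(r"(?P<number>\d+)|(?P<character>[^\d\s]+)", line):
        token = f"[{match.lastgroup}]"
        if not tokens or tokens[-1] != token:
            tokens.append(token)
    return "+".join(tokens)


def is_repeated_row(pages, row, threshold=0.9):
    """Decide whether the line at index `row` repeats across the given pages.

    pages: list of pages, each a list of text lines. A negative `row`
    counts from the bottom of the page. Pages where the row is missing
    or empty are skipped, so they do not affect the result.
    """
    patterns = []
    for page in pages:
        if -len(page) <= row < len(page):
            pattern = to_pattern(page[row])
            if pattern:
                patterns.append(pattern)
    pairs = list(combinations(patterns, 2))
    if not pairs:
        return False
    mean = sum(SequenceMatcher(None, a, b).ratio() for a, b in pairs) / len(pairs)
    return mean >= threshold
```

Under these assumptions, `is_repeated_row(pages[:10], 0)` tests the top line of the first 10 pages for a header, and `is_repeated_row(pages[-10:], -1)` tests the bottom line of the last 10 pages for a footer. Pattern strings are deliberately coarse, so a practical threshold would need tuning on real documents.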

In one embodiment, when the target element includes a footer, determining a footer row of the target document comprises:

performing similarity calculation on the pattern strings of the last 10 pages of the target document to determine a footer row.

It should be noted that when the similarity of a certain line in 10 pages exceeds the set similarity threshold, a header line or a footer line may be determined; when the target element in the present application includes a plurality of elements, the target document is processed according to the element order.

The first core point of this application: after the target document is read, target elements are set for it in combination with its type tag; they can be set manually or automatically through the association relationship, so that extraction and filtering requirements for target documents in different scenarios can be met and the extracted content better matches user needs.

The second core point of this application: the target elements include paragraphs, chapters, headers, footers and the like; for target documents with different type tags, different combinations of target elements are set and a corresponding processing mode is matched to each element, so that the accuracy of extraction from the target document can be ensured.

Part of data in the formula is obtained by removing dimension and taking the value to calculate, and the formula is obtained by simulating a large amount of collected data through software and is closest to a real situation; the preset parameters and the preset threshold values in the formula are set by those skilled in the art according to actual conditions or obtained through simulation of a large amount of data.

The working principle of the invention is as follows:

reading a target document and acquiring a corresponding type label; and setting target elements for the type labels in a manual setting or automatic setting mode.

And calling a processing method corresponding to the target element, and analyzing the target document page by page to acquire target content.
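
The working principle above, in which each target element selects a processing method and multiple elements are applied in order, can be sketched as a small dispatch loop. The names `drop_row` and `filter_document`, the handler table, and the representation of a page as a list of text lines are hypothetical; the two handlers shown simply remove an already determined header or footer line.

```python
from typing import Callable, Dict, List, Sequence

Page = List[str]  # one page as a list of text lines (assumption for this sketch)
Handler = Callable[[List[Page]], List[Page]]


def drop_row(row: int) -> Handler:
    """Build a handler that removes the line at `row` from every page."""
    def handler(pages: List[Page]) -> List[Page]:
        result = []
        for page in pages:
            if -len(page) <= row < len(page):
                index = row % len(page)  # normalize negative indices
                page = page[:index] + page[index + 1:]
            result.append(page)
        return result
    return handler


def filter_document(pages: List[Page],
                    elements: Sequence[str],
                    handlers: Dict[str, Handler]) -> List[Page]:
    """Apply the processing method of each target element in order."""
    for element in elements:
        pages = handlers[element](pages)
    return pages
```

For example, with `handlers = {"header": drop_row(0), "footer": drop_row(-1)}`, calling `filter_document(pages, ["header", "footer"], handlers)` strips the first and last line of every page.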

Although the present invention has been described in detail with reference to the preferred embodiments, it will be understood by those skilled in the art that various changes may be made and equivalents may be substituted for elements thereof without departing from the spirit and scope of the present invention.

Claims

1. A content filtering method for a target object, comprising:

dividing and filtering the target document according to the target elements to obtain target content; wherein the type tag is set by a content attribute of the target document.

2. The content filtering method for a target object according to claim 1, wherein automatically setting a target element according to a type tag of the target document comprises:

acquiring a type label of a target document;

3. The method for filtering contents of a target object according to claim 1, wherein when said target element includes a paragraph, paragraph division is performed on a target document, including:

4. The content filtering method for a target object according to claim 1, wherein when the target element includes a chapter, performing chapter division on a target document includes:

5. The method of claim 1, wherein when the target element comprises a header, determining a header line of the target document comprises:

6. The content filtering method for the target object according to claim 5, wherein when the target element includes a footer, determining a footer line of the target document comprises:

7. The method of claim 6, wherein corresponding content in the target document is automatically filtered when determining the header line and/or the footer line.