CN110287458B

CN110287458B - Automatic annual report text title labeling system

Info

Publication number: CN110287458B
Application number: CN201910416616.XA
Authority: CN
Inventors: 梁倬骞; 潘定; 罗旭; 龙舜; 伍旭
Original assignee: Jinan University
Current assignee: Jinan University
Priority date: 2019-05-20
Filing date: 2019-05-20
Publication date: 2023-05-02
Anticipated expiration: 2039-05-20
Also published as: CN110287458A

Abstract

The invention discloses an automatic annual report text title labeling system, which operates as follows: A. primary titles and secondary titles, being of different levels, are each labeled in two passes; B. for primary titles, the first labeling adds an M label to titles that match the title template exactly and an S label to titles whose similarity to a template title reaches a threshold, and then screens out cross-level labels, and the second labeling re-matches the S-labeled titles by similarity calculation. The invention relates to the technical field of annual report text title labeling. The system selects a machine vision method to recognize the layout of the financial report text and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.

Description

Automatic annual report text title labeling system

Technical Field

The invention relates to the technical field of annual report text title labeling, in particular to an automatic annual report text title labeling system.

Background

Financial report text is strictly standardized and tightly structured, and its chapters and paragraphs contain rich disclosure information. One approach treats the font style in the PDF as a feature for extracting titles from the financial report. Analysis shows that parsing the PDF format in depth makes it possible to recognize the Chinese text in the PDF. However, because the PDFs published by enterprises do not follow a strictly uniform template that prescribes format information such as fonts, the font information is difficult to consolidate into a single extraction rule even when the fonts of the titles at every level are obtained.

When people look at a financial report, they can usually judge directly from prior knowledge which parts are titles and which are body text; that is, when reading an annual report, a reader can infer the chapter structure and summary content of the PDF financial report from the standardized disclosure format of annual reports.

Inspired by the principles of mathematical morphology, the inventors convert the problem of visually recognizing the annual report summary into a mathematical-morphology filtering operation. Taking the real operating environment into account, the invention selects a machine vision method to recognize the text layout of the financial report and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.

Disclosure of Invention

(I) Technical problem to be solved

Aiming at the defects of the prior art, the invention provides an automatic annual report text title labeling system, which solves the problem that titles are difficult to label accurately.

(II) Technical solution

In order to achieve the above purpose, the invention is realized by the following technical solution: an automatic annual report text title labeling system, comprising the following specific steps:

A. Primary titles and secondary titles, being of different levels, are each labeled in two passes;

B. For primary titles, the first labeling adds an M label to titles that match the title template exactly and an S label to titles whose similarity to a template title reaches a threshold, and then screens out cross-level labels; the second labeling re-matches the S-labeled titles by similarity calculation;

C. For secondary titles, the degree of matching between each secondary title and all secondary titles under the corresponding primary title of the template is calculated, and whether a label is finally added is determined according to the characteristics of the financial report; where no secondary titles exist, the adjacent previous and next titles are matched against the corresponding primary titles in the template, and whether the label is finally added is determined by judging whether the two adjacent primary titles match exactly or reach a certain similarity value.

Preferably, the specific steps of the first labeling in step B are as follows:

Step 1: a title in the financial report text that is identical to a title in the template is labeled M.

Step 2: for a title that has no identical match in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S.

Step 3: after step 2 is completed, some titles of a different level may have been labeled; the labels on these cross-level titles are removed, and the patterns of the M-labeled titles are counted to obtain the title pattern.
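Steps 1 and 2 of the first labeling can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `similarity` and `first_pass`, the threshold value 0.6, and the use of Python's `difflib.SequenceMatcher` as the similarity measure are all hypothetical stand-ins, since the text does not fix a formula or a threshold here. Step 3 (removing cross-level labels and counting title patterns) is not covered.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the actual measure is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def first_pass(titles, template, threshold=0.6):
    """Label each report title against the template.

    Returns a list of (title, label, template_title) triples, where label is
    "M" for an identical template title, "S" when the most similar template
    title reaches the (hypothetical) threshold, and None otherwise.
    """
    template_set = set(template)
    results = []
    for title in titles:
        # Step 1: identical title found in the template -> M
        if title in template_set:
            results.append((title, "M", title))
            continue
        # Step 2: otherwise pick the most similar template title -> S
        best = max(template, key=lambda t: similarity(title, t), default=None)
        if best is not None and similarity(title, best) >= threshold:
            results.append((title, "S", best))
        else:
            results.append((title, None, None))
    return results


if __name__ == "__main__":
    template = ["Taxes", "Accounting policies"]
    report = ["Taxes", "Accounting policy", "Chairman's letter"]
    for row in first_pass(report, template):
        print(row)
```

Keeping the matched template title next to each S label is a design convenience: the second labeling needs to know which template entry each S-labeled title was tentatively mapped to.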

Preferably, the second labeling in step B screens the titles labeled S by similarity calculation, so that titles that do not meet the requirements lose their label.

Preferably, the second labeling comprises the following specific steps:

Step 1: read the content of a title labeled S;

Step 2: obtain total, the number of secondary titles under the template title that corresponds to the current title;

Step 3: if total is equal to 0, judge whether the neighboring (context) primary titles exist in the template and are located at similar positions; if so, delete the S label and add the M label, that is, the primary title is considered to meet the matching requirement;

if total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template; if the similarity value reaches the threshold, delete the S label and add the M label; otherwise, delete the S label only;

Step 4: obtain the final title labeling result.
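The branching in steps 2 and 3 above can be sketched as follows. This is only an illustration under several assumptions: `second_pass`, `entry`, `template_children`, and `context_ok` are hypothetical names; the similarity measure (`difflib.SequenceMatcher`) and the threshold 0.6 are stand-ins; the aggregate similarity is taken as the mean best-match score, which the text does not specify; and when total is 0 but the context check fails, the sketch simply drops the S label, a case the text leaves unstated.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the actual measure is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def second_pass(entry, template_children, context_ok, threshold=0.6):
    """Resolve one S-labeled primary title to "M" or to no label (None).

    entry: {"matched": template title chosen in the first labeling,
            "children": secondary titles found under the title in the report}
    template_children: maps each template primary title to its secondary titles
    context_ok: True if the neighboring primary titles exist in the template
                at similar positions (computed elsewhere; hypothetical input)
    """
    # Step 2: number of secondary titles under the matched template title
    expected = template_children.get(entry["matched"], [])
    total = len(expected)

    # Step 3, total == 0: rely on the context check alone
    if total == 0:
        return "M" if context_ok else None  # else-branch is an assumption

    # Step 3, total != 0: compare secondary titles with the template's
    children = entry["children"]
    if not children:
        return None
    # Assumed aggregate: mean of each child's best match in the template
    score = sum(max(similarity(c, e) for e in expected) for c in children)
    score /= len(children)
    return "M" if score >= threshold else None


if __name__ == "__main__":
    tmpl = {"Taxes": [], "Accounting policies": ["Revenue recognition", "Inventories"]}
    e = {"matched": "Accounting policies", "children": ["Revenue recognition", "Inventory"]}
    print(second_pass(e, tmpl, context_ok=False))
```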

Preferably, the title template covers only financial reports.

(III) Beneficial effects

The invention provides an automatic annual report text title labeling system. The beneficial effects are as follows:

(1) The automatic annual report text title labeling system provides, through text similarity calculation, a title labeling method that takes the context information of the annual report into account. Annual report titles are short texts (for example, a tax item whose title is a single word), so they are difficult to label directly with a cosine coefficient similarity calculation. Financial report disclosure, however, tends to carry strong context information and to follow a canonical disclosure order. Exploiting this characteristic, the method combines the context information of each annual report title when calculating the similarity between the title and the text lines of the template, so that the title can be mapped more accurately to its corresponding template line and thereby labeled through the template. Combining context information in this way can greatly improve the coverage and accuracy of the labeling.
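To make the short-text point concrete, the sketch below computes a cosine coefficient over character counts and then blends it with the similarity of the neighboring titles. It is a hedged illustration only: the character-level vectorization, the function names `cosine` and `context_score`, and the blending weight 0.5 are assumptions introduced here, and the text does not say how context information is actually combined.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine coefficient between character-count vectors of two strings."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[ch] * vb[ch] for ch in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def context_score(report, i, template, j, weight=0.5):
    """Blend a title's own similarity with that of its neighbors.

    Compares report[i] with template[j], and the previous/next report titles
    with the previous/next template titles. The weight is hypothetical.
    """
    own = cosine(report[i], template[j])
    neighbors = []
    if i > 0 and j > 0:
        neighbors.append(cosine(report[i - 1], template[j - 1]))
    if i + 1 < len(report) and j + 1 < len(template):
        neighbors.append(cosine(report[i + 1], template[j + 1]))
    if not neighbors:
        return own
    return (1 - weight) * own + weight * sum(neighbors) / len(neighbors)


if __name__ == "__main__":
    report = ["Cash", "Tax", "Inventories"]
    template = ["Cash", "Taxes", "Inventories"]
    print(cosine(report[1], template[1]))
    print(context_score(report, 1, template, 1))
```

With a title of only a few characters, a single differing character moves the cosine value substantially; when the surrounding titles line up with the template in the same order, the blended score gives additional evidence that the mapping is right.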

Drawings

FIG. 1 is a flow chart of the first labeling of the present invention, which preliminarily matches titles;

FIG. 2 is a flow chart of the second labeling of the present invention, which screens the partially labeled titles;

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

Referring to FIGS. 1-2, an embodiment of the invention provides the following technical solution: an automatic annual report text title labeling system, comprising the following specific steps:

C. for secondary titles, the degree of matching between each secondary title and all secondary titles under the corresponding primary title of the template is calculated, and whether a label is finally added is determined according to the characteristics of the financial report; where no secondary titles exist, the adjacent previous and next titles are matched against the corresponding primary titles in the template, and whether the label is finally added is determined by judging whether the two adjacent primary titles match exactly or reach a certain similarity value;

further, the specific steps of the first labeling in step B are as follows:

step 1, a title in the financial report text that is identical to a title in the template is labeled M;

step 2, for a title that has no identical match in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S;

step 3, after step 2 is completed, some titles of a different level may have been labeled; the labels on these cross-level titles are removed, and the patterns of the M-labeled titles are counted to obtain the title pattern;

further, the second labeling in step B screens the titles labeled S by similarity calculation, so that titles that do not meet the requirements lose their label;

further, the second labeling comprises the following specific steps:

step 1, reading title content marked as S;

step 4, obtaining a final title labeling result;

further, the title template covers only the financial report. The invention works mainly on the financial report module of the annual report, so the title template covers only that module. Within the annual report, "financial report" is itself a primary title, with secondary and tertiary titles beneath it; a primary title in the title template is therefore equivalent to a secondary title under "financial report", and likewise a secondary title in the template is equivalent to a tertiary title under "financial report". The title template was extracted after manually analyzing annual reports from 100 fields, and the final title labeling results show that the template has good coverage and a high title matching rate, with the matching degree of primary and secondary titles reaching more than 80 percent.

In order to reflect the logical relation between titles of different levels and to make the codes more practical, the title template codes follow these rules:

Primary title: codes start at 10000 and increase in steps of 10000, so mathematically a primary title code is always divisible by 10000;

Secondary title: codes start at the code of the parent primary title plus 100 and increase in steps of 100, so mathematically a secondary title code is divisible by 100 but not by 10000.
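The coding rules above translate directly into arithmetic checks. The following is a small sketch; the helper names (`is_primary`, `is_secondary`, `parent_of`, `assign_codes`) and the input layout are hypothetical and introduced only for illustration.

```python
def is_primary(code: int) -> bool:
    """Primary title codes are positive multiples of 10000."""
    return code > 0 and code % 10000 == 0


def is_secondary(code: int) -> bool:
    """Secondary title codes are divisible by 100 but not by 10000."""
    return code > 0 and code % 100 == 0 and code % 10000 != 0


def parent_of(code: int) -> int:
    """Code of the primary title that a secondary title code belongs to."""
    return code - code % 10000


def assign_codes(template):
    """Assign codes to a template given as [(primary, [secondary, ...]), ...].

    Primary titles get 10000, 20000, ...; the secondary titles under each
    primary title get parent + 100, parent + 200, ...
    """
    codes = {}
    for i, (primary, secondaries) in enumerate(template, start=1):
        parent = i * 10000
        codes[parent] = primary
        for j, secondary in enumerate(secondaries, start=1):
            codes[parent + j * 100] = secondary
    return codes


if __name__ == "__main__":
    codes = assign_codes([("A", ["a1", "a2"]), ("B", [])])
    for code, title in codes.items():
        level = "primary" if is_primary(code) else "secondary"
        print(code, title, level, parent_of(code))
```

One consequence of the scheme is that the level of a title and its parent can be recovered from the code alone, without a lookup table. It also implies that a primary title can hold at most 99 secondary titles before the codes would run into the next primary title's code.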

It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation.

Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims

1. The automatic annual report text title labeling system comprises the following specific steps:

the second labeling comprises the following specific steps:

step 1, reading title content marked as S;

if total is not equal to 0, calculating the similarity between all secondary titles of the current title and the corresponding secondary titles of the template, deleting the S label and adding the M label if the similarity value reaches a threshold value, otherwise, deleting the S label only;

step 4, obtaining a final title labeling result.

2. The automatic annual report text heading labeling system of claim 1, wherein: the specific steps of the first labeling in the step B are as follows:

3. The automatic annual report text title labeling system of claim 2, wherein: the second labeling in step B screens the titles labeled S by similarity calculation, so that titles that do not meet the requirements lose their label.

4. The automatic annual report text heading labeling system of claim 1, wherein: the title template covers only the financial reports.