CN110287458B - Automatic annual report text title labeling system - Google Patents

Automatic annual report text title labeling system

Info

Publication number
CN110287458B
Authority
CN
China
Prior art keywords
title
titles
level
template
marked
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910416616.XA
Other languages
Chinese (zh)
Other versions
CN110287458A (en)
Inventor
梁倬骞
潘定
罗旭
龙舜
伍旭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jinan University
Original Assignee
Jinan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jinan University
Priority to CN201910416616.XA
Publication of CN110287458A
Application granted
Publication of CN110287458B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/166 Editing, e.g. inserting or deleting
    • G06F40/186 Templates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/12 Accounting
    • G06Q40/125 Finance or payroll

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Business, Economics & Management (AREA)
  • Health & Medical Sciences (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Data Mining & Analysis (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Technology Law (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an automatic title labeling system for annual report text and relates to the technical field of annual report text title labeling. The method is as follows: A. primary titles and secondary titles, which sit at different levels, are each labeled in two passes; B. for primary titles, the first labeling pass labels titles that exactly match the title template as M, labels titles whose similarity to a template title reaches a threshold as S, and screens out labels that cross levels, and the second labeling pass matches a second time the titles that were labeled through similarity calculation. The system uses a machine vision method to recognize the layout of the financial report text and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.

Description

Automatic annual report text title labeling system
Technical Field
The invention relates to the technical field of annual report text title labeling, in particular to an automatic annual report text title labeling system.
Background
Financial report text is strictly standardized and has a rigorous logical structure, and its chapters and paragraphs contain rich disclosure information. One approach to title extraction in financial reports uses the font style in the PDF as a feature. Analysis shows that parsing the PDF format in depth makes it possible to recognize the Chinese text in a PDF. However, because the PDFs published by enterprises do not follow a strictly uniform template that prescribes format information such as fonts, the font information is difficult to unify into an extraction rule even when the fonts of the titles at every level are obtained.
When people look at a financial report, they can usually tell directly from prior knowledge which lines are titles and which are body text. That is, when reading an annual report, a reader can infer the chapter structure and outline of the PDF financial report from the standardized way in which annual reports are disclosed.
Inspired by the principles of mathematical morphology, the inventors convert the problem of visually recognizing the outline of an annual report into a morphological filtering operation. Taking the real operating environment into account, the invention uses a machine vision method to recognize the layout of the financial report text and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.
Disclosure of Invention
(I) Technical problem to be solved
In view of the defects of the prior art, the invention provides an automatic annual report text title labeling system, which solves the problem that titles are difficult to label accurately.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution. The automatic annual report text title labeling system works by the following specific steps:
A. Primary titles and secondary titles, which sit at different levels, are each labeled in two passes.
B. Primary titles are matched as follows. In the first labeling pass, a title that exactly matches the title template is labeled M, a title whose similarity to a template title reaches a threshold is labeled S, and labels that cross levels are screened out. In the second labeling pass, the titles that were labeled through similarity calculation are matched a second time.
C. For secondary titles, the degree of match between the secondary titles and all secondary titles under the corresponding primary title of the template is calculated, and whether a title finally receives a label is decided according to the characteristics of the financial report. Where a primary title has no secondary titles, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is decided by checking whether those two adjacent primary titles match exactly or reach a certain similarity value.
Preferably, the first labeling pass in step B comprises the following specific steps:
Step 1. A title in the financial report text that is identical to a title in the template is labeled M.
Step 2. For a title that has no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is labeled S.
Step 3. After step 2, some titles at a different level may also have been labeled. The labels on these titles are removed, and the patterns of the titles labeled M are counted to obtain the title pattern.
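Steps 1 and 2 of the first labeling pass can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name the similarity measure or the threshold, so `difflib.SequenceMatcher` and the value `0.6` are assumptions, as are the function names. Step 3 (removing cross-level labels and counting title patterns) is deliberately left out because it depends on layout information the text does not specify.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the measure itself is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def first_pass(report_titles, template_titles, threshold=0.6):
    """Label each report title M (exact match), S (similar) or None.

    Returns a list of (title, label, matched_template_title) tuples.
    """
    template_set = set(template_titles)
    results = []
    for title in report_titles:
        # Step 1: an identical title in the template is labeled M.
        if title in template_set:
            results.append((title, "M", title))
            continue
        # Step 2: otherwise take the most similar template title and
        # label the report title S if the score reaches the threshold.
        best = max(template_titles, key=lambda t: similarity(title, t))
        if similarity(title, best) >= threshold:
            results.append((title, "S", best))
        else:
            results.append((title, None, None))
    return results


if __name__ == "__main__":
    report = ["Monetary funds", "Accounts receivables", "Company overview"]
    template = ["Monetary funds", "Accounts receivable", "Inventory"]
    for row in first_pass(report, template):
        print(row)
```

Keeping the matched template title alongside each label lets the second pass look up the secondary titles of that template entry without repeating the search.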
Preferably, the second labeling pass in step B screens the titles that were labeled S through similarity calculation and removes those that do not meet the requirements.
Preferably, the second labeling pass comprises the following specific steps:
Step 1. Read the content of a title labeled S.
Step 2. Obtain total, the number of secondary titles under the template title that corresponds to the current title.
Step 3. If total is equal to 0, check whether the primary titles in the context of the current title exist in the template and are located at similar positions. If so, delete the S label and add the M label; that is, the primary title is considered to meet the matching requirement.
If total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template. If the similarity reaches the threshold, delete the S label and add the M label; otherwise, only delete the S label.
Step 4. Obtain the final title labeling result.
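The second labeling pass can be sketched along the same lines. Several details below are assumptions made only for illustration: the similarity measure (`difflib.SequenceMatcher`), the threshold `0.6`, the averaging of best-match scores over secondary titles, and the reading of "similar position" as "the neighbouring titles are labeled M and appear in the same order as in the template". The text also does not say what happens when total is 0 and the context check fails; this sketch simply drops the S label in that case.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the measure itself is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def children_score(report_children, template_children):
    """Average, over the report's secondary titles, of the best template match."""
    if not report_children:
        return 0.0
    best = [max(similarity(c, t) for t in template_children)
            for c in report_children]
    return sum(best) / len(best)


def neighbours_agree(entries, i, order):
    """Assumed context check: adjacent titles are M and keep template order."""
    pos = order[entries[i]["match"]]
    prev_ok = i == 0 or (entries[i - 1]["label"] == "M"
                         and order[entries[i - 1]["match"]] < pos)
    next_ok = i == len(entries) - 1 or (entries[i + 1]["label"] == "M"
                                        and order[entries[i + 1]["match"]] > pos)
    return prev_ok and next_ok


def second_pass(entries, template, threshold=0.6):
    """entries: dicts with keys title, label, match, children (report order).
    template: dict mapping primary title -> list of secondary titles (ordered).
    Returns the final label ("M" or None) for every entry."""
    order = {title: k for k, title in enumerate(template)}
    final = []
    for i, entry in enumerate(entries):
        label = entry["label"]
        if label == "S":                              # step 1
            tmpl_children = template[entry["match"]]
            total = len(tmpl_children)                # step 2
            if total == 0:                            # step 3, no secondary titles
                label = "M" if neighbours_agree(entries, i, order) else None
            else:                                     # step 3, compare secondary titles
                score = children_score(entry["children"], tmpl_children)
                label = "M" if score >= threshold else None
        final.append(label)                           # step 4
    return final
```

Every S label is resolved to either M or no label, so the final result contains only confirmed matches.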
Preferably, the title template covers only financial reports.
(III) Beneficial effects
The invention provides an automatic annual report text title labeling system. The beneficial effects are as follows:
(1) The system provides a title labeling method that takes the context of annual report titles into account through text similarity calculation. Annual report titles are short texts (for example tax items, or titles consisting of a single word), so it is difficult to label them directly with a cosine similarity calculation. Financial report disclosure, however, usually carries strong contextual requirements and follows a standard disclosure order. Exploiting this characteristic, the method takes the context of each annual report title into account when calculating the similarity between title lines in the report and in the template, so that a report title can be mapped more accurately to its corresponding template line and the report is labeled through the template. Combining context information in this way greatly improves the coverage and accuracy of labeling.
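The effect of context on short titles can be illustrated with a small sketch. It is only one possible reading of the idea: the character-bigram representation, the equal weighting of a title's own score and its neighbours' scores, and the names `cosine` and `context_score` are all assumptions and do not come from the patent.

```python
from collections import Counter
from math import sqrt


def bigrams(text: str) -> Counter:
    """Character-bigram counts; a one-character text falls back to itself."""
    text = text.lower()
    grams = Counter(text[i:i + 2] for i in range(len(text) - 1))
    return grams or Counter([text])


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bigram count vectors of two titles."""
    va, vb = bigrams(a), bigrams(b)
    dot = sum(va[g] * vb[g] for g in va)
    norm = (sqrt(sum(v * v for v in va.values()))
            * sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def context_score(report, i, template, j, weight=0.5):
    """Blend a title's own similarity with that of its aligned neighbours.

    report[i] is compared with template[j]; the preceding and following
    titles are compared with template[j - 1] and template[j + 1].
    """
    own = cosine(report[i], template[j])
    neighbours = []
    if i > 0 and j > 0:
        neighbours.append(cosine(report[i - 1], template[j - 1]))
    if i + 1 < len(report) and j + 1 < len(template):
        neighbours.append(cosine(report[i + 1], template[j + 1]))
    context = sum(neighbours) / len(neighbours) if neighbours else own
    return weight * own + (1 - weight) * context


if __name__ == "__main__":
    report = ["Monetary funds", "Taxes", "Inventory"]
    template = ["Monetary funds", "Taxes payable", "Inventory"]
    print(cosine(report[1], template[1]))              # short title on its own
    print(context_score(report, 1, template, 1))       # with neighbouring titles
```

In this example the short title "Taxes" scores modestly against "Taxes payable" on its own, while the matching neighbours on both sides raise the blended score, which is the behaviour the paragraph above attributes to context information.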
Drawings
FIG. 1 is a flow chart of the first labeling pass of the present invention, which preliminarily matches titles;
FIG. 2 is a flow chart of the second labeling pass of the present invention, which screens the partially labeled titles.
Detailed Description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-2, the embodiment of the invention provides the following technical solution. The automatic annual report text title labeling system works by the following specific steps:
A. Primary titles and secondary titles, which sit at different levels, are each labeled in two passes.
B. Primary titles are matched as follows. In the first labeling pass, a title that exactly matches the title template is labeled M, a title whose similarity to a template title reaches a threshold is labeled S, and labels that cross levels are screened out. In the second labeling pass, the titles that were labeled through similarity calculation are matched a second time.
C. For secondary titles, the degree of match between the secondary titles and all secondary titles under the corresponding primary title of the template is calculated, and whether a title finally receives a label is decided according to the characteristics of the financial report. Where a primary title has no secondary titles, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is decided by checking whether those two adjacent primary titles match exactly or reach a certain similarity value.
Further, the first labeling pass in step B comprises the following specific steps:
Step 1. A title in the financial report text that is identical to a title in the template is labeled M.
Step 2. For a title that has no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is labeled S.
Step 3. After step 2, some titles at a different level may also have been labeled. The labels on these titles are removed, and the patterns of the titles labeled M are counted to obtain the title pattern.
Further, the second labeling pass in step B screens the titles that were labeled S through similarity calculation and removes those that do not meet the requirements.
Further, the second labeling pass comprises the following specific steps:
Step 1. Read the content of a title labeled S.
Step 2. Obtain total, the number of secondary titles under the template title that corresponds to the current title.
Step 3. If total is equal to 0, check whether the primary titles in the context of the current title exist in the template and are located at similar positions. If so, delete the S label and add the M label; that is, the primary title is considered to meet the matching requirement.
If total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template. If the similarity reaches the threshold, delete the S label and add the M label; otherwise, only delete the S label.
Step 4. Obtain the final title labeling result.
Further, the title template covers only the financial report. The invention mainly works on the financial report module of the annual report, so the coverage of the title template is limited to that module. In the annual report, "financial report" is itself a primary title, with secondary and tertiary titles below it. For the title template, a primary title is therefore equivalent to a secondary title under "financial report", and likewise a secondary title of the template is equivalent to a tertiary title under "financial report". The title template was extracted after manual analysis of 100 annual reports. The final title labeling results show that the template has good coverage and a high title matching rate, with the matching degree of both primary and secondary titles reaching more than 80 percent.
To reflect the logical relationship between titles of different levels and make the codes more practical, the title template codes follow these rules:
Primary title: codes start at 10000 and increase in units of 10000; mathematically, a primary title code is divisible by 10000.
Secondary title: codes start at the code of the parent primary title plus 100 and increase in units of 100; mathematically, a secondary title code is divisible by 100 but not by 10000.
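The coding rules can be illustrated with a short sketch. The function names and the sample titles are hypothetical; the limit of 99 secondary titles per primary title is not stated in the text but follows from the rule that a secondary code must not be divisible by 10000.

```python
def build_codes(template):
    """Assign codes to a template given as {primary title: [secondary titles]}.

    Primary titles get 10000, 20000, ...; secondary titles get the parent
    code plus 100, 200, ... Returns a dict mapping code -> title.
    """
    codes = {}
    for p, (primary, secondaries) in enumerate(template.items(), start=1):
        if len(secondaries) > 99:
            # The 100th secondary code would collide with the next primary code.
            raise ValueError("more than 99 secondary titles under one primary title")
        base = 10000 * p
        codes[base] = primary
        for s, secondary in enumerate(secondaries, start=1):
            codes[base + 100 * s] = secondary
    return codes


def level_of(code: int) -> int:
    """Return 1 for a primary title code and 2 for a secondary title code."""
    if code % 10000 == 0:
        return 1
    if code % 100 == 0:
        return 2
    raise ValueError(f"{code} is not a valid title code")


def parent_of(code: int) -> int:
    """Return the primary title code that a code belongs to."""
    return code - code % 10000


if __name__ == "__main__":
    template = {
        "Monetary funds": [],
        "Accounts receivable": ["Classified by aging", "Bad debt provision"],
    }
    for code, title in build_codes(template).items():
        print(code, level_of(code), title)
```

Because the level and the parent can both be recovered from the code by simple arithmetic, no separate parent pointer or level field is needed in the template.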
It is noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.

Claims (4)

1. An automatic annual report text title labeling system, which works by the following specific steps:
A. primary titles and secondary titles, which sit at different levels, are each labeled in two passes;
B. primary titles are matched, wherein in the first labeling pass a title that exactly matches the title template is labeled M, a title whose similarity to a template title reaches a threshold is labeled S, and labels that cross levels are screened out, and in the second labeling pass the titles that were labeled through similarity calculation are matched a second time;
C. for secondary titles, the degree of match between the secondary titles and all secondary titles under the corresponding primary title of the template is calculated, and whether a title finally receives a label is decided according to the characteristics of the financial report; where a primary title has no secondary titles, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is decided by checking whether those two adjacent primary titles match exactly or reach a certain similarity value;
the second labeling pass comprises the following specific steps:
step 1, reading the content of a title labeled S;
step 2, obtaining total, the number of secondary titles under the template title that corresponds to the current title;
step 3, if total is equal to 0, checking whether the primary titles in the context of the current title exist in the template and are located at similar positions, and if so, deleting the S label and adding the M label, that is, considering that the primary title meets the matching requirement;
if total is not equal to 0, calculating the similarity between all secondary titles of the current title and the corresponding secondary titles of the template, deleting the S label and adding the M label if the similarity reaches a threshold, and otherwise only deleting the S label;
step 4, obtaining the final title labeling result.
2. The automatic annual report text title labeling system of claim 1, wherein the first labeling pass in step B comprises the following specific steps:
step 1, a title in the financial report text that is identical to a title in the template is labeled M;
step 2, for a title that has no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is labeled S;
step 3, after step 2, some titles at a different level may also have been labeled; the labels on these titles are removed, and the patterns of the titles labeled M are counted to obtain the title pattern.
3. The automatic annual report text title labeling system of claim 2, wherein the second labeling pass in step B screens the titles that were labeled S through similarity calculation and removes those that do not meet the requirements.
4. The automatic annual report text title labeling system of claim 1, wherein the title template covers only the financial report.
CN201910416616.XA 2019-05-20 2019-05-20 Automatic annual report text title labeling system Active CN110287458B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910416616.XA CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual report text title labeling system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910416616.XA CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual report text title labeling system

Publications (2)

Publication Number Publication Date
CN110287458A (en) 2019-09-27
CN110287458B (en) 2023-05-02

Family

ID=68002115

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910416616.XA Active CN110287458B (en) 2019-05-20 2019-05-20 Automatic annual report text title labeling system

Country Status (1)

Country Link
CN (1) CN110287458B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111026849B (en) * 2019-12-17 2023-09-19 北京百度网讯科技有限公司 Data processing method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102780647A (en) * 2012-07-21 2012-11-14 上海量明科技发展有限公司 Method, client and system for implementing mind map function by instant messaging tool
CN105045847A (en) * 2015-07-01 2015-11-11 广州市万隆证券咨询顾问有限公司 Method for extracting Chinese institutional unit name from text information
CN105809458A (en) * 2014-12-29 2016-07-27 苏宁云商集团股份有限公司 Advertisement accurate delivery method and system in e-commerce site
CN109101538A (en) * 2018-06-29 2018-12-28 中译语通科技股份有限公司 A kind of entity abstracting method and system towards Chinese patent text
CN109766524A (en) * 2018-12-28 2019-05-17 重庆邮电大学 A kind of merger & reorganization class notice information abstracting method and system

Also Published As

Publication number Publication date
CN110287458A (en) 2019-09-27

Similar Documents

Publication Publication Date Title
CN114610515B (en) Multi-feature log anomaly detection method and system based on log full semantics
US11756323B2 (en) Method of automatically recognizing and classifying design information in imaged PID drawing and method of automatically creating intelligent PID drawing using design information stored in database
JP3842006B2 (en) Form classification device, form classification method, and computer-readable recording medium storing a program for causing a computer to execute these methods
CN109145260B (en) Automatic text information extraction method
CN107122342B (en) Text code recognition method and device
CN110175334B (en) Text knowledge extraction system and method based on custom knowledge slot structure
CN111581345A (en) Document level event extraction method and device
CN107169321B (en) Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology
CN110765231A (en) Chapter event extraction method based on common-finger fusion
CN110287458B (en) Automatic annual newspaper text title labeling system
CN112257413A (en) Address parameter processing method and related equipment
CN114154484B (en) Construction professional term library intelligent construction method based on mixed depth semantic mining
CN110795607A (en) Equipment guarantee data matching method and system based on multi-stage similarity calculation
CN107133201B (en) Hot spot information acquisition method and device based on text code recognition
CN111026743B (en) Rail transit engineering project structure data standardization method
CN111754352A (en) Method, device, equipment and storage medium for judging correctness of viewpoint statement
CN115687790B (en) Advertisement pushing method and system based on big data and cloud platform
CN111291535A (en) Script processing method and device, electronic equipment and computer readable storage medium
CN108073678A (en) Applied to document analyzing and processing method, system and the device in big data analysis
CN114595661B (en) Method, apparatus, and medium for reviewing bid document
CN116522872A (en) Similarity calculation-based metadata field Chinese name completion method, storage medium and system
CN113609864B (en) Text semantic recognition processing system and method based on industrial control system
CN114611515B (en) Method and system for identifying enterprise actual control person based on enterprise public opinion information
CN109993381B (en) Demand management application method, device, equipment and medium based on knowledge graph
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant