CN110287458B - Automatic annual report text title labeling system - Google Patents
Automatic annual report text title labeling system
- Publication number: CN110287458B
- Application number: CN201910416616.XA
- Authority: CN (China)
- Prior art keywords: title, titles, level, template, marked
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/186—Templates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Technology Law (AREA)
- Marketing (AREA)
- Economics (AREA)
- Development Economics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic annual report text title labeling system and relates to the technical field of annual report text title labeling. The method is as follows: A. the primary titles and the secondary titles, which belong to different levels, are labeled in two passes; B. the primary titles are matched: in the first labeling pass, titles that match the title template exactly receive an M label, titles whose computed similarity to a template title reaches a threshold receive an S label, and labels that cross levels are screened out; in the second labeling pass, the titles labeled through similarity calculation are matched a second time. The system uses a machine vision method to identify the layout of the financial report text and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.
Description
Technical Field
The invention relates to the technical field of annual report text title labeling, in particular to an automatic annual report text title labeling system.
Background
Financial report text is strictly standardized and rigorously structured, and its chapters and paragraphs contain rich disclosure information. One approach to title extraction in financial reports uses the font style in the PDF as a feature. Analysis shows that parsing the PDF format in depth makes it possible to recognize the Chinese text in the PDF. However, because the PDFs published by enterprises do not follow a strictly uniform template that prescribes format information such as fonts, even if the fonts of the titles at every level are obtained, the font information is difficult to consolidate into a single extraction rule.
When people look at financial report text, they can usually tell directly from prior knowledge which lines are titles and which are body text. That is, when reading an annual report, a reader can determine the chapter structure and outline of the PDF financial report from the standardized disclosure format of annual reports.
Inspired by the principles of mathematical morphology, the inventors convert the problem of visually recognizing the annual report outline into a mathematical-morphology filtering operation. Taking the real operating environment into account, the invention selects a machine vision method to recognize the layout of the financial report text and combines machine vision with rule-based and statistical text extraction to solve the problem that titles are difficult to label accurately.
Disclosure of Invention
(I) Technical problem to be solved
In view of the shortcomings of the prior art, the invention provides an automatic annual report text title labeling system, which solves the problem that titles are difficult to label accurately.
(II) Technical solution
To achieve the above purpose, the invention is realized by the following technical solution: an automatic annual report text title labeling system, comprising the following specific steps:
A. the primary titles and the secondary titles, which belong to different levels, are labeled in two passes;
B. the primary titles are matched: in the first labeling pass, titles that match the title template exactly receive an M label, titles whose computed similarity to a template title reaches a threshold receive an S label, and labels that cross levels are screened out; in the second labeling pass, the titles labeled through similarity calculation are matched a second time;
C. where secondary titles exist, the degree of matching between them and all secondary titles under the corresponding primary title of the template is calculated, and whether the title finally receives a label is determined according to the characteristics of the financial report; where no secondary titles exist, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is determined by judging whether the two adjacent primary titles match exactly or reach a certain similarity value.
Preferably, the specific steps of the first labeling in step B are as follows:
step 1, a title that is identical in the financial report text and in the template is marked as M;
step 2, for a title with no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is marked as S;
step 3, after step 2 is completed, some titles of a different level may have been marked; the marks on these titles are removed, and the patterns of the titles marked M are tallied to obtain the title pattern.
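Steps 1 and 2 of the first labeling can be sketched as follows. This is a minimal illustration, not the patented implementation: the similarity measure (Python's difflib.SequenceMatcher), the 0.6 threshold, the function names and the sample titles are all assumptions introduced here. Step 3 is omitted because it depends on level and layout information that the text does not specify.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the measure itself is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def first_pass(report_titles, template_titles, threshold=0.6):
    """Return one (label, matched_template_title) pair per report title.

    'M' = identical to a template title (step 1)
    'S' = most similar template title reaches the threshold (step 2)
    None = no usable match
    """
    template_set = set(template_titles)
    labels = []
    for title in report_titles:
        if title in template_set:
            labels.append(("M", title))
            continue
        # Step 2: pick the template title with the highest similarity.
        best = max(template_titles, key=lambda t: similarity(title, t))
        if similarity(title, best) >= threshold:
            labels.append(("S", best))
        else:
            labels.append((None, None))
    return labels


# Hypothetical sample data for illustration only.
template = ["Monetary funds", "Accounts receivable", "Inventories"]
report = ["Inventories", "Accounts receivable (net)", "Corporate governance"]
for title, (label, match) in zip(report, first_pass(report, template)):
    print(f"{title!r}: label={label}, template={match!r}")
```

The S label deliberately records only a candidate match; whether it is kept is decided in the second labeling.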
Preferably, the second labeling in step B uses similarity calculation to screen out the titles carrying the S label that do not meet the requirements.
Preferably, the second labeling comprises the following specific steps:
step 1, read the content of a title marked as S;
step 2, obtain total, the number of secondary titles under the template title that corresponds to the current title;
step 3, if total is equal to 0, judge whether the primary titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add the M label, that is, the primary title is considered to meet the matching requirement;
if total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template; if the similarity value reaches the threshold, delete the S label and add the M label; otherwise, only delete the S label;
step 4, obtain the final title labeling result.
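The second labeling can be sketched as below, under several assumptions that are not stated in the text: the template is modeled as an ordered dict from primary title to its list of secondary titles; "similar position" is read as "an adjacent report title is matched to the adjacent template title"; an S label that fails the total == 0 check is simply dropped; and the secondary-title similarity is the mean of best pairwise SequenceMatcher ratios. All names and sample titles are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the measure itself is an assumption."""
    return SequenceMatcher(None, a, b).ratio()


def secondary_score(report_subs, template_subs):
    """Mean, over report secondary titles, of the best template similarity."""
    if not report_subs:
        return 0.0
    best = [max(similarity(r, t) for t in template_subs) for r in report_subs]
    return sum(best) / len(best)


def second_pass(entries, template, threshold=0.6):
    """Resolve every 'S' label to 'M' or to no label (None).

    entries: report primary titles in order, each a dict with
             'label', 'match' (template title) and 'subs' (secondary titles).
    template: dict mapping template primary title -> secondary titles.
    """
    order = list(template)  # template primary titles in disclosure order
    for i, entry in enumerate(entries):
        if entry["label"] != "S":
            continue
        total = len(template[entry["match"]])
        if total == 0:
            # No secondary titles: fall back to the neighbouring titles.
            pos = order.index(entry["match"])
            prev_ok = i > 0 and pos > 0 and entries[i - 1]["match"] == order[pos - 1]
            next_ok = (i + 1 < len(entries) and pos + 1 < len(order)
                       and entries[i + 1]["match"] == order[pos + 1])
            entry["label"] = "M" if (prev_ok or next_ok) else None
        else:
            score = secondary_score(entry["subs"], template[entry["match"]])
            entry["label"] = "M" if score >= threshold else None
    return entries


# Hypothetical sample data for illustration only.
template = {
    "Monetary funds": [],
    "Accounts receivable": ["Aging analysis", "Bad debt provision"],
    "Inventories": [],
}
entries = [
    {"label": "M", "match": "Monetary funds", "subs": []},
    {"label": "S", "match": "Accounts receivable",
     "subs": ["Aging analysis", "Bad debt provisions"]},
    {"label": "S", "match": "Inventories", "subs": []},
]
print([e["label"] for e in second_pass(entries, template)])
```

After this pass no S labels remain: each candidate is either promoted to M or left unlabeled, which corresponds to the final labeling result of step 4.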
Preferably, the title template covers only the financial report.
(III) Beneficial effects
The invention provides an automatic annual report text title labeling system. The beneficial effects are as follows:
(1) The system provides a title labeling method that takes the context information of the annual report into account through text similarity calculation. Annual report titles are short texts (for example "tax items", a title consisting of a single word), so they are difficult to label directly with a cosine-coefficient similarity calculation. Financial report disclosure, however, typically follows well-defined context requirements and a canonical disclosure order. Exploiting this characteristic, the method combines the context information of each annual report title with the similarity between the title and the text lines of the template, so that the annual report can be mapped more accurately to the template line corresponding to each title and labeled through the template. Combining context information in this way can greatly improve both the coverage and the accuracy of the labeling.
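The difficulty with short titles, and the benefit of context, can be illustrated with a small sketch. It assumes a word-level cosine coefficient and an equal-weight blend of a title's own similarity with that of its neighbours; the tokenization, the weight and the sample titles are assumptions made here for illustration and are not taken from the patent.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine coefficient between word-count vectors of two titles."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0


def context_score(report, i, template, j, weight=0.5):
    """Blend the title's own similarity with that of its neighbours.

    report[i] is compared with template[j]; the previous and next titles
    are compared with the previous and next template lines.
    """
    own = cosine(report[i], template[j])
    neighbours = []
    if i > 0 and j > 0:
        neighbours.append(cosine(report[i - 1], template[j - 1]))
    if i + 1 < len(report) and j + 1 < len(template):
        neighbours.append(cosine(report[i + 1], template[j + 1]))
    if not neighbours:
        return own
    return weight * own + (1 - weight) * sum(neighbours) / len(neighbours)


# Hypothetical sample data for illustration only.
template = ["Accounts receivable", "Taxes", "Deferred income"]
report = ["Accounts receivable", "Tax items", "Deferred income"]
print(cosine(report[1], template[1]))             # own similarity only
print(context_score(report, 1, template, 1))      # with neighbouring titles
```

In this example the short title shares no word with its template line, so the plain cosine coefficient is zero, while the matching neighbours raise the blended score to about 0.5.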
Drawings
FIG. 1 is a flow chart of the first labeling pass of the invention, which preliminarily matches titles;
FIG. 2 is a flow chart of the second labeling pass of the invention, which screens the partially marked titles.
Detailed Description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of protection of the invention.
Referring to FIGS. 1-2, an embodiment of the invention provides the following technical solution: an automatic annual report text title labeling system, comprising the following specific steps:
A. the primary titles and the secondary titles, which belong to different levels, are labeled in two passes;
B. the primary titles are matched: in the first labeling pass, titles that match the title template exactly receive an M label, titles whose computed similarity to a template title reaches a threshold receive an S label, and labels that cross levels are screened out; in the second labeling pass, the titles labeled through similarity calculation are matched a second time;
C. where secondary titles exist, the degree of matching between them and all secondary titles under the corresponding primary title of the template is calculated, and whether the title finally receives a label is determined according to the characteristics of the financial report; where no secondary titles exist, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is determined by judging whether the two adjacent primary titles match exactly or reach a certain similarity value;
further, the specific steps of the first labeling in step B are as follows:
step 1, a title that is identical in the financial report text and in the template is marked as M;
step 2, for a title with no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is marked as S;
step 3, after step 2 is completed, some titles of a different level may have been marked; the marks on these titles are removed, and the patterns of the titles marked M are tallied to obtain the title pattern;
further, the second labeling in step B uses similarity calculation to screen out the titles carrying the S label that do not meet the requirements;
further, the second labeling comprises the following specific steps:
step 1, read the content of a title marked as S;
step 2, obtain total, the number of secondary titles under the template title that corresponds to the current title;
step 3, if total is equal to 0, judge whether the primary titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add the M label, that is, the primary title is considered to meet the matching requirement;
if total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template; if the similarity value reaches the threshold, delete the S label and add the M label; otherwise, only delete the S label;
step 4, obtain the final title labeling result;
further, the title template covers only the financial report. The invention works mainly on the financial report module of the annual report, so the coverage of the title template is limited to that module. In the annual report, "financial report" is itself a primary title, with secondary and tertiary titles below it. For the title template, a primary title therefore corresponds to a secondary title within "financial report", and likewise a secondary title of the template corresponds to a tertiary title within "financial report". The title template was extracted after manual analysis of the annual reports of 100 fields. The final title labeling results show that the template has good coverage and a high title matching rate: the matching degree of both primary and secondary titles can reach more than 80 percent.
To reflect the logical relationship between titles of different levels and make the codes more practical, the title template codes follow these rules:
primary title: codes start at "10000" and increase in steps of "10000"; mathematically, a primary title code is therefore divisible by "10000";
secondary title: codes start at the code of the parent primary title plus "100" and increase in steps of "100"; mathematically, a secondary title code is therefore divisible by "100" but not by "10000".
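The coding rules above can be illustrated with a short sketch. The function names and the sample template are hypothetical; only the step sizes and the divisibility properties come from the rules themselves. Note that under these rules a primary title can hold at most 99 secondary titles before a code would collide with the next primary code.

```python
def title_level(code: int) -> int:
    """Return 1 for a primary title code and 2 for a secondary title code."""
    if code <= 0 or code % 100 != 0:
        raise ValueError(f"not a valid template code: {code}")
    return 1 if code % 10000 == 0 else 2


def parent_code(code: int) -> int:
    """Primary title code that a code belongs to (a primary maps to itself)."""
    return code - code % 10000


def assign_codes(template):
    """Assign codes to a template given as {primary title: [secondary titles]}."""
    codes = {}
    for p, (primary, subs) in enumerate(template.items(), start=1):
        base = p * 10000                     # primary codes: 10000, 20000, ...
        codes[base] = primary
        for s, sub in enumerate(subs, start=1):
            codes[base + s * 100] = sub      # secondary codes: base + 100, base + 200, ...
    return codes


# Hypothetical sample template for illustration only.
template = {
    "Monetary funds": [],
    "Accounts receivable": ["Aging analysis", "Bad debt provision"],
}
for code, title in assign_codes(template).items():
    print(code, title_level(code), title)
```

Because the level and the parent can both be recovered from the code by arithmetic alone, no separate parent pointer is needed when storing labeled titles.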
It is noted that relational terms such as first and second are used solely to distinguish one entity or action from another, without necessarily requiring or implying any actual such relationship or order between those entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus.
Although embodiments of the present invention have been shown and described, it will be understood by those skilled in the art that various changes, modifications, substitutions and alterations can be made therein without departing from the principles and spirit of the invention, the scope of which is defined in the appended claims and their equivalents.
Claims (4)
1. An automatic annual report text title labeling system, comprising the following specific steps:
A. the primary titles and the secondary titles, which belong to different levels, are labeled in two passes;
B. the primary titles are matched: in the first labeling pass, titles that match the title template exactly receive an M label, titles whose computed similarity to a template title reaches a threshold receive an S label, and labels that cross levels are screened out; in the second labeling pass, the titles labeled through similarity calculation are matched a second time;
C. where secondary titles exist, the degree of matching between them and all secondary titles under the corresponding primary title of the template is calculated, and whether the title finally receives a label is determined according to the characteristics of the financial report; where no secondary titles exist, the adjacent preceding and following titles are matched against the corresponding primary titles in the template, and whether the label is finally added is determined by judging whether the two adjacent primary titles match exactly or reach a certain similarity value;
the second labeling comprises the following specific steps:
step 1, read the content of a title marked as S;
step 2, obtain total, the number of secondary titles under the template title that corresponds to the current title;
step 3, if total is equal to 0, judge whether the primary titles in the context of the current title exist in the template and are located at similar positions; if so, delete the S label and add the M label, that is, the primary title is considered to meet the matching requirement;
if total is not equal to 0, calculate the similarity between all secondary titles of the current title and the corresponding secondary titles of the template; if the similarity value reaches the threshold, delete the S label and add the M label; otherwise, only delete the S label;
step 4, obtain the final title labeling result.
2. The automatic annual report text title labeling system of claim 1, wherein the specific steps of the first labeling in step B are as follows:
step 1, a title that is identical in the financial report text and in the template is marked as M;
step 2, for a title with no identical match in the template, the template title with the highest similarity is found through similarity calculation, and the title is marked as S;
step 3, after step 2 is completed, some titles of a different level may have been marked; the marks on these titles are removed, and the patterns of the titles marked M are tallied to obtain the title pattern.
3. The automatic annual report text title labeling system of claim 2, wherein the second labeling in step B uses similarity calculation to screen out the titles carrying the S label that do not meet the requirements.
4. The automatic annual report text title labeling system of claim 1, wherein the title template covers only the financial report.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910416616.XA CN110287458B (en) | 2019-05-20 | 2019-05-20 | Automatic annual report text title labeling system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287458A (en) | 2019-09-27 |
CN110287458B (en) | 2023-05-02 |
Family
ID=68002115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910416616.XA Active CN110287458B (en) | 2019-05-20 | 2019-05-20 | Automatic annual newspaper text title labeling system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287458B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026849B (en) * | 2019-12-17 | 2023-09-19 | 北京百度网讯科技有限公司 | Data processing method and device |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102780647A (en) * | 2012-07-21 | 2012-11-14 | 上海量明科技发展有限公司 | Method, client and system for implementing mind map function by instant messaging tool |
CN105045847A (en) * | 2015-07-01 | 2015-11-11 | 广州市万隆证券咨询顾问有限公司 | Method for extracting Chinese institutional unit name from text information |
CN105809458A (en) * | 2014-12-29 | 2016-07-27 | 苏宁云商集团股份有限公司 | Advertisement accurate delivery method and system in e-commerce site |
CN109101538A (en) * | 2018-06-29 | 2018-12-28 | 中译语通科技股份有限公司 | A kind of entity abstracting method and system towards Chinese patent text |
CN109766524A (en) * | 2018-12-28 | 2019-05-17 | 重庆邮电大学 | A kind of merger & reorganization class notice information abstracting method and system |
Also Published As
Publication number | Publication date |
---|---|
CN110287458A (en) | 2019-09-27 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
CN114610515B (en) | Multi-feature log anomaly detection method and system based on log full semantics | |
US11756323B2 (en) | Method of automatically recognizing and classifying design information in imaged PID drawing and method of automatically creating intelligent PID drawing using design information stored in database | |
JP3842006B2 (en) | Form classification device, form classification method, and computer-readable recording medium storing a program for causing a computer to execute these methods | |
CN109145260B (en) | Automatic text information extraction method | |
CN107122342B (en) | Text code recognition method and device | |
CN110175334B (en) | Text knowledge extraction system and method based on custom knowledge slot structure | |
CN111581345A (en) | Document level event extraction method and device | |
CN107169321B (en) | Program plagiarism detection method and system based on combination of attribute counting and structure measurement technology | |
CN110765231A (en) | Chapter event extraction method based on common-finger fusion | |
CN110287458B (en) | Automatic annual report text title labeling system | |
CN112257413A (en) | Address parameter processing method and related equipment | |
CN114154484B (en) | Construction professional term library intelligent construction method based on mixed depth semantic mining | |
CN110795607A (en) | Equipment guarantee data matching method and system based on multi-stage similarity calculation | |
CN107133201B (en) | Hot spot information acquisition method and device based on text code recognition | |
CN111026743B (en) | Rail transit engineering project structure data standardization method | |
CN111754352A (en) | Method, device, equipment and storage medium for judging correctness of viewpoint statement | |
CN115687790B (en) | Advertisement pushing method and system based on big data and cloud platform | |
CN111291535A (en) | Script processing method and device, electronic equipment and computer readable storage medium | |
CN108073678A (en) | Applied to document analyzing and processing method, system and the device in big data analysis | |
CN114595661B (en) | Method, apparatus, and medium for reviewing bid document | |
CN116522872A (en) | Similarity calculation-based metadata field Chinese name completion method, storage medium and system | |
CN113609864B (en) | Text semantic recognition processing system and method based on industrial control system | |
CN114611515B (en) | Method and system for identifying enterprise actual control person based on enterprise public opinion information | |
CN109993381B (en) | Demand management application method, device, equipment and medium based on knowledge graph | |
CN115294593A (en) | Image information extraction method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||