CN110287458A - Automatic title labeling system for annual report text
- Publication number: CN110287458A
- Application number: CN201910416616.XA
- Authority: CN (China)
- Prior art keywords: title, level, mark, template, report text
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/103—Formatting, i.e. changing of presentation of documents
- G06F40/117—Tagging; Marking up; Designating a block; Setting of attributes
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/166—Editing, e.g. inserting or deleting
- G06F40/186—Templates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/12—Accounting
- G06Q40/125—Finance or payroll
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Health & Medical Sciences (AREA)
- Computational Linguistics (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Business, Economics & Management (AREA)
- Health & Medical Sciences (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Data Mining & Analysis (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Technology Law (AREA)
- Marketing (AREA)
- Economics (AREA)
- Development Economics (AREA)
- Probability & Statistics with Applications (AREA)
- Machine Translation (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses an automatic title labeling system for annual report text and relates to the technical field of annual report title labeling. The method is as follows: A. titles of different levels, namely first-level and second-level titles, are labeled in two passes; B. first-level titles are matched: in the first pass, titles that exactly match the title template are labeled M, titles whose similarity to a template title reaches a threshold are labeled S, and cross-level labels are filtered out; in the second pass, the titles labeled by similarity are matched a second time. The invention recognizes the layout of financial report text with machine vision and combines machine vision with rule-based and statistical text extraction, which addresses the difficulty of labeling titles accurately.
Description
Technical field
The present invention relates to the technical field of annual report title labeling, and specifically to an automatic title labeling system for annual report text.
Background art
Financial report text is strictly standardized and carefully structured, and its chapters and paragraphs carry a wealth of disclosure information. One approach is to use the font styles in the PDF as features for extracting the titles of a financial report. Analysis shows that deep parsing of the PDF format can recover the Chinese text of a PDF. However, the PDFs filed by companies are not subject to a strict, uniform template governing format details such as fonts, so even when the font of each title level in a PDF is known, it is difficult to consolidate this font information into a single extraction rule.
When people look at a financial report, they can usually tell directly, from prior knowledge, which text is a title and which is body text. Likewise, when reading an annual report, people can infer the structure and summary content of the PDF financial report from the standardized disclosure pattern of annual reports.
Inspired by this, and drawing on the principles of mathematical morphology, the inventors convert the vision-based problem of recognizing the annual report summary into a filtering operation based on mathematical morphology. Considering the actual operating environment, the invention uses machine vision to recognize the layout of financial report text and combines machine vision with rule-based and statistical text extraction, which addresses the difficulty of labeling titles accurately.
Summary of the invention
(1) Technical problem to be solved
In view of the deficiencies of the prior art, the present invention provides an automatic title labeling system for annual report text, which addresses the difficulty of labeling titles accurately.
(2) Technical solution
To achieve the above object, the present invention is realized by the following technical solution: an automatic title labeling system for annual report text, the method of which is as follows:
A. Titles of different levels, namely first-level titles and second-level titles, are labeled in two passes.
B. First-level titles are matched. In the first pass, titles that exactly match the title template are labeled M, titles whose similarity to a template title reaches the threshold are labeled S, and cross-level labels are filtered out. In the second pass, the titles labeled by similarity are matched a second time.
C. If a title has second-level titles, the degree of match between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated and, using the characteristics of financial reports, it is decided whether the title finally receives a label. If a title has no second-level titles, its neighboring previous and next titles are matched against the corresponding first-level titles in the template, and whether they match exactly or reach a certain similarity value decides whether the title finally receives a label.
Preferably, the specific steps of the first labeling pass in step B are:
Step 1: titles in the financial report text that are identical to a template title are labeled M.
Step 2: for each title that has no identical title in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S.
Step 3: after step 2, some titles of a different level may have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the style of the titles.
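Steps 1 and 2 of the first pass can be sketched as follows. This is a minimal illustration, not the patented implementation: the patent does not name its similarity measure here, so Python's `difflib.SequenceMatcher` stands in for it, and the function names, the English sample titles and the 0.6 threshold are assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1]; a stand-in for the patent's measure."""
    return SequenceMatcher(None, a, b).ratio()


def first_pass(titles, template, threshold=0.6):
    """Label each title M (exact template match) or S (best similar match).

    Returns a list of (title, label, matched_template_title) tuples.
    The threshold value is an assumption, not taken from the patent.
    """
    template_set = set(template)
    results = []
    for title in titles:
        if title in template_set:
            # Step 1: identical to a template title.
            results.append((title, "M", title))
            continue
        # Step 2: pick the most similar template title.
        best = max(template, key=lambda t: similarity(title, t))
        if similarity(title, best) >= threshold:
            results.append((title, "S", best))
        else:
            results.append((title, None, None))
    return results


template = ["Balance sheet", "Income statement", "Cash flow statement"]
labeled = first_pass(
    ["Balance sheet", "Consolidated income statement", "Audit opinion"], template
)
```

Titles labeled S are only candidates at this point; the second pass decides whether they are promoted to M or lose their label.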
Preferably, the second labeling pass in step B calculates similarity to screen out, among the titles labeled S, those that do not meet the requirements.
Preferably, the specific steps of the second labeling pass are:
Step 1: read the content of a title labeled S.
Step 2: obtain total, the number of second-level titles of the template title that the current title corresponds to.
Step 3: if total equals 0, check whether the first-level titles surrounding the current title exist in the template and are located at similar positions. If so, delete the S label and add an M label, deeming the title to meet the matching requirement.
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding template title. If the similarity reaches the threshold, delete the S label and add an M label; otherwise only delete the S label.
Step 4: obtain the final title labeling result.
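The decision in step 3 can be sketched as below. This is one possible reading under stated assumptions, not the patent's implementation: the template is modeled as an ordered dict from first-level title to its list of second-level titles, `difflib.SequenceMatcher` stands in for the unspecified similarity measure, "similar position" is interpreted as "the neighboring title in the report resembles the neighboring title in the template", and the names `second_pass`, `sim_threshold` and `hit_ratio` are hypothetical.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()


def neighbor_ok(report_title, template_title, threshold):
    """True when both neighbors exist and are similar enough (exact match gives 1.0)."""
    return (
        report_title is not None
        and template_title is not None
        and similarity(report_title, template_title) >= threshold
    )


def second_pass(entry, template, sim_threshold=0.6, hit_ratio=0.5):
    """Return 'M' if an S-labeled title is confirmed, else None (S label removed).

    entry: dict with keys 'matched' (template title found in the first pass),
    'subs' (second-level titles found in the report), 'prev' and 'next'
    (neighboring first-level titles in the report, or None).
    """
    order = list(template)  # template titles in disclosure order
    template_subs = template[entry["matched"]]
    total = len(template_subs)
    if total == 0:
        # No second-level titles: fall back on the surrounding first-level titles.
        i = order.index(entry["matched"])
        prev_t = order[i - 1] if i > 0 else None
        next_t = order[i + 1] if i + 1 < len(order) else None
        ok = neighbor_ok(entry["prev"], prev_t, sim_threshold) or neighbor_ok(
            entry["next"], next_t, sim_threshold
        )
    else:
        # Compare the report's second-level titles with all template second-level titles.
        hits = sum(
            1
            for s in entry["subs"]
            if any(similarity(s, t) >= sim_threshold for t in template_subs)
        )
        ok = bool(entry["subs"]) and hits / len(entry["subs"]) >= hit_ratio
    return "M" if ok else None


template = {
    "Company profile": [],
    "Accounting policies": ["Revenue recognition", "Inventories"],
    "Taxes": [],
}
no_subs = {"matched": "Taxes", "subs": [], "prev": "Accounting policies", "next": None}
good_subs = {
    "matched": "Accounting policies",
    "subs": ["Revenue recognition", "Inventory"],
    "prev": None,
    "next": None,
}
bad_subs = {"matched": "Accounting policies", "subs": ["Goodwill"], "prev": None, "next": None}
```

In either branch the S label disappears; the only question the pass answers is whether an M label replaces it.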
Preferably, the title template covers only the financial report.
(3) Beneficial effects
The present invention provides an automatic title labeling system for annual report text, with the following beneficial effects:
(1) The system proposes a title labeling method, based on text similarity calculation, that takes the contextual information of titles into account. Annual report titles are short texts (for example "tax items", a title of only one word), so it is difficult to label them accurately by cosine similarity alone. Financial report disclosures, however, usually follow a well-defined context and a standardized disclosure order. Taking advantage of this, the method brings the context of an annual report title into consideration before calculating the similarity between the title and the lines of text in the template, so that the title can be mapped accurately to the corresponding template row and the annual report can be labeled with the template. Labeling titles with contextual information can greatly improve the coverage and accuracy of labeling.
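The weakness of cosine similarity on one-word titles can be illustrated with a small sketch. The bag-of-words tokenization and the English sample titles are assumptions chosen for illustration; the patent does not specify how its vectors are built.

```python
import math
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between bag-of-words vectors of two titles."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = math.sqrt(sum(c * c for c in va.values())) * math.sqrt(
        sum(c * c for c in vb.values())
    )
    return dot / norm if norm else 0.0


# A one-word title scores identically against two different template rows,
# so cosine similarity alone cannot choose between them.
score_payable = cosine("Taxes", "Taxes payable")
score_deferred = cosine("Taxes", "Deferred taxes")
```

Both scores equal 1/sqrt(2), so a tie-breaker is needed; the surrounding titles and the disclosure order described above provide one.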
Description of the drawings
Fig. 1 is a flow chart of the preliminary title matching in the first labeling pass of the present invention;
Fig. 2 is a flow chart of screening part of the title labels in the second labeling pass of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the present invention without creative effort fall within the protection scope of the present invention.
Referring to Figs. 1-2, an embodiment of the present invention provides the following technical solution: an automatic title labeling system for annual report text, the method of which is as follows:
A. Titles of different levels, namely first-level titles and second-level titles, are labeled in two passes;
B. First-level titles are matched. In the first pass, titles that exactly match the title template are labeled M, titles whose similarity to a template title reaches the threshold are labeled S, and cross-level labels are filtered out. In the second pass, the titles labeled by similarity are matched a second time;
C. If a title has second-level titles, the degree of match between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated and, using the characteristics of financial reports, it is decided whether the title finally receives a label. If a title has no second-level titles, its neighboring previous and next titles are matched against the corresponding first-level titles in the template, and whether they match exactly or reach a certain similarity value decides whether the title finally receives a label;
Further, the specific steps of the first labeling pass in step B are:
Step 1: titles in the financial report text that are identical to a template title are labeled M;
Step 2: for each title that has no identical title in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S;
Step 3: after step 2, some titles of a different level may have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the style of the titles;
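Step 3 can be sketched as follows, as a hedged illustration rather than the patented implementation. It assumes each title carries a style tuple of font name and font size, which the patent does not specify, and it assumes that the most frequent style among M-labeled titles is taken as the title style; `dominant_style`, `drop_cross_level` and the dict keys are hypothetical names.

```python
from collections import Counter


def dominant_style(labeled):
    """Most frequent style among titles labeled M, or None if there are none."""
    styles = Counter(item["style"] for item in labeled if item["label"] == "M")
    return styles.most_common(1)[0][0] if styles else None


def drop_cross_level(labeled):
    """Clear the label of every title whose style differs from the dominant M style."""
    style = dominant_style(labeled)
    if style is None:
        return labeled
    return [
        dict(item, label=item["label"] if item["style"] == style else None)
        for item in labeled
    ]


labeled = [
    {"title": "Balance sheet", "label": "M", "style": ("SimHei", 14)},
    {"title": "Income statement", "label": "M", "style": ("SimHei", 14)},
    {"title": "Inventories", "label": "S", "style": ("SimSun", 12)},
    {"title": "Cash flow statements", "label": "S", "style": ("SimHei", 14)},
]
cleaned = drop_cross_level(labeled)
```

Here "Inventories" loses its S label because its style differs from the style shared by the M-labeled titles, which suggests it belongs to a different title level.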
Further, the second labeling pass in step B calculates similarity to screen out, among the titles labeled S, those that do not meet the requirements;
Further, the specific steps of the second labeling pass are:
Step 1: read the content of a title labeled S;
Step 2: obtain total, the number of second-level titles of the template title that the current title corresponds to;
Step 3: if total equals 0, check whether the first-level titles surrounding the current title exist in the template and are located at similar positions. If so, delete the S label and add an M label, deeming the title to meet the matching requirement;
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding template title. If the similarity reaches the threshold, delete the S label and add an M label; otherwise only delete the S label;
Step 4: obtain the final title labeling result;
Further, the title template covers only the financial report. The present invention operates mainly on the "financial report" module of annual report analysis, so the title template covers only that module. In an annual report, "financial report" is itself a first-level title, with second-level and third-level titles beneath it. In the title template, a first-level title therefore corresponds to a second-level title under "financial report" and, likewise, a second-level title corresponds to a third-level title under "financial report". The title template was extracted by manual analysis of 100 annual reports from various industries. The final title labeling results show that the template has good coverage and a high title match rate: the match rate for both first-level and second-level titles can reach 80% or more.
To both reflect the logical relationship between titles of different levels and keep the coding practical, the title template is coded by the following rules:
First-level titles: codes start at 10000 and increase in steps of 10000, so mathematically a first-level title code is divisible by 10000;
Second-level titles: codes start at the code of the parent first-level title plus 100 and increase in steps of 100, so mathematically a second-level title code is divisible by 100 but not by 10000.
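The coding rules translate directly into integer arithmetic. The sketch below is illustrative only: `build_codes`, `level_of` and `parent_of` are hypothetical helper names and the sample titles are invented. Note that, under these rules, a first-level title can hold at most 99 second-level titles before its codes would run into the next first-level code.

```python
def build_codes(template):
    """Assign codes to an ordered mapping {first-level title: [second-level titles]}."""
    codes = {}
    for i, (first, subs) in enumerate(template.items(), start=1):
        base = i * 10000  # first-level codes: 10000, 20000, ...
        codes[first] = base
        for j, sub in enumerate(subs, start=1):
            codes[(first, sub)] = base + j * 100  # second-level codes: base + 100, + 200, ...
    return codes


def level_of(code: int) -> int:
    """Recover the title level from the divisibility of its code."""
    if code <= 0:
        raise ValueError("codes are positive")
    if code % 10000 == 0:
        return 1
    if code % 100 == 0:
        return 2
    raise ValueError("not a valid first- or second-level code")


def parent_of(code: int) -> int:
    """Code of the first-level title that a second-level code belongs to."""
    return code - code % 10000


codes = build_codes(
    {
        "Company profile": [],
        "Accounting policies": ["Revenue recognition", "Inventories"],
    }
)
```

The design keeps the parent-child relation recoverable from the number alone, so no separate lookup table is needed to find the first-level title of a second-level entry.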
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variant are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or device.
Although embodiments of the present invention have been shown and described, a person of ordinary skill in the art will understand that various changes, modifications, replacements and variations can be made to these embodiments without departing from the principles and spirit of the present invention; the scope of the present invention is defined by the appended claims.
Claims (5)
1. An automatic title labeling system for annual report text, the method of which is as follows:
A. Titles of different levels, namely first-level titles and second-level titles, are labeled in two passes;
B. First-level titles are matched. In the first pass, titles that exactly match the title template are labeled M, titles whose similarity to a template title reaches the threshold are labeled S, and cross-level labels are filtered out. In the second pass, the titles labeled by similarity are matched a second time;
C. If a title has second-level titles, the degree of match between its second-level titles and all second-level titles of the corresponding first-level title in the template is calculated and, using the characteristics of financial reports, it is decided whether the title finally receives a label. If a title has no second-level titles, its neighboring previous and next titles are matched against the corresponding first-level titles in the template, and whether they match exactly or reach a certain similarity value decides whether the title finally receives a label.
2. The automatic title labeling system for annual report text according to claim 1, characterized in that the specific steps of the first labeling pass in step B are:
Step 1: titles in the financial report text that are identical to a template title are labeled M.
Step 2: for each title that has no identical title in the template, the template title with the highest similarity is found by similarity calculation, and the title is labeled S.
Step 3: after step 2, some titles of a different level may have been labeled. The labels on titles of a different level are removed: the styles of the titles labeled M are counted to obtain the style of the titles.
3. The automatic title labeling system for annual report text according to claim 2, characterized in that the second labeling pass in step B calculates similarity to screen out, among the titles labeled S, those that do not meet the requirements.
4. The automatic title labeling system for annual report text according to claim 3, characterized in that the specific steps of the second labeling pass are:
Step 1: read the content of a title labeled S;
Step 2: obtain total, the number of second-level titles of the template title that the current title corresponds to;
Step 3: if total equals 0, check whether the first-level titles surrounding the current title exist in the template and are located at similar positions. If so, delete the S label and add an M label, deeming the title to meet the matching requirement;
If total is not equal to 0, calculate the similarity between the second-level titles of the current title and all second-level titles of the corresponding template title. If the similarity reaches the threshold, delete the S label and add an M label; otherwise only delete the S label;
Step 4: obtain the final title labeling result.
5. The automatic title labeling system for annual report text according to claim 1, characterized in that the title template covers only the financial report.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910416616.XA CN110287458B (en) | 2019-05-20 | 2019-05-20 | Automatic annual newspaper text title labeling system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110287458A (en) | 2019-09-27 |
CN110287458B (en) | 2023-05-02 |
Family
ID=68002115
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910416616.XA Active CN110287458B (en) | 2019-05-20 | 2019-05-20 | Automatic annual newspaper text title labeling system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110287458B (en) |
Patent Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102780647A (en) * | 2012-07-21 | 2012-11-14 | 上海量明科技发展有限公司 | Method, client and system for implementing mind map function by instant messaging tool |
CN105809458A (en) * | 2014-12-29 | 2016-07-27 | 苏宁云商集团股份有限公司 | Advertisement accurate delivery method and system in e-commerce site |
CN105045847A (en) * | 2015-07-01 | 2015-11-11 | 广州市万隆证券咨询顾问有限公司 | Method for extracting Chinese institutional unit name from text information |
CN109101538A (en) * | 2018-06-29 | 2018-12-28 | 中译语通科技股份有限公司 | A kind of entity abstracting method and system towards Chinese patent text |
CN109766524A (en) * | 2018-12-28 | 2019-05-17 | 重庆邮电大学 | A kind of merger & reorganization class notice information abstracting method and system |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111026849A (en) * | 2019-12-17 | 2020-04-17 | 北京百度网讯科技有限公司 | Data processing method and device |
CN111026849B (en) * | 2019-12-17 | 2023-09-19 | 北京百度网讯科技有限公司 | Data processing method and device |
Also Published As
Publication number | Publication date |
---|---|
CN110287458B (en) | 2023-05-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021042521A1 (en) | Contract automatic generation method, computer device and computer non-volatile storage medium | |
CN102067128A (en) | Data processing device, data processing method, program, and integrated circuit | |
CN104809117A (en) | Video data aggregation processing method, aggregation system and video searching platform | |
CN106897364B (en) | Chinese reference corpus construction method based on events | |
CN103823824A (en) | Method and system for automatically constructing text classification corpus by aid of internet | |
CN108763591A (en) | A kind of webpage context extraction method, device, computer installation and computer readable storage medium | |
CN110941720A (en) | Knowledge base-based specific personnel information error correction method | |
CN106339481A (en) | Chinese compound new-word discovery method based on maximum confidence coefficient | |
CN103473217A (en) | Method and device for extracting keywords from text | |
WO2022089227A1 (en) | Address parameter processing method, and related device | |
CN110378911A (en) | Weakly supervised image, semantic dividing method based on candidate region and neighborhood classification device | |
CN107967494A (en) | A kind of image-region mask method of view-based access control model semantic relation figure | |
CN110134781A (en) | A kind of automatic abstracting method of finance text snippet | |
CN105678244B (en) | A kind of near video search method based on improved edit-distance | |
CN110287458A (en) | A kind of annual report text header automatic marking system | |
CN105955960B (en) | Grounding grid defect text mining method based on semantic frame | |
CN112148735A (en) | Construction method for structured form data knowledge graph | |
CN115113919A (en) | Software scale measurement intelligent informatization system based on BERT model and Web technology | |
CN109543712A (en) | Entity recognition method on temporal dataset | |
CN109325159A (en) | A kind of microblog hot event method for digging | |
CN110516069B (en) | Fasttext-CRF-based quotation metadata extraction method | |
CN114495138A (en) | Intelligent document identification and feature extraction method, device platform and storage medium | |
CN112435108A (en) | Method and system for managing storage of sodium system and electronic equipment | |
CN105426388A (en) | Apparatus for extracting and comparing webpage text | |
CN109657684A (en) | A kind of image, semantic analytic method based on Weakly supervised study |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||